en-UShe-IL
You are here:   Blog
Register   |  Login

Blog Archive:

Maximize
* Can be used in order to search for older blogs Entries

Search in blogs


Blog Categories:

Maximize
* Can be used in order to search for blogs Entries by Categories

Blog Tags:

Maximize
* Can be used in order to search for blogs by keywords

TNWikiSummit


Awared MVP 


 


Microsoft® Community Contributor 


Microsoft® Community Contributor


 Read first, before you you use the blog! Maximize
יונ19

Written by: ronen ariely
19/06/2016 04:47 RssIcon

Introduction

Since this question keep raising in the forums, and we do have news regarding SQL Server 2016, I found that this is a good time to re-write about these confusing concepts: character set, character code point, and character encoding.

Not! SQL Server 2016 and above added support for UTF-8 in some tasks, SQL Server 2012 and above added support for UTF-16 codepoint (range 0 through 0x10FFFF), and earlier version used only UCS-2 codepoint (range 0 through 0xFFFF) as Unicode.

I hope this is useful :-)
And as always, you can send me feedback and comments on Facebook.

character set, character code point, and character encoding

Text strings are created from characters. A collection of characters called a "character set". A character set, also known as a code-page or charset. It is an ordered set of characters in which each character is mapped to a numeric index, called a "code-point".

Character Encoding is the mechanism to map the code points into bytes. The code points are stored in the computer as one or more bytes (collation of 0 and 1). The visual representation of characters is configured by the font. A font is a collection of glyph definitions used to display characters.

Every encoding has a particular charset associated with it, but there can be more than one encoding for a given charset. The Unicode character set represents almost all the languages. There are multiple encoding for Unicode, for example UCS-2, UTF-8, UTF-16, UTF-32. Each of these has advantages for particular applications or machine architectures.

Example of Charset: Hebrew characters, English characters, etc. Unicode (a character set that include multiple languages).

Examples of encoding: ASCII, extended ASCII - There are many extended ASCII encoding for different languages, UCS-2, UTF-8, UTF-16, UTF-32.

The meaning of each code point can be different in every encoding. In order to correctly interpret and display text data (sequences of characters) using fonts, hardware and software that reads or receives the text must use the specific encoding that applies to it. Using wrong character encoding will lead to a wrong mapping of characters to/from the code point that is stored. That mean that a wrong code point will be use to represent the character and a wrong character will be used.

Homan <-> Computer

when we input text using a keyboard, each key or a combination of keys represent a character. The character encoding maps the characters to unique identifier or number named code point, which stored as bytes.

When we read characters we actually see fonts. Behind the scene the data stored as code points using a collection of zero and ones (bytes). The character encoding maps the code points to characters, which displayed using font.

More information

Example of several characters encoded as ASCII:
* The table represent ASCII mapping between the Code points and the Characters.

Code points

Characters

BIN
Computer uses binary

DEC

OCT

HEX

Font = Glyph
The way we display the char

01000000

64

100

40

@

01000001

65

101

41

A

01000010

66

102

42

B

01000011

67

103

43

C

01000100

68

104

44

D

01000101

69

105

45

E

01000110

70

106

46

F

       

Unicode Encoding

Minimum Bytes per Character

Maximum Bytes per Character

UTF-32

4

4

UTF-16

2

4

UTF-8

1

4

UCS-2

2

2

* By saying that the computer stores the code points, I do not mean that it necessarily stored as it is.  Different systems can store the data in different formats. For example, we can compress the data. Yet, the use of the data will be as explained here.

SQL Server

Microsoft SQL Server uses types like NCHAR, and NVARCHAR for example, in order to store Unicode data. SQL Server, uses UCS-2 encoding, for Unicode charset. UCS-2 uses 16–bit (two–byte) representation of each Unicode character. As such, it can reference only 65,535 code-points.

Any UTF-8 data sent from the client to the server via GET or POST is also converted to UCS-2 automatically.

You can easily store UTF-8 encoded text in the database, but then you don't store it as text, but as binary data (BINARY, VARBINARY, etc.).

You can manage UTF-8 encoded data using SQLCLR UDF. Microsoft provides example in this link: http://technet.microsoft.com/en-us/library/ms160893(v=sql.90).aspx

Collation

A collation specifies the bit patterns that represent each character in a data set. Collations also determine the rules that sort and compare data (for example case and accent sensitivity). For non-Unicode columns, the collation setting specifies the code page for the data and which characters can be represented. Make sure that you use the correct collate.

-- get a list of all the collations supported by SQL Server
select *
FROM fn_helpcollations()
GO


Support for import and export of UTF-8 data – New from SQL Server 2016!

Data stored in a UTF-8 encoded file can now be imported into SQL Server and exported from SQL Server into a UTF-8 encoded file, using BULK INSERT T-SQL command and bcp command line utility. You can read more in this blog.

* If you want to understand how SQL Server works behind the scene, then you must understand these expirations. In order to work with DBCC PAGE for example, you must be able to read the binary data of specific character. This binary data is encoded into UCS-2 encoding, for Unicode data like NCHAR, NVARCHAR for example.

* Unicode is a standard for mapping code points to characters of all the languages of the world. There is no need for different code pages to handle different sets of characters. If you store character data that reflects multiple languages, always use Unicode data types.

* Since the browser need to decode the binary data, it needs to know which Encoding we are using. This is done by adding a meta data to the page. For example: . The name "charset" is misused and confuses a lot of developers! This is not a character set but character set encoding.

Resources and More information

>> Character encodings for beginners
https://www.w3.org/International/questions/qa-what-is-encoding

>> ASCII Code - The extended ASCII table
http://www.ascii-code.com/

>> A tutorial on character code issues
https://www.cs.tut.fi/~jkorpela/chars.html

>> Character encoding
https://en.wikipedia.org/wiki/Character_encoding

>> Code page
https://en.wikipedia.org/wiki/Code_page

>> Description of storing UTF-8 data in SQL Server
https://support.microsoft.com/en-us/kb/232580

>> Add support for storing UTF-8 natively in SQL Server – Open connect from 2008! Microsoft respond at 2008 was " We are considering adding support for UTF8 in the next version of SQL Server."
https://connect.microsoft.com/SQLServer/feedback/details/362867/add-support-for-storing-utf-8-natively-in-sql-server

>> The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky
http://www.joelonsoftware.com/articles/Unicode.html

>> Oracle: Selecting and Configuring Character Sets and Language Input and Output
https://docs.oracle.com/cd/E13292_01/pt849pbr0/eng/psbooks/tgbl/chapter.htm?File=tgbl/htm/tgbl07.htm

>> Code Page Identifiers
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

>> Collation and Unicode Support
https://msdn.microsoft.com/en-us/library/ms143726.aspx

>> Technical Overview: SQL Server 2016 Release Candidate 0
https://blogs.technet.microsoft.com/dataplatforminsider/2016/03/07/technical-overview-sql-server-2016-release-candidate-0/