Written by: Ronen Ariely
22/10/2019 14:29

Introduction

Azure SQL Database and SQL Server 2019 introduce full support for UTF-8 encoding. This is a game changer for developers, but it comes with a price! Older functions like CHAR() and ASCII(), which were designed for Extended ASCII, return strange results and might lead to unexpected behavior when we use a COLLATE which uses UTF-8 encoding. Moreover! The documentation for the ASCII function is wrong at this time and was not updated to cover the case where the user uses a UTF-8 COLLATE.

In this post I will go over the function ASCII, explain how it was designed to work (as much as I understand it), show the problematic behavior in SQL Server 2019 and Azure SQL Database, and present a full explanation, including some internals-level information, which will help us better understand the topic.

You should read this post and understand what to expect when you use functions like CHAR and ASCII, or you might mistakenly consider the result to be unexpected. In other words, if you know what to expect, then it is expected😀

Note: This post follows the discussion in this thread at MSDN forums.

The question/issue which was raised in the forum was that the function ASCII returns an unexpected result. The OP expected that for the input "á" the function would return the Code Point 160, according to an Extended ASCII table which he found online.

Instead, the returned Code Point was 225.

The discussion about this (simple) question expanded into a general discussion about the ASCII function, but first I will present the short, direct answer to the original question.

The short answer for the original question

Extended ASCII is a character set which has multiple different character encodings. A character encoding is the algorithm which we use to encode the Code Points into Encoded Values (a set of zeros and ones). Each character encoding is represented by a Code Page (like a name for the algorithm). The image which was presented in the forum fits Code Page 437.

Note! In SQL Server we cannot choose the encoding directly. Instead, we configure the COLLATE. The COLLATE is a general parameter which includes multiple properties about how the text is used. One of these properties is the Code Page, i.e. the name of the encoding algorithm.
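
For example, the built-in COLLATIONPROPERTY function exposes the Code Page behind a collation name. A small sketch, using the OP's collation:

SELECT COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage')
GO -- 1252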

The only issue is that the OP used the COLLATE SQL_Latin1_General_CP1_CI_AS, which has Code Page 1252. Under Code Page 1252 this character is supported, but it is not mapped to Code Point 160 as the image above shows; it is mapped to Code Point 225, which is why the OP got Code Point 225.

In order to use Code Page 437 we can use one of the COLLATEs that start with "SQL_Latin1_General_Cp437", for example "SQL_Latin1_General_Cp437_CS_AS_KI_WI".
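
For example, a sketch like the following should return 160, assuming Code Page 437 is available on your machine (see the important note below):

SELECT ASCII('á' COLLATE SQL_Latin1_General_Cp437_CS_AS_KI_WI)
GO -- expected: 160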

Important Note! SQL Server can only support Code Pages that are supported by the underlying operating system. For example, on my laptop I can't use a COLLATE that uses Code Page 437, since it is not supported by my operating system.

So... In short: Extended ASCII maps characters to a specific range of "Code Points" (0-255), but the mapping is done according to the encoding. The function ASCII returns the "Code Point", but the "Code Point" depends on the "Code Page", and in SQL Server the COLLATE together with the data type determine the "Code Page" (and therefore the "Code Point"). Therefore, using a different COLLATE, the function ASCII might return a different result😀.

OK, now we have some more time, so let's extend the discussion about the function ASCII a bit more, and I'll start from the basics...

Preparation

For the sake of the discussion, let's create three databases which are configured with three different COLLATEs.

CREATE DATABASE HebrewDB COLLATE Hebrew_CI_AS
CREATE DATABASE LatinDB  COLLATE SQL_Latin1_General_CP1_CI_AS
CREATE DATABASE UTF8DB   COLLATE Hebrew_100_CS_AS_SC_UTF8
GO
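
As a quick sanity check (a small sketch using COLLATIONPROPERTY and sys.databases), we can confirm the Code Page behind each of the three databases:

SELECT [name], collation_name, COLLATIONPROPERTY(collation_name, 'CodePage') AS CodePage
FROM sys.databases
WHERE [name] IN ('HebrewDB', 'LatinDB', 'UTF8DB')
GO -- 1255 (Hebrew), 1252 (Latin1), 65001 (UTF-8)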

ASCII function - Documentation is wrong

Issue one

The documentation speaks about ASCII when it actually means Extended ASCII.

As I learned while writing this post, using the term "Extended ASCII" every time is long and annoying. It is much faster to write ASCII instead. In fact, in SQL Server when we speak about ASCII we probably mean Extended ASCII. For the sake of this post I will do the same and use the short term ASCII instead of Extended ASCII.

The problem in the documentation is that such a comment does not exist, and this leads people to think that functions like ASCII and CHAR support only the Code Points 0-127 (ASCII), while in fact they support the range 0-255 (Extended ASCII). By the way, the encoded values of Code Points 0-127 in Extended ASCII are the same as in ASCII, which is why I am OK with using the term ASCII in general.

Here is a quote of the mistake a user made as a result of using the wrong term, as presented in the forum:

"it only returns the ASCII of the char for those whose value is in the 0 to 127 range."

Issue two 

According to the Documentation (at this time), the ASCII function "Returns the ASCII code value of the leftmost character of a character expression".

This is simply wrong!

The function ASCII is supposed to get as input an expression of type CHAR or VARCHAR.

Historically speaking, before SQL Server 2019 these data types only supported Extended ASCII, and in Extended ASCII each character is represented by an Encoded Value of one byte. Therefore, the number of bytes is the same as the number of characters, and the term "leftmost character" means the same as the "first byte" of the encoded string.

For example: if the input string is "abcde", then the first character is "a". The Encoded Value of the entire string is 0x6162636465. This means that the first byte of the Encoded Value is 61 in hexadecimal, which is 97 in decimal. This is the Code Point for the character "a" in Extended ASCII. Therefore, in this case everything seems to fit the documentation: the ASCII function returns the value 97.

Note: you can execute the following query in any database that you want, since the demo uses only characters with Code Points in the range 0-127, which are the same in any Extended ASCII code page.

SELECT ASCII('a'), CONVERT(VARBINARY(MAX), 'abcde')
GO -- 97, 0x6162636465

We can see that using Extended ASCII, the "code value of the leftmost character" is the same as the "code value of the first byte". As much as I understand, this is what leads to the mistake in the documentation once we use Azure SQL Database or SQL Server 2019, since these products support UTF-8 with the data types CHAR/VARCHAR.

While the size of each character in Extended ASCII is one byte, using UTF-8 the encoded size of the characters with Code Points in the range 128-255 is two bytes.

Here is a simple example using the databases with different COLLATEs:

USE HebrewDB
GO
select ASCII('á'), CONVERT(VARBINARY(2),'á')
GO -- 63, 0x3F
 
USE LatinDB -- SQL_Latin1_General_CP1_CI_AS
GO
select ASCII(N'á'), CONVERT(VARBINARY(2),'á')
GO -- 225, 0xE1
 
USE UTF8DB
GO
select ASCII('á'), CONVERT(VARBINARY(2),'á')
GO -- 195, 0xC3A1

Let's explain the results :-)

In the first result we used a Hebrew COLLATE, and this character does not exist in the Hebrew code page. Therefore, the batch parser converted the character "á" into "?" before actually executing the function; the function ASCII was actually executed on the input "?". Since the character was replaced by a question mark, we got the Code Point of the question mark, which is 63.

The second result returns the Code Point 225, since this character is supported (exists) in the COLLATE we used, and this is the Code Point to which the character "á" is mapped in the Extended ASCII character set using this specific encoding - Code Page 1252.

In the third result we use Unicode UTF-8 (this demo works only on Azure SQL Database and SQL Server 2019), and now everything goes to hell😕...

The function ASCII was not designed to support Unicode, but only Extended ASCII input🙄.

In this case, the input which we passed to the function was Unicode even though we did not use N'á', since UTF-8 is the encoding of the COLLATE we used, and there was an implicit CONVERT by the batch parser.
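
We can confirm this by comparing the bytes of the plain literal with the bytes of an explicit NVARCHAR (UTF-16) literal. A small sketch, which requires the UTF8DB database from the preparation section:

USE UTF8DB
GO
SELECT CONVERT(VARBINARY(4), 'á'), CONVERT(VARBINARY(4), N'á')
GO -- 0xC3A1 (UTF-8), 0xE100 (UTF-16 LE)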

How can I explain the issue with the Documentation?

When we use Extended ASCII, each character is always encoded into one byte. The function was originally designed to check only the first byte of the Encoded Value, which in Extended ASCII is the same as the "leftmost character"; but in reality, the function does not check characters, it checks the first byte.

Using UTF-8 encoding, the Encoded Value of the Code Points in the range 128-255 uses two bytes. The function ASCII "treats" the first byte of the string as if it were the entire Encoded Value of the first character, which is the issue here.

In the third example above you can notice that the Encoded Value of the character (using UTF-8) is 0xC3A1. The function ASCII uses only the first byte as input, which is 0xC3. This byte value is 195 in decimal, which is the "Code Point" the function returns.
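
If you actually need the real Code Point of the first character (and not the value of the first byte), then the UNICODE function is a better fit. A small sketch, using the UTF8DB database from above:

USE UTF8DB
GO
SELECT ASCII('á'), UNICODE(N'á')
GO -- 195 (the first byte, 0xC3), 225 (the real Code Point of the character)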

And this is the whole story... The ASCII function does NOT "return the ASCII code value of the leftmost character of a character expression" when we are using a COLLATE that supports UTF-8.

In order to make the Documentation correct I see two options:

1. Make it clear that this statement applies only to Extended ASCII input - I don't like this solution, since people will continue to be confused. After all, not everyone reads my blog (for some mysterious reason 😀).

2. Simply rephrase the statement to describe the way the ASCII function really works behind the scenes, and say something like: "The ASCII function returns the ASCII code value of the first byte of the Encoded Value of a character expression".

Summary and Conclusions

The function ASCII is actually an Extended ASCII function. It returns a Code Point between 0 and 255 from the Extended ASCII range.

Another function which I mentioned is CHAR, which works in the opposite direction: it gets a Code Point as input and returns the character value in Extended ASCII.

Both of these functions return the result according to the Code Page of the COLLATE you use. In other words, the result depends on the COLLATE.

If the input character to the ASCII function is not supported by the COLLATE, then you will get the result 63, which is the Code Point of the question mark "?".

If the COLLATE supports the character, then we will get the Code Point which represents the character in the specific COLLATE which we use.
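
Here is a short sketch which demonstrates both functions, using the databases from the preparation section. Note that under a UTF-8 COLLATE a single byte in the range 128-255 is not a valid Encoded Value, so as much as I understand, CHAR(225) is expected to return NULL there:

USE LatinDB
GO
SELECT CHAR(225), ASCII(CHAR(225))
GO -- á, 225 (Code Page 1252)

USE UTF8DB
GO
SELECT CHAR(225)
GO -- expected: NULL (0xE1 alone is not a valid UTF-8 Encoded Value)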

And with this we can close the post...

I HIGHLY recommend checking the original thread in the forum.

I hope that this was useful,

Have a great day
- Ronen