Remember the N

Another post for me that is simple and hopefully serves as an example for people trying to get started blogging as part of #SQLNewBloggers.

I saw a post recently from someone who was having trouble with Chinese characters being inserted into a SQL Server table. I’m not sure if they were using an application or just SSMS, but they inserted this:

insert into mytable select '<IDC>亮块(DS3或DS4)-清理/维修显像滚筒</IDC>'

And they got this in the table:

<IDC>??(DS3?DS4)-??/??????</IDC>

That’s a problem that I can see, even if I can’t read Chinese.

Someone else posted a note that when you insert, you need to let SQL Server know your string is Unicode. That means prefixing your string with an N'.

I looked in Books Online, and found this note under the nvarchar section:

“Prefix Unicode character string constants with the letter N. Without the N prefix, the string is converted to the default code page of the database. This default code page may not recognize certain characters.”

I think this means that the N' isn’t necessarily required, but it depends on the default code page of your database. For most of us, if we don’t include it, I believe our data gets converted to UTF-16, which might not be what we want.
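Here’s a minimal sketch of the difference, assuming a hypothetical table with an NVARCHAR column and a database whose default collation can’t represent Chinese characters:

```sql
-- Hypothetical table for illustration
CREATE TABLE mytable (val NVARCHAR(100));

-- Without the N prefix, the literal is first interpreted in the
-- database's default code page, so the Chinese characters are lost
INSERT INTO mytable SELECT '<IDC>亮块(DS3或DS4)-清理/维修显像滚筒</IDC>';

-- With the N prefix, the literal is Unicode from the start
INSERT INTO mytable SELECT N'<IDC>亮块(DS3或DS4)-清理/维修显像滚筒</IDC>';

SELECT val FROM mytable;
```

On such a database, the first row comes back full of question marks while the second row keeps the original characters.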

SQLNewBlogger

I ran across the post and spent 5 minutes researching things and looking in BOL. This really took about 5 more minutes to write.

Reference

nvarchar – https://msdn.microsoft.com/en-us/library/ms186939.aspx

About way0utwest

Editor, SQLServerCentral

One Response to Remember the N

  1. Solomon says:

    “I think this means that the N' isn’t necessarily required, but it depends on the default code page of your database. For most of us, if we don’t include it, I believe our data gets converted to UTF-16, which might not be what we want.”

    Hi Steve. You are definitely on the right track here, but the situation is not so simple ;-).

    There are two main factors that determine how to proceed with this:

    1) The datatype of the field being inserted:
    If the datatype is XML or an N-prefixed type, then that field holds only Unicode encoded data. While there are several Unicode encodings (UTF-8; UTF-16, big- or little-endian; UTF-32, big- or little-endian; etc.), SQL Server only stores Unicode data as UTF-16 Little Endian. When dealing with string literals, they need to be prefixed with a capital “N” (lower-case does not work) to indicate that what is inside the single-quotes is, in fact, Unicode data. And just to have it stated, to reduce confusion: there is only one set of characters in Unicode, so a Unicode code point should always be the same symbol regardless of whether it is stored in UTF-8, UTF-16, UTF-32, etc.

    So, if inserting data from a string literal into an XML or N-prefixed type (NCHAR, NVARCHAR, or NTEXT, which is deprecated in favor of NVARCHAR(MAX) but needs to be stated), then you definitely should prefix that literal value with a capital “N”, even if the string only contains code points 0 through 127 (those are the same in nearly all encodings). Always prefixing string literals going into XML and N-prefixed fields and variables is a good practice, since it gets one into the habit of doing it, and it avoids an implicit conversion to NVARCHAR.

    If the destination datatype is CHAR, VARCHAR, or TEXT (deprecated in favor of VARCHAR(MAX) but needs to be stated), then it wouldn’t matter if the string was prefixed with a capital “N”, as the value will still be converted to the code page of the field (code page is a property of the collation of the field), or the code page of the database if storing into a variable (code page is a property of the default collation of the database).
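    Both halves of that first factor can be sketched in a few lines (the variable names are made up, and this assumes a database whose default collation uses a Latin1 code page):

    ```sql
    -- Destination is Unicode: prefix the literal with N, even for plain
    -- ASCII text, to avoid an implicit VARCHAR-to-NVARCHAR conversion
    DECLARE @Uni NVARCHAR(10) = N'מ';
    SELECT @Uni;  -- the character survives

    -- Destination is VARCHAR: the N prefix cannot save you, because the
    -- value is still converted to the code page of the variable
    DECLARE @NonUni VARCHAR(10) = N'מ';
    SELECT @NonUni;  -- '?' when that code page lacks the character
    ```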

    2) The encoding of the source, and the default collation of the database:
    What is the data supposed to be? This isn’t really an issue if the source is Unicode or only contains characters with values between 0 and 127. But if the source is non-Unicode, then the code page of the database might need to match the code page of the script. The reason is that the symbol gets translated into the code page specified by the default collation of the database before anything else is done with the string, and the code page of the DB might not have that character.

    It is important to note that it is the symbol itself that gets translated, not its binary value. This is why we get “?” instead of a mis-translated value, even when the value of the original symbol is between 128 and 255 and maps to a symbol (albeit a different one) in the code page of the DB. Meaning: value 238 (or 0xEE if you prefer hex) is the “ î ” character in Code Page 1252 (Latin1), and it is also the “ מ ” character in Code Page 1255 (Hebrew). If the script uses the “ מ ” character but the default collation of the DB is Latin1 AND the literal is not prefixed with an N (even with the N prefix it is not guaranteed to work, but for a different reason), then it will return “?” instead of “ î ”, since SQL Server tries to find the “ מ ” character in the Latin1 code page (where that character doesn’t exist) instead of finding the equivalent symbol for value 238 in that code page.

    All of this means that everything really needs to match up. Even if one prefixes a string literal with a capital “N”, which retains the value (it actually translates it to a Unicode code point, most likely a value above 255), if the destination field or variable is non-Unicode AND that symbol doesn’t exist in the destination code page, then you will still end up with “?”.

    This is why the recommendation is to simply use Unicode datatypes (XML, NVARCHAR, and NCHAR), as there is no translation: they transport all 1.1+ million code points rather easily between systems regardless of collation settings (assuming SQL Server to either SQL Server or any system that can handle UTF-16 Little Endian).

    The following test shows the points stated above in action:

    DECLARE @Test TABLE
    (
        [Source] NVARCHAR(50),
        [Value] VARCHAR(10) COLLATE SQL_Latin1_General_CP1255_CI_AS
    );

    INSERT INTO @Test ([Source], [Value])
    VALUES (N'ASCII code for mem: מ', CHAR(238)),
           (N'ASCII literal mem: מ', 'מ'),
           (N'Unicode literal mem: מ', N'מ'),
           (N'ASCII literal i with Circumflex: î', 'î'),
           (N'Unicode literal i with Circumflex: î', N'î');

    SELECT [Source],
           [Value],
           ASCII([Value]) AS [ASCII value (238)],
           UNICODE([Value]) AS [Unicode value (1502)],
           CONVERT(VARBINARY(2), [Value]) AS [Byte value (0xEE)]
    FROM @Test;

    You will see different results between running this in [tempdb] vs a new database created with a collation of SQL_Latin1_General_CP1255_CI_AS. The “Unicode literal mem: מ” row works the same in both, since the literal is prefixed with a capital “N” AND the collation of the field contains that symbol. However, neither of the “literal i with Circumflex: î” entries works in either database, because the destination code page cannot hold that symbol.

    I hope that all makes sense :-).

    Take care,
    Solomon..
