Do You Deal with UTF-8?

Microsoft is adding UTF-8 support in Azure SQL Database, and it will be coming in SQL Server 2019. If you don’t know what UTF-8 is, it’s worth reading a bit about it, as it can save space when you need to store Unicode characters. The encoding uses a variable number of bytes per character and is widely used on the web and in email. My question today is:

Are you looking to store data in UTF-8?

The way this works in SQL Server can be complex. Not everyone thinks it has been done well, and there are some bugs in the initial versions. Watching people try to work with it, I’ve found it to be a confusing and complex topic. I thought this might be a simple “SQL Server handles everything” collation, but that doesn’t appear to be the case. Calculating the space needed for data isn’t as simple as I expected. Not having to prefix strings with N is nice, but I’m not sure that will actually work in practice.
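To ground this a bit, here is a minimal sketch of what opting in might look like. The collation name is one of the new _UTF8 collations from the previews (Latin1_General_100_CI_AS_SC_UTF8), and the table name is just a hypothetical example; UTF-8 is something you choose per column, or as the database default, rather than a server-wide switch.

-- Hypothetical demo table: UTF-8 is requested through the column collation
CREATE TABLE dbo.UTF8Demo
(
    Name varchar(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);

-- The database default collation can also be changed, so new varchar
-- columns pick up UTF-8 without a per-column COLLATE clause:
-- ALTER DATABASE CURRENT COLLATE Latin1_General_100_CI_AS_SC_UTF8;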

I’ve seen some discussions of how to work with this, and it’s complicated. It’s not easy to tell how much storage you might need, and the differences can be confusing depending on the code point range you work with. Most of us know that our users will try to add data we would never expect to our database, so we might run into issues with not enough space. For those of us specifying the size of our columns, we now need to know how many bytes are in use, not how many characters.
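As a rough sketch of the sizing issue, using the same assumed collation as above: a character costs one byte in the ASCII range, two bytes for most accented Latin characters, and three or more beyond that, while varchar(n) still means n bytes, so a “ten-character” column only holds three of the three-byte CJK characters.

DECLARE @t TABLE (val varchar(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8);

INSERT INTO @t VALUES (N'a'), (N'é'), (N'宿');   -- N prefix so the literals arrive intact
INSERT INTO @t VALUES (N'宿宿宿');               -- 9 bytes: fits in varchar(10)
-- INSERT INTO @t VALUES (N'宿宿宿宿');          -- 12 bytes: "String or binary data would be truncated"

SELECT val, LEN(val) AS chars, DATALENGTH(val) AS bytes
FROM @t;
-- 'a' -> 1 char, 1 byte; 'é' -> 1 char, 2 bytes; '宿' -> 1 char, 3 bytes; '宿宿宿' -> 3 chars, 9 bytes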

Likely this is easy for those of us who work in the English-speaking world and stick with varchar, but maybe not. I’m curious today how many of you will attempt to work with UTF-8 (or are waiting for it). It would also be good to know about any challenges or issues you’ve had working with the encoding in other systems or languages.

Steve Jones

Listen to the podcast at Libsyn, Stitcher or iTunes.


3 Responses to Do You Deal with UTF-8?

  1. aaronbertrand says:

    In fact I found that you still have to prefix certain strings with N, since the string literal is interpreted before the value ever gets to the column collation. Try it, doesn’t work inserting into a column with a UTF8 collation, or even applying a COLLATE directly to the string literal:

    SELECT '宿' COLLATE Chinese_Simplified_Stroke_Order_100_CI_AI_SC_UTF8,
           N'宿' COLLATE Chinese_Simplified_Stroke_Order_100_CI_AI_SC_UTF8;


  2. way0utwest says:

    Interesting. Perhaps even less of a reason to use UTF-8?


    • aaronbertrand says:

      Well, I think the point is that if you were leaning toward UTF-8 simply to avoid the pesky N… even if you could eliminate it, I don’t think the other benefits justify it just yet.

