Microsoft is adding UTF-8 support to Azure SQL Database, and it is coming in SQL Server 2019. If you don’t know what UTF-8 is, you might want to read a bit about it, as it can save space when you need to store Unicode characters, particularly if most of your data is plain ASCII. UTF-8 uses a variable number of bytes, from one to four, to encode each character, and it is widely used on the web and in email. My question today is:
Are you looking to store data in UTF-8?
The way this works with SQL Server can be complex. In fact, not everyone thinks it has been done well, as there are some bugs in the initial versions. I’ve watched some people try to work with this, and it is a confusing topic. I thought this might be a simple “SQL Server handles everything” collation, but it doesn’t appear that this will be the case. Calculating the space needed for data isn’t as simple as I expected. Not having to prefix strings with N is nice, but I’m not sure that this will actually work in practice.
I’ve seen some discussions of how to work with this, and it’s complicated. In particular, it’s not easy to tell how much storage you might need for characters. The storage differences can be confusing, as the number of bytes per character depends on the code point range you work with. Most of us know that our users will try to add data we would never expect to our databases, so we might run into issues with not enough space. Those of us specifying the size of our columns now need to know how many bytes are in use, not how many characters.
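To make the bytes-versus-characters point concrete, here is a small Python sketch (not T-SQL, and not anything from Microsoft) that compares how many bytes a few strings take in UTF-8 and in UTF-16, which is roughly what nvarchar stores. The helper names byte_lengths and fits_in_bytes are made up for illustration, and the 10-byte limit is just an assumed stand-in for a column declared as varchar(10) under a UTF-8 collation.

```python
def byte_lengths(text):
    """Return (characters, UTF-8 bytes, UTF-16 bytes) for a string."""
    utf8 = len(text.encode("utf-8"))
    # "utf-16-le" avoids counting the 2-byte byte order mark.
    utf16 = len(text.encode("utf-16-le"))
    return (len(text), utf8, utf16)


def fits_in_bytes(text, max_bytes):
    """Hypothetical check: would this text fit in max_bytes of UTF-8 storage?"""
    return len(text.encode("utf-8")) <= max_bytes


samples = [
    "hello",                  # plain ASCII
    "caf\u00e9",              # Latin text with one accented letter
    "\u65e5\u672c\u8a9e",     # three Japanese characters
    "\U0001F600",             # one emoji outside the Basic Multilingual Plane
]

for s in samples:
    chars, utf8, utf16 = byte_lengths(s)
    print(f"{s!r}: {chars} chars, {utf8} bytes UTF-8, {utf16} bytes UTF-16")

# Assumed limit of 10 bytes, standing in for varchar(10) with a UTF-8 collation.
print(fits_in_bytes("helloworld", 10))                 # 10 ASCII characters
print(fits_in_bytes("\u65e5\u672c\u8a9e\u65e5", 10))   # 4 Japanese characters
```

The ASCII sample takes half the space in UTF-8, while the three Japanese characters grow from 6 bytes to 9, so whether you save space depends on the data. The last check shows the sizing problem: four characters can overflow a 10-byte limit that comfortably holds ten ASCII characters.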
This is likely easy for those of us who work in the English-speaking world and stick with varchar, but maybe not. I’m curious how many of you will attempt to work with UTF-8 (or are waiting for it). It would also be good to hear about any challenges or issues you’ve had working with the encoding in other systems or languages.