Anonymisation Confusion

The GDPR starts getting enforced in a few weeks. It’s been law for a couple of years, but the authorities have given companies time to comply. I know various entities are frantically working towards compliance, as I keep getting updates to Terms of Service. My company is among them, and we are diligently ensuring we can prove that we aren’t violating any rules. That’s good, because I’m sure fines would reduce any bonus we might earn this year.

As I’ve been reading over the law and talking with customers, I’ve learned quite a bit. Redgate Software builds products to help with compliance, and we’re updating guidance and information on how to work with data. As I’ve helped to update information and explain concepts, I keep running into the term “pseudonymisation”. If you listen to the podcast, you’ll probably hear me struggle to pronounce it, but more importantly, I was initially confused about what this actually meant.

The Data Protection site from Ireland has a great description of how this differs from anonymisation. You can read through the document, but anonymisation means that the data can’t somehow be reverse engineered to find the original data. In terms of privacy, an anonymised set of my data wouldn’t allow anyone to determine the data is about me.

If the data is pseudonymised, data about me would be replaced with a token, but there might be other means of discovering that a data set is about me. As an example, in an eCommerce system, you might have an order key in a dev data set that’s copied directly from production. However, my name would be replaced with something else, like Bob Smith. It’s not apparent that it’s my data, but the protection is limited. If the data were anonymised, the order key would be replaced as well to prevent reverse engineering.
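To make the difference concrete, here is a rough Python sketch of the order example. It’s a toy under my own assumptions, not how any particular product does it: the field names (order_key, customer, total) and the two function names are made up for illustration, and real anonymisation has to consider every column that could point back to a person, not just the two shown here.

```python
import secrets

def pseudonymise(orders):
    """Swap customer names for tokens but keep the production order key.

    The returned mapping (name -> token) is the extra information that
    would let someone re-identify the rows, so it must be kept elsewhere.
    """
    tokens = {}
    result = []
    for order in orders:
        name = order["customer"]
        if name not in tokens:
            tokens[name] = "cust-" + secrets.token_hex(4)
        # order_key is copied unchanged, so it still links to production
        result.append({**order, "customer": tokens[name]})
    return result, tokens

def anonymise(orders):
    """Replace both the name and the order key, and keep no mapping.

    Illustrative only: row order and the remaining columns could still
    leak identity in a real data set.
    """
    result = []
    for order in orders:
        result.append({
            "order_key": "ord-" + secrets.token_hex(4),   # new, unrelated key
            "customer": "cust-" + secrets.token_hex(4),   # new, unrelated name
            "total": order["total"],
        })
    return result

orders = [
    {"order_key": 1001, "customer": "Steve Jones", "total": 25.0},
    {"order_key": 1002, "customer": "Steve Jones", "total": 10.0},
]
pseudo_orders, token_map = pseudonymise(orders)
anon_orders = anonymise(orders)
```

In the pseudonymised output, order 1001 is still order 1001, so anyone with access to production can look it up and find me. In the anonymised output there is nothing left to join on.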

Many of us have gotten used to being lax with dev and test data, often just restoring from production. It’s handy, convenient, and allows you to find problems in production or verify changes using known values. The downside of this is that we have poor data security. There is no shortage of data breaches from dev and test systems. Certainly plenty of data has been lost from developer laptops as well. Even if you had your laptop encrypted, there’s no real excuse for using real data in less secure environments.

We need to learn to use anonymised data, and become comfortable with the idea of working on secure data sets. That also means we need the skills to ensure we build good, useful, valid datasets with production-like characteristics.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (4.2MB) podcast or subscribe to the feed at iTunes and Libsyn.

About way0utwest

Editor, SQLServerCentral