There has been a rising tide of data legislation in the last few years that requires organizations, especially private companies, to better protect their data. The GDPR is one of the most well known, with enforcement beginning earlier this year, and I’ve been doing quite a bit of work in relation to this law. There are plenty of other laws, such as California’s CCPA, Australia’s NBD, Japan’s APPI, and more, that we ought to be aware of as data professionals. These laws govern personal data in a variety of ways, and they can change how we process and use portions of the data we store.
Changing our data handling practices is not as simple as it might sound. In fact, it may not be possible for many of us unless we’ve done something in advance: classified our data. We need to understand the impact of the various columns in our tables, the exports to flat files or reports, and even the development processes that make copies of our production databases.
I’ll be honest: classification work is mind-numbingly boring. It almost feels like busy work to me, especially once we get past the obvious tax IDs and birthdays. When we start examining other data, the task feels like it ought to be delegated to junior staff, but many of them lack the experience to make the decisions. More frustrating still, most of them lack the status to get others in the organization to respond to questions, which means the task ultimately falls on more senior people. That also means it often gets done once and then forgotten, leading to out-of-date information.
We don’t like doing classification, but we need to do it. Without some mechanism for deciding whether data can be moved to, or used in, another system, database, or report, we end up ping-ponging between extremes. We assume all data is sensitive and try to lock it all down. That leads to complaints, and to staff circumventing the rules until they appear meaningless. At that point we might give up on controlling data and just trust people. That leads to audit problems, potential data loss from security incidents, and plenty of embarrassment over why we didn’t implement some simple controls.
Then the cycle starts again.
Classifying data is conceptually simple, but keeping the classifications available, up to date, and easy to find is hard for a team of any size. I’ve seen simple solutions that rely on spreadsheets. I’ve seen complex software packages that are expensive and cumbersome to integrate with other applications. Microsoft has started to help with a few changes in SSMS, though this doesn’t seem like a long-term solution; SQL Server 2019 might help. Redgate has spent some time thinking about the issue as well, and we have an early access program now. All of these are partial solutions that might work for some organizations, but not all.
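To make the tooling angle concrete: SQL Server 2019 promotes sensitivity labels to first-class metadata with the ADD SENSITIVITY CLASSIFICATION statement, rather than the extended properties earlier SSMS versions used. A minimal sketch, assuming a hypothetical dbo.Customers table with a TaxID column (the table, column, label, and information type here are examples, not a prescribed taxonomy):

```sql
-- Record a sensitivity label and information type for a column
-- (dbo.Customers.TaxID is a hypothetical example).
ADD SENSITIVITY CLASSIFICATION TO
    dbo.Customers.TaxID
WITH (
    LABEL = 'Highly Confidential',
    INFORMATION_TYPE = 'National ID',
    RANK = CRITICAL
);

-- Review what has been classified in this database
SELECT SCHEMA_NAME(o.schema_id) AS table_schema,
       o.name                   AS table_name,
       c.name                   AS column_name,
       sc.label,
       sc.information_type
FROM sys.sensitivity_classifications AS sc
JOIN sys.objects AS o ON o.object_id = sc.major_id
JOIN sys.columns AS c ON c.object_id = sc.major_id
                     AND c.column_id = sc.minor_id;
```

Because the labels live in a catalog view, they can be queried, audited, and checked in review scripts, which is what makes classification metadata useful rather than a one-time form-filling exercise.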
Ultimately, this is like security: something we ought to be building into our systems from day one. Every proof of concept or prototype ought to classify data from the beginning. We won’t be perfect, and we won’t get every label correct, but if we’re always thinking about the data, we can correct our labels and tighten or loosen how we handle the data accordingly. I’d also like to think that if we label data conservatively early on, we’re less likely to end up mishandling it in ways that lead to accidental data loss.