I was listening to someone talk about data privacy recently, and the ways that you can protect the sensitive information in your databases. They had a great quote about something you might consider. They said, “The best way to protect data is not hang onto the raw data at all.”
If we don’t have sensitive data, then a loss of that data can’t occur. Hacks won’t expose it, and we can’t accidentally send it out or leave it lying around. There’s a good case to be made that keeping less sensitive data around is a good idea.
For some applications, we can’t avoid keeping sensitive data. Medical databases keep private health information. E-commerce systems likely need financial information. Many of us will definitely have to deal with some sensitive data, and protect it, but we can minimize our struggles.
We often don’t have a good reason for keeping lots of data around. Most queries that users run look at only a small portion of the data. Recent data is often needed, along with some aggregates for older data, but we rarely look at the details of old data. We may even have older data around that we’ve forgotten about and that our users don’t know is available.
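As a rough illustration of keeping recent detail while reducing older rows to aggregates, here is a minimal Python sketch. The row layout (`order_date`, `amount`), the function name `roll_up_old_rows`, and the monthly grain are all assumptions made for the example, not a description of any particular system.

```python
from collections import defaultdict
from datetime import date


def roll_up_old_rows(rows, cutoff):
    """Split rows into recent detail rows and monthly totals for older rows.

    rows   -- iterable of dicts with hypothetical keys "order_date" and "amount"
    cutoff -- rows dated on or after this date are kept in full detail
    """
    recent = []
    monthly_totals = defaultdict(float)
    for row in rows:
        if row["order_date"] >= cutoff:
            recent.append(row)
        else:
            # Older rows contribute only to a (year, month) total;
            # their individual details are not carried forward.
            key = (row["order_date"].year, row["order_date"].month)
            monthly_totals[key] += row["amount"]
    return recent, dict(monthly_totals)
```

In a real database this would more likely be a scheduled job that writes the totals to a summary table and then archives or deletes the detail rows, but the shape of the decision is the same.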
We certainly don’t often need sensitive data in non-production environments. Plenty of people use scripts or tooling to obfuscate, anonymize, generate, or otherwise ensure sensitive data isn’t in unprotected environments. We can archive, or even change, old data to ensure it isn’t a liability. We can even do this in production, preserving metrics, but delinking data from any individual.
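One way to preserve metrics while delinking old data from individuals can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names (`customer_id`, `email`, `order_date`, `amount`), the function name `delink_old_orders`, and the choice of a keyed hash are all hypothetical. Note that a keyed hash is pseudonymization, not full anonymization, since anyone holding the key can recompute the tokens.

```python
import hashlib
import hmac
from datetime import date


def delink_old_orders(orders, cutoff, secret_key):
    """Return a copy of orders with personal fields removed from old rows.

    orders     -- list of dicts with hypothetical keys
                  "customer_id", "email", "order_date", "amount"
    cutoff     -- rows dated before this date are delinked
    secret_key -- bytes used to key the hash; must be stored securely
    """
    result = []
    for order in orders:
        if order["order_date"] >= cutoff:
            result.append(dict(order))  # recent rows are copied unchanged
            continue
        # Keyed hash: the same customer always maps to the same token,
        # so per-customer counts and sums still work without the real id.
        token = hmac.new(
            secret_key, str(order["customer_id"]).encode(), hashlib.sha256
        ).hexdigest()[:16]
        result.append({
            "customer_id": token,
            "email": None,  # direct identifier dropped entirely
            "order_date": order["order_date"],
            "amount": order["amount"],
        })
    return result
```

Totals by date and per-token counts still come out the same after this step, which is the sense in which metrics are preserved. If old rows never need to be grouped by customer, dropping the identifier outright is the safer choice.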
I’ve always been someone who kept more data than necessary, just in case I needed it. Over time, however, I’ve found that the costs and the potential risks just aren’t worth it. Moving forward, archival, anonymization, and other strategies need to be a part of any system I manage.