There is an article about a report in which data was redacted poorly. A consulting firm hired by Frontier Communications wrote a report and redacted lots of information. However, they apparently didn’t do a good job, as all the information they blacked out could be read if the text were copied and pasted elsewhere, something that’s much easier with digital reports than analog ones.
This was a PDF document, and I checked it. On page 25, there is this sentence: “Annual capital expenditures for Frontier’s West Virginia local exchange carrier companies have averaged over XXXXXXXX for the past nine years.” The XXXXXXXX is blacked out, but pasting the sentence into another document shows that it is a $70mm amount. It pays to know your tools.
This certainly isn’t a good technique for hiding information, but it’s not far off from what some people do when trying to mask or obfuscate sensitive production data in development environments. There are lots of cases where people use scripts that change data in one table but not the related data in other tables, or where the changes are incomplete and still let sensitive data leak.
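One way to avoid the "one table but not related data" problem is to mask deterministically, so the same input always produces the same masked value wherever it appears. The following is only a minimal sketch under assumptions of my own: the table layouts, the column names, the `MASKING_KEY` value, and the `mask_value` and `mask_rows` helpers are all hypothetical, and a keyed hash is just one of several possible techniques.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside source control.
MASKING_KEY = b"replace-with-a-secret-key"


def mask_value(value: str, prefix: str = "user") -> str:
    """Return a stable pseudonym: the same input always gives the same output."""
    digest = hmac.new(
        MASKING_KEY, value.lower().encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return f"{prefix}_{digest[:12]}"


def mask_rows(rows, columns):
    """Return copies of the rows with the named columns replaced by pseudonyms."""
    masked = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows are left untouched
        for col in columns:
            if new_row.get(col) is not None:
                new_row[col] = mask_value(str(new_row[col]))
        masked.append(new_row)
    return masked


# Two hypothetical related tables that both carry the customer's email.
customers = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.com"},
]
orders = [
    {"order_id": 10, "customer_email": "ann@example.com"},
    {"order_id": 11, "customer_email": "bob@example.com"},
]

masked_customers = mask_rows(customers, ["email"])
masked_orders = mask_rows(orders, ["customer_email"])
```

Because both tables go through the same keyed function, a masked order still lines up with its masked customer, and neither table keeps the real address. This does nothing about sensitive values hiding in columns nobody thought to list, which is the incomplete-change problem described above.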
To be fair, this is a hard problem, and there are no perfect solutions. Anyone masking data likely needs to take a few passes at the problem, making adjustments over time to try to ensure that the data is protected from unauthorized disclosure. Even then, many researchers have found ways to reconstruct the original data after it has been anonymized.
This is an area that I think is still somewhat immature, with relatively few best practices available for anyone to look at. While I have seen some guidance, I don’t see much on how one could verify that the masking actually worked. I hope we find ways to do better in the future, with more knowledge that helps data professionals confirm they are doing a good job. Otherwise, we won’t be able to protect data the way we want to.
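As a starting point for verification, one simple check is to search the masked copy for any of the original sensitive values, much like pasting the blacked-out text into another document. The sketch below is an assumption-laden illustration, not an established practice: `find_leaks`, the row format, and the sample data are all hypothetical, and it only catches values that survive verbatim. It says nothing about re-identification by inference.

```python
def find_leaks(original_rows, masked_rows, sensitive_columns):
    """Return the original sensitive values that still appear in the masked rows."""
    sensitive = {
        str(row[col]).lower()
        for row in original_rows
        for col in sensitive_columns
        if row.get(col) not in (None, "")
    }
    leaks = set()
    for row in masked_rows:
        for value in row.values():
            text = str(value).lower()
            # Substring match, so very short sensitive values may give false positives.
            for secret in sensitive:
                if secret in text:
                    leaks.add(secret)
    return sorted(leaks)


# Hypothetical data: the email column was masked, but a free-text column was missed.
original = [{"id": 1, "email": "ann@example.com", "notes": "Called ann@example.com"}]
masked_bad = [{"id": 1, "email": "user_3f9a", "notes": "Called ann@example.com"}]
masked_good = [{"id": 1, "email": "user_3f9a", "notes": "Called user_3f9a"}]

leaks_bad = find_leaks(original, masked_bad, ["email"])
leaks_good = find_leaks(original, masked_good, ["email"])
```

Here `leaks_bad` reports the address that slipped through in the notes column, while `leaks_good` comes back empty. An empty result only means no listed value survived verbatim, not that the data is safe.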