It’s no secret that anonymizing data doesn’t always work well. We heard about this when Netflix released their data for people to build algorithms with. Some people were identified by correlating the released data with other information they had entered on the Internet themselves. I know that there are dangers with sharing too much information on the Internet, but people are going to share, and there will only be more services in the future that require data from us.
I ran across a post recently from Microsoft researchers that showed similar issues with other anonymous data sets that contain IP information. A number of logs containing traffic from Bing and Hotmail were analyzed with the intention of identifying particular hosts. Even when the data was anonymized, it was possible to identify hosts with a high degree of accuracy.
You might not think this is a big deal, but as more data is gathered by companies and used for secondary purposes, like analysis, it becomes more likely to be inappropriately released. Which is more secure: a log on a server, or copies of multiple logs on analysts’ laptops? I’d think the former, or at least I’d hope the former. If that’s true, then we should really be anonymizing data as a matter of routine whenever it leaves hardened server machines.
That means we ought to have better algorithms for preventing any identifying information from being retrieved. I would hope that this is an area where research can help, and one that receives a lot of attention in the near future.
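To make the idea of scrubbing a log before it leaves the server a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not the method used in the Microsoft research: it assumes a plain-text log with IPv4 addresses, and the names `SECRET_KEY`, `pseudonymize_ip`, and `scrub_line` are hypothetical. Each address is replaced with a keyed hash (HMAC-SHA256), so analysts can still group requests by host without seeing the real address.

```python
import hashlib
import hmac
import re

# Hypothetical secret for illustration; the assumption is that it stays on the
# hardened server and never travels with the exported logs.
SECRET_KEY = b"replace-with-a-random-secret"

# Simple IPv4 matcher; it does not validate octet ranges and ignores IPv6.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymize_ip(ip: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable token for an IP address using a keyed hash."""
    digest = hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()
    return "ip-" + digest[:16]


def scrub_line(line: str, key: bytes = SECRET_KEY) -> str:
    """Replace every IPv4 address in a log line with its token."""
    return IPV4_PATTERN.sub(lambda m: pseudonymize_ip(m.group(0), key), line)


if __name__ == "__main__":
    print(scrub_line("203.0.113.7 GET /inbox 200"))
```

Keep in mind that this is pseudonymization rather than true anonymization. The same host always maps to the same token, so the kind of correlation described above can still work against the traffic patterns that remain in the log. That is exactly why better algorithms are needed.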