I saw this article a few months ago, which talks about engineers at Facebook not knowing where their customers’ personal data is stored. The engineers were being questioned in a legal matter, where they were asked to definitively state where all personal data (PII) for any given person was stored by Facebook. Their answer was that they didn’t think anyone in the company would be able to answer that question.
Facebook has been controversial over the years, and plenty of people dislike the way the company conducts business. I noticed no shortage of data people (and many others) commenting on this situation, saying that Facebook should be shut down because they don’t know where data is being stored.
However, I don’t agree. In working with lots of customers, on all aspects of how they handle, process, and manage data, I expect this to be a problem in many organizations. Whether large or small, whether they have few or many software engineers, it is highly possible that there isn’t a good list of where personal data is being stored. As we work with customers to classify data with SQL Data Catalog, that process takes a long time, and very often the system administrators or developers who undertake the task are unaware of all the places where data is stored.
That’s just in relational databases, ignoring all the Excel spreadsheets, text exports, mail merge operations, and uploads to services for mailing, analysis, or something else. Very often the control of personal data is fragmented among groups, with there being few efforts made to coherently manage a customer’s data.
The world has adopted computing at an incredibly fast pace, often by people with little knowledge or forethought of the implications of gathering and processing data. In many cases, probably most cases, there is no overriding strategy. Just like with applications slapped together quickly, we find data being gathered and stored based on the requirements and demands of business people, with no planning for management or archival, and often not even with any security requirements.
I liked the GDPR as a step forward, asking companies to not only handle data appropriately, but also to remove it when not needed, not use it without consent, and be able to keep track of it and delete it when it is no longer necessary. I don’t know that this has been successful, but it has changed handling practices in some organizations, at least in the responsible ones, and many of them have had to track down personal data to delete it. I’m not sure they know where it all is, but I at least assume they know where all of the data about a person is in their various relational stores.
As a technical person, do you know where all data is stored about a customer? Are you sure you know where marketing has been keeping information, and in what other mailing, analysis, reporting, CRM, etc. systems they’ve put data? Any idea how many copies the operations group keeps? Test systems, QA, UAT, and others? What about test data sets: are they sanitized? Perhaps legal or finance has gotten extracts of data to reconcile their systems.
Tracking down all data can be hard, and I’m not surprised Facebook struggles. I would guess engineers in many organizations would have similar answers.
As one who worked for a software vendor in the past, I can speak firsthand about how the data side of app development is often treated with less respect than it deserves. In this case the person at the top believed that he/she could have any programmer versed in T-SQL in a few days (by having them read T-SQL in 24 hours, and no, that’s not a joke or an exaggeration), so why should he/she spend extra $$ on an actual DBA type? Well, eventually that outdated take on SQL had to go away because the DB got so messy (it was designed and updated/expanded by programmers who learned SQL in 24 hours) and the code so slow. They were forced to bring on someone versed in SQL Server and another versed in Oracle, as those two DBs were what the vendor used. I imagine this is not uncommon: in order to save some $$, the decision makers decide to let programmers handle the DB side of things too.
Locating where all bits of personal data are stored isn’t hard in a properly designed and maintained DB, but since many app developers don’t seem to respect the DBA side of things like they should, many DBs end up poorly designed. Sometimes there’s not even a DDF (Data Dictionary File) or the like, and you can’t always count on a naming convention to properly convey what some table.column stores. In our case the vendor did provide us with a DDF, but only 10% or less of it is documented, that is, details what the table and/or the columns in said table are storing. If I was asked to locate where all personal data is, I’m not sure I could 100% guarantee I could locate it all, just most of it.
This isn’t within a db, but within an org. So across dbs, warehouses, etc.
Plus, within a db, who puts PII in notes fields in a table? Or some other repurposed space?
This is hard.
I do. I am very verbose with documentation, and while this is a DB designed by a 3rd party vendor, we are allowed to add our own custom objects. Nothing gets added where even one column’s purpose is not detailed within the object’s creation and/or by using extended properties like the one below from a table. I would hope that anything added to or modified within a DB schema today is fully documented, including whether it’s personal info, considering how important people’s data and its security have become. I can see why this would not have been common any time before 2010 (regarding noting if it’s personal data), but it should be now, and should have been for the last 10 years; at least I think so.
EXEC sp_addextendedproperty N'Schema_History', N'2008/03/04 (John Doe) Per request of Jane Smith, I changed the data type from VARCHAR(32) to NUMERIC(18,2).', 'SCHEMA', N'dbo', 'TABLE', N'UNITTYPE', 'COLUMN', N'sField2';
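Properties added this way can be read back from the catalog views, which gives a rough picture of which columns are documented and which are not. The query below is only a sketch, assuming SQL Server 2005 or later and the standard sys.extended_properties, sys.tables, sys.schemas, and sys.columns views. It does not find personal data by itself; it only shows what someone has written down, and columns with no property come back with NULLs.

```sql
-- List every column in user tables with any extended properties attached.
-- Columns with no extended property appear with NULL property_name/value,
-- which makes the undocumented ones easy to spot.
SELECT s.name  AS schema_name,
       t.name  AS table_name,
       c.name  AS column_name,
       ep.name AS property_name,
       CAST(ep.value AS NVARCHAR(4000)) AS property_value  -- value is sql_variant
FROM sys.tables AS t
JOIN sys.schemas AS s
  ON s.schema_id = t.schema_id
JOIN sys.columns AS c
  ON c.object_id = t.object_id
LEFT JOIN sys.extended_properties AS ep
  ON ep.class = 1                 -- 1 = object or column
 AND ep.major_id = c.object_id    -- the table
 AND ep.minor_id = c.column_id    -- the column within that table
ORDER BY s.name, t.name, c.column_id, ep.name;
```

Adding a filter such as ep.name = N'Schema_History' would narrow the output to a single property, if that is the only one of interest.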
Good luck. The rarity with which I see people documenting things or creating a dictionary, compared with the number of requests for tools to do so, is amazing. Everyone wants to document things, but I see few words being written in the stress of a development cycle.
Kudos for what you do here. I don’t see this often with customers.
Thanks. I am very detail oriented. I know that I will not remember everything I’ve done since starting at my job, let alone why things were done, so for every change/addition I make I detail the why, the who, and the what. This has saved my butt on more than one occasion where a higher-up claimed one thing but I was able to refute him because I had documented in the code the who, what, and why. It shocks me how many believe they can just rely on memory.
Everyone thinks their memory is great. As I age, I know I can’t rely on mine. This is one reason we try to get people to use a VCS and comment on things they change, as well as link to some work item.