Building Test Data

One of the debates I’ve seen over the last few months is about test data in development environments. As I’ve been preparing for and learning more about the GDPR, it seems that many companies are concerned about holding sensitive data in their development systems. I think it’s a valid concern, and I’ve often had to deal with this issue in the past, before any regulation impacted my work.

In one of my early jobs, we stored emails from customers in a table. We also had an email feature for our application. Needless to say, we needed to test that, though we obviously didn’t want to send emails to real customers while testing some feature. I’ve done that, and it usually results in a complaint and some scolding of the development staff. As a result, I learned to ensure that anytime we restored production to our QA system, we ran a script that would either change all emails to invalid values or reset them to something we could use in a test system. In some cases, we’d reset them all to a specific address that we could check to see if the emails were actually sent.
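
A minimal sketch of that kind of post-restore script, in Python with an in-memory SQLite database. The `customers` table, its `id` and `email` columns, and the `scrub_emails` and `demo` functions are hypothetical names chosen for illustration, not the script we actually ran. It uses the reserved `.invalid` top-level domain so the addresses can never be delivered.

```python
import sqlite3


def scrub_emails(conn, catch_all=None):
    """Overwrite every customer email after a restore from production.

    Assumes a hypothetical table customers(id INTEGER PRIMARY KEY, email TEXT).
    """
    if catch_all is not None:
        # Route every message to one mailbox the team can inspect.
        conn.execute("UPDATE customers SET email = ?", (catch_all,))
    else:
        # Build a unique, undeliverable address per row from the primary key.
        conn.execute(
            "UPDATE customers SET email = 'user' || id || '@example.invalid'"
        )
    conn.commit()


def demo(catch_all=None):
    """Create a tiny stand-in for a restored database and scrub it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO customers (email) VALUES (?)",
        [("alice@realcompany.com",), ("bob@realcompany.com",)],
    )
    scrub_emails(conn, catch_all)
    rows = conn.execute("SELECT email FROM customers ORDER BY id").fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Calling `demo()` rewrites each address to an invalid one, while `demo("qa-inbox@example.invalid")` points every row at a single address. In a real QA system that second value would be a mailbox the team controls.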

In talking to many people, I’ve found that they often don’t build test data for development systems because they feel the data isn’t valid. What a developer thinks up, or what might be randomly generated by some utility, often isn’t seen as valuable. Most developers want to see real data, perhaps because they can then better relate their work or a specific feature to the actual live system. Maybe it’s easier to see actual customers, products, accounts, etc. when working with clients or testers, but I do think that, certainly in light of the GDPR and other regulations, there are risks here.

While many people want to just restore production to refresh environments, I do think it’s a poor idea to use actual sensitive data. Even if you trust your developers, there has been no shortage of attacks against development systems, or of lost laptops and files containing production data that were intended for developers. We just don’t secure test and development environments like production, and that means we are making a fundamental error in our work habits.

I’ve gone through different views on this topic over the last couple of decades, and now I want some real data and some not-real data. What I’d really want is a bunch of random data that is close to production, mimicking the shape and skew of production, but without any sensitive data. Then I’d like a set of known cases that represent the types of data we need to ensure work in our system: various transactions and values designed to cover the functional edges that we support.
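
A rough sketch of how those two sets might be combined, using only Python’s standard library. Everything specific here is an assumption for illustration: the order fields, the status weights (imagined as measured from production aggregates, never from individual rows), the Pareto and log-normal parameters, and the hand-written edge cases.

```python
import random

# Hypothetical share of orders per status, as if measured from
# production aggregates. These numbers are invented.
STATUS_WEIGHTS = {
    "shipped": 0.85,
    "pending": 0.10,
    "cancelled": 0.04,
    "refunded": 0.01,
}

# Hand-written known cases covering functional edges (all invented).
EDGE_CASES = [
    {"customer_id": 0, "status": "pending", "amount": 0.00},
    {"customer_id": 0, "status": "refunded", "amount": -19.99},
    {"customer_id": 0, "status": "shipped", "amount": 9999999.99},
]


def build_orders(n, seed=42, max_customer_id=1000):
    """Return n random orders with production-like skew, plus edge cases."""
    rng = random.Random(seed)  # fixed seed keeps test runs repeatable
    statuses = list(STATUS_WEIGHTS)
    weights = list(STATUS_WEIGHTS.values())
    rows = []
    for _ in range(n):
        rows.append({
            # Pareto draw: a few low-numbered customers place most orders.
            "customer_id": min(int(rng.paretovariate(1.2)), max_customer_id),
            # Weighted choice reproduces the skew between statuses.
            "status": rng.choices(statuses, weights=weights, k=1)[0],
            # Log-normal draw: many small amounts and a long tail of large ones.
            "amount": round(rng.lognormvariate(3.0, 1.0), 2),
        })
    # Append copies of the known cases so callers cannot mutate the originals.
    rows.extend(dict(case) for case in EDGE_CASES)
    return rows
```

No row in this set is derived from a real customer, so it can live alongside the code in version control, and the list of edge cases can grow as new functional edges turn up.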

Of course, building these sets isn’t always easy, and it’s never going to be done. As long as we write software, we will need to maintain tests and data alongside the code. I do think this is a worthwhile investment in regulated industries, and likely worth doing in all industries. The thing is, it’s not interesting or fun, and it’s likely never to be done in most organizations. I’d like to change that, and I hope you do as well.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (4.3MB) podcast or subscribe to the feed at iTunes and Libsyn.
