Quite a few of the bugs we see in production systems come from data that isn’t handled well. Perhaps the developer never considered this data, or another bug let in data that should never have been recorded. These are often NULL values, but they could be other data values that are far out of the ordinary.
Where do we draw the line for edge cases? Is it anything that doesn’t fit 95% of the data range? I see this number used in many fields, often manufacturing and other “physical endeavors”. Is it the 80% rule, where we ensure 80% of data cases are covered, and the other 20% get special handling?
It’s an interesting thought because drawing this line helps us decide what level of data we need in our dev and test environments. We need enough data to represent what exists in production, but not much more. The less data we have, the faster everything moves, with much less friction in setting up, resetting, and moving these databases around.
However, the more bugs that slip through, the more data we might need to add to our development environments to mimic what is in production. Often we have used copies of production data, but there are plenty of issues with this. First, we often have less security in non-production environments, and no shortage of data breaches comes from these systems. Therefore we might need to apply masking/obfuscation/pseudonymization to values. Second, production databases are growing larger, often over 1 TB. While storage and bandwidth are cheap, they aren’t free, and moving around 1 TB of data regularly, or even restoring it, can present resource challenges.
My preference is a representative set of data from production, masked and without PII, along with some randomness that might catch edge cases before we deploy changes to production. With that in mind, what’s the edge case? I think I’d lean towards the 95% value, but I’d be ready to lower that if we discover many bugs.
How many is “many” bugs? I might apply the same standard. If more than 5% of bugs filed are data issues, we need better dev/test data.
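To make the idea concrete, here is a rough Python sketch of how a representative dev/test set might be built and how the 5% bug standard could be checked. Everything in it is an assumption for illustration: the function names, the rows-as-dictionaries shape, the `email` PII field, and the salted hash used for masking are all hypothetical, and a real masking tool would do far more than this.

```python
import hashlib
import math
import random


def central_range(values, coverage=0.95):
    """Return (low, high) bounds that enclose roughly `coverage` of the values."""
    if not values:
        raise ValueError("need at least one non-NULL value")
    ordered = sorted(values)
    tail = (1 - coverage) / 2
    last = len(ordered) - 1
    return ordered[math.floor(tail * last)], ordered[math.ceil((1 - tail) * last)]


def mask_value(value, salt="dev-only-salt"):
    """Replace a value with a short salted hash (pseudonymization, not encryption)."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]


def build_dev_sample(rows, column, sample_size, edge_size,
                     pii_fields=("email",), coverage=0.95, seed=None):
    """Randomly sample typical rows, add some edge-case rows, and mask PII fields."""
    rng = random.Random(seed)
    low, high = central_range(
        [r[column] for r in rows if r[column] is not None], coverage
    )

    def is_edge(row):
        # NULLs and values outside the central range count as edge cases here.
        return row[column] is None or not (low <= row[column] <= high)

    typical = [r for r in rows if not is_edge(r)]
    edge = [r for r in rows if is_edge(r)]
    chosen = (rng.sample(typical, min(sample_size, len(typical)))
              + rng.sample(edge, min(edge_size, len(edge))))

    # Copy each row so the source data is left untouched, then mask PII.
    masked = []
    for row in chosen:
        copy = dict(row)
        for field in pii_fields:
            if copy.get(field) is not None:
                copy[field] = mask_value(copy[field])
        masked.append(copy)
    return masked


def needs_better_test_data(data_bugs, total_bugs, threshold=0.05):
    """True when data issues make up more than `threshold` of all bugs filed."""
    if total_bugs == 0:
        return False
    return data_bugs / total_bugs > threshold
```

Calling `build_dev_sample(rows, "amount", sample_size=1000, edge_size=50)` would return a small masked set with both ordinary rows and outliers, and `needs_better_test_data(6, 100)` would flag that it is time to revisit the data. Lowering `coverage` pulls more rows into the edge-case bucket, which is one way to act on a rising bug count.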