Data rich and information poor. I think that describes most of the companies I’ve worked for. It’s a theme in this piece from Forbes, which notes that many companies use only a fraction of the data they have to make decisions. Certainly many of us who work as data professionals would note that most queries access only a portion of our data, often the newest, while older data sits on storage systems, constantly powered on but rarely included in a report.
Retrofitting archival processes into an existing application can be hard, even more so when there is rarely queried data that clients want to be sure is still accessible somehow. The Stretch Database feature in SQL Server 2016 might help, but I bet it’s a long time before most of us have all our systems on SQL Server 2016 or later versions, let alone find someone to pay for this feature.
There are real costs to keeping this data around, first and foremost of which is the stress for us as developers and DBAs as we try to tune queries that must run against larger and larger data sets. Actually, I’m only somewhat kidding. Management and clients might not care about this, but having to work against larger and larger data sets can be stressful for technical professionals.
There are other, more concrete and measurable costs to keeping this data around, starting with the cost of power and larger storage systems. With many companies keeping multiple copies of production systems around for different purposes, these can be noticeable costs. There’s also the time factor. If our systems run just 10% slower, that’s potentially 10% less business we can handle. Or maybe all that extra data means more annoyance and frustration from our customers due to slow systems.
We are going to get more and more data in our systems. While much of this data may be useful, if we’re overloaded, we may not be able to take advantage of the information. We also might get erroneous results if we don’t recognize that data gets old, and the value we might have from a row today might not exist in a few years. We should also realize that at times we have lots of data that isn’t useful at all for our organization.
I would really start thinking about the ways in which we can actually remove older data from our systems, whether that’s archival to cold systems or even deletion if we’ve moved copies of the data to other applications, such as data warehouse systems. Or maybe just deleting data we know isn’t going to provide any information. Above all, remember that warehouses will fill up at some point, unless you buy more and more (expensive) storage. Keeping all data accessible might not be the best decision for your organization.
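As a rough illustration of what an archive-then-delete process might look like, here is a minimal sketch in Python. It uses the standard library's sqlite3 module purely as a stand-in for a real database, and the table names (orders, orders_archive), the columns (id, order_date), and the function name are all hypothetical, not taken from any particular system. The idea is to move old rows in small batches, with the copy and the delete in the same transaction, so no single transaction holds locks for long.

```python
import sqlite3


def archive_old_rows(conn, cutoff, batch_size=500):
    """Move rows older than `cutoff` from orders to orders_archive.

    Hypothetical schema: both tables have identical columns, with an
    integer primary key `id` and an ISO-8601 text column `order_date`
    (ISO dates compare correctly as plain strings).
    Returns the number of rows moved.
    """
    moved = 0
    while True:
        # One transaction per batch: the copy and the delete commit
        # together, or roll back together if anything fails.
        with conn:
            ids = [row[0] for row in conn.execute(
                "SELECT id FROM orders WHERE order_date < ? "
                "ORDER BY id LIMIT ?",
                (cutoff, batch_size),
            )]
            if not ids:
                break
            marks = ",".join("?" * len(ids))
            conn.execute(
                f"INSERT INTO orders_archive "
                f"SELECT * FROM orders WHERE id IN ({marks})",
                ids,
            )
            conn.execute(f"DELETE FROM orders WHERE id IN ({marks})", ids)
        moved += len(ids)
    return moved


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT)")
    conn.execute("CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, order_date TEXT)")
    conn.executemany(
        "INSERT INTO orders (order_date) VALUES (?)",
        [("2012-03-01",), ("2013-07-15",), ("2023-01-10",)],
    )
    conn.commit()
    print(archive_old_rows(conn, "2020-01-01"))
```

In a real system the archive table would more likely live on a separate, cheaper server, and the cutoff and batch size would come from your retention policy and testing rather than the values assumed here.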