There’s a saying in the data world: garbage in, garbage out. We use that when we can’t get good information from our database because the data we’ve stored isn’t as useful as we would like. That’s a problem, and it’s one reason why data professionals want to spend time thinking about the data we need to collect and how to store it. We want to be sure that we’ve at least made an effort to collect useful data that someone will use.
We sometimes have the data we need, but still struggle to use it effectively. I think this is an area where machine learning and similar technologies may help in the future, but there is a lot of work to be done to allow most of us to take advantage of those tools. In the meantime, many of us make do with basic T-SQL to perform data analysis, generate reports, and provide the answers to questions. When we do so, it’s important that our queries actually work correctly to answer the questions we need.
I don’t want this to be a political discussion, and I would appreciate that any comments be limited to the technical subject. I ran across a piece about a failure of the US government in determining the status of people being checked for immigration status. The interesting quote in this article was “… officials blamed computer code for the problem.” Leaving aside the implications in this case, the idea that computer code, likely some sort of query code, is not working as expected, querying the correct data, or isn’t being used properly is disturbing.
I’ve run across quite a few stories like this from various consultants that were called in to help organizations, only to find out the queries that had been used for long periods of time were incorrect. They didn’t filter appropriately, didn’t convert or aggregate data as intended, or didn’t even query the correct data.
We use databases and queries extensively in today’s world, and the growth is only going to increase. As much as I like the idea of DevOps and more frequent deployments, I also want higher quality for our software. This means that we need to ensure that our queries actually work as intended against databases. Code reviews, independent checks, using known data sets that evolve and include edge cases of data are all ways we can work to ensure we are actually writing the correct queries for our data. Above all, we need to be sure we are using test of some sort, preferably unit tests, that ensure the queries are actually the ones we want.