When Should We Be Cleaning Data?

I was reading Grant Fritchey’s update from the Business Analyst Conference and noticed this quote: “There were lots of sessions on the need to clean data, which implies that we need to do a better job at data collection in order to support this wild new world.” Grant was talking about the fact that many of the sessions seemed to imply that it’s mostly the processes around data that cause issues with data, not the people.

However, is that really what we should do? Do we want to do more and more cleansing at the point where people enter data? I’m not so sure that’s the case. The more I deal with various applications and forms, the less I want to see too many fields and values required. Certainly there are pieces of data that are necessary for action, but I often find lots of additional fields that analysts want that are more of a burden than a necessity.

Most of us as data professionals design tables to handle the needs of an application. We put fields in tables that we expect to fill with data, based on a specification. However, the real world is messy, and the data we want to collect and store isn’t always available. The question is, do we force failures in the application or do we handle missing data?

I don’t want to eliminate DRI, or go to an EAV model for lots of data. However, I do think that we need to ensure we allow defaults and work with developers to allow data in our systems that might not be complete now, but perhaps will be in the future. We should work with analysts to help them build reports that can handle defaults or missing fields. We can work with developers to allow applications to request updates to data later and then design ETL that can efficiently fill in the updated information.
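As a rough illustration of that idea, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for whatever database you actually run. The customer table, its columns, the 'Unknown' default, and the fill_in helper are all hypothetical names invented for this example. The required column keeps its constraint, the optional ones accept a default or NULL, and a later load fills in only what is still missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: only the data needed to act is required.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,                    -- required for action
        region      TEXT NOT NULL DEFAULT 'Unknown',  -- placeholder until known
        birth_date  TEXT                              -- NULL = not collected yet
    )
""")

# The application inserts what it has; nothing fails on the missing fields.
conn.execute("INSERT INTO customer (customer_id, name) VALUES (?, ?)", (1, "Ada"))


def fill_in(conn, customer_id, region=None, birth_date=None):
    """Hypothetical ETL step: backfill fields that are still missing."""
    conn.execute(
        """
        UPDATE customer
        SET region = CASE
                WHEN region = 'Unknown' AND :region IS NOT NULL THEN :region
                ELSE region
            END,
            birth_date = COALESCE(birth_date, :birth_date)
        WHERE customer_id = :id
        """,
        {"region": region, "birth_date": birth_date, "id": customer_id},
    )


query = "SELECT name, region, birth_date FROM customer WHERE customer_id = 1"
before = conn.execute(query).fetchone()

# Later, the region arrives from another source; birth_date is still unknown.
fill_in(conn, 1, region="EMEA")
after = conn.execute(query).fetchone()

# A report can treat the default and NULL as "not yet known" rather than failing.
known_regions = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE region <> 'Unknown'"
).fetchone()[0]
```

The backfill never overwrites a value that is already present, so it can be rerun safely as more data trickles in. Whether a default or a NULL is the better marker for “not yet known” is a design choice to settle with your developers and analysts.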

Applications and software need to be flexible enough to work with problematic data. We have the ability, as data professionals, to help our clients find meaning in data that might not be as complete as we, or they, would prefer. There is still valuable information in the data they have.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.5MB) podcast or subscribe to the feed at iTunes and LibSyn.

About way0utwest

Editor, SQLServerCentral