Data Quality on the Open Web

Systems that take input like this from customers must include data quality checks

I used to hear that one of the strengths of Linux was the thousands of volunteers who would help you get a patch or a fix in record time when you reported an issue. That worked well, but not well enough for many businesses that really wanted a vendor to stand behind the patches. A few companies, like Red Hat, sold support agreements for the “free as in beer” OS that ended up costing customers almost as much as a regular license for another OS. While Linux is a great system, it hasn’t taken over the world the way many people thought it would.

Lately there’s been a different flavor of openness on the Internet. It seems that much of what we read, and much of what is pushed out to us as news or information, is based on crowd-sourcing what’s popular. Facebook shows a “most active” view, Twitter has trending topics and re-tweets, and sites like Reddit use crowd voting to help determine what you see first on their front page.

However, there’s a downside to these open systems: the potential for abuse when a group of people gets together. Google started using the open model in its map services, allowing people to add businesses to maps. That’s a very handy feature, but the addition of a “mark this as closed” button allowed people to abuse the privilege. Whether it was competitors, vandals, or some criminal element isn’t known, but apparently the data Google provides on its maps isn’t necessarily accurate. With so many people using maps on iPhones and Android devices, this could damage businesses that add themselves to the mapping service. I think Google is playing a little fast and loose with crowd voting on data points, but with so many companies looking to capitalize on the social networking phenomenon, I’m not surprised it’s being abused.

Whenever we build systems that take input from users, we have a maxim: garbage in, garbage out. Essentially we aren’t responsible for bad data, but many companies won’t see it that way. They will still feel we ought to do a better job of policing data quality and keeping bad data out of reports and downstream systems. As more and more companies look to incorporate data from customers into their systems, it becomes more important that data professionals build automated scans and manual workflow checks that run before data moves out of staging areas, preventing incorrect data from affecting production systems. A rough sketch of what that gate might look like follows.
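As a minimal illustration of that kind of staging-area gate, here is a T-SQL sketch. All of the object and column names (Staging.BusinessListings, dbo.BusinessListings, ValidationError, and so on) are hypothetical; the point is simply to flag rows that fail basic checks, promote only the rows that pass, and leave the failures behind for a manual review workflow.

```sql
-- Sketch only: validate crowd-sourced rows in a staging table before
-- promoting them to production. Table and column names are hypothetical.
BEGIN TRANSACTION;

-- Flag rows that fail basic quality rules instead of silently loading them.
UPDATE s
SET    s.ValidationError =
           CASE
               WHEN s.BusinessName IS NULL OR LTRIM(RTRIM(s.BusinessName)) = ''
                   THEN 'Missing business name'
               WHEN s.Latitude  NOT BETWEEN -90  AND 90
                 OR s.Longitude NOT BETWEEN -180 AND 180
                   THEN 'Coordinates out of range'
               WHEN EXISTS (SELECT 1
                            FROM   dbo.BusinessListings AS p
                            WHERE  p.BusinessName = s.BusinessName
                              AND  p.PostalCode   = s.PostalCode)
                   THEN 'Possible duplicate of existing listing'
               ELSE NULL
           END
FROM   Staging.BusinessListings AS s
WHERE  s.ValidationError IS NULL;

-- Promote only the rows that passed every check.
INSERT INTO dbo.BusinessListings (BusinessName, PostalCode, Latitude, Longitude, SubmittedBy)
SELECT s.BusinessName, s.PostalCode, s.Latitude, s.Longitude, s.SubmittedBy
FROM   Staging.BusinessListings AS s
WHERE  s.ValidationError IS NULL;

-- Remove the promoted rows; anything still in staging has a ValidationError
-- and waits for a person to review it.
DELETE s
FROM   Staging.BusinessListings AS s
WHERE  s.ValidationError IS NULL;

COMMIT TRANSACTION;
```

The same rules could just as easily run from an SSIS package or a scheduled job; the design choice that matters is that nothing reaches the production table without passing an explicit, auditable set of checks.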

Steve Jones


