Representative Data Challenges

One of the areas where machine learning and artificial intelligence have had a lot of success is image processing. Whether identifying people in pictures or helping cars stay on the road and out of each other’s way, this capability has worked well. It’s not perfect, and not always as accurate as most humans, but it works well enough. Sometimes it’s even better than humans.

There are issues, however, and I think some of them stem from poor data sets. Last year when the pandemic hit, educators were challenged with how to conduct remote exams. While there are some solutions, they don’t always work well. Sometimes the algorithms don’t recognize people, especially non-Caucasians.

The issues raised reminded me of the issues with some bathroom gadgets. I have fairly dark skin, and I’ve always wondered why some sinks and soap dispensers wouldn’t work for me. I hadn’t thought much about it until I saw a few reports like the one listed above.

I don’t think there is anything malicious here, but I do think teams often stick to the happy path when building a new tool. They often test it themselves, but they don’t think widely about how a variety of customers will use it. While I’ve seen many personas, I rarely see anyone creating personas that consider something like skin color, or even a different culture. We often consider roles without deeply examining how those roles are implemented.

We need to work with representative data in whatever area we work in, and that data should include some of the edge or corner cases that might come up. Our dev and test areas can start with small data sets, including those that we build, but at some point we need representative data. Whether we’re building OLTP software, sensors, or image recognition, our data should be well rounded.

While systems don’t need to solve every issue, we ought to consider a large percentage of them. In the case of imaging, understanding the wide variety of people who might use a product would certainly seem to be important. Hopefully future teams won’t make the mistake of assuming that most of their customers look exactly like them.

Steve Jones

About way0utwest

Editor, SQLServerCentral