Machine Learning Challenges

If you know someone well, you may think that you can predict what they’ll do in a situation. Those who know me might think that if I go to New York, I’ll make time to run in Central Park. That’s a good bet, and the last 7 or 8 times I’ve gone there to speak, that’s what I’ve done. However, that’s a very small data set, and you might not want to attach a high level of confidence to that prediction. There have certainly been times when I’ve gone to New York and not run in Central Park.

Machine Learning (ML) is a field that tries to make predictions about data, based on previously known actions or results. By inferring relationships between lots of data and known outcomes, an ML model can guess what might happen with future data inputs. Lots of companies and fields are experimenting with ML techniques, and some of them are working quite well. Others aren’t faring as well, and there is a lot of research underway to better understand the entire field. One possible explanation for the struggles is emergence, which is discussed in this piece.

If it’s raining, I likely won’t run. If it’s 20F, but dry, I will. Those who know me would naturally treat the weather as a factor with a lot of predictive value for me. Time in the city, location, and whether my family is with me are all other factors that would come into play. With enough data, and enough factors, you could likely come up with a model that predicts the likelihood of me running in Central Park when I visit New York City. It might not be perfect, but it would be good if you had enough data.
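As a rough sketch of that idea, the simplest possible "model" just counts how often a run happened under matching conditions. Everything here is an assumption for illustration: the visit log is invented, the two factors (weather and whether family is along) are only a subset of the ones mentioned above, and `run_probability` is a hypothetical helper, not a real ML library call.

```python
# Hypothetical visit log, invented for illustration only:
# (weather, family_along, ran_in_park)
visits = [
    ("dry", False, True),
    ("dry", False, True),
    ("rain", False, False),
    ("dry", True, False),
    ("dry", False, True),
    ("rain", True, False),
    ("dry", True, True),
    ("dry", False, True),
]


def run_probability(visits, weather, family_along, alpha=1.0):
    """Estimate P(run | weather, family_along) by counting matching visits.

    alpha is a Laplace smoothing term: with no matching visits the
    estimate falls back to 0.5 instead of dividing by zero.
    """
    matching = [ran for w, f, ran in visits if w == weather and f == family_along]
    runs = sum(matching)  # True counts as 1, False as 0
    return (runs + alpha) / (len(matching) + 2 * alpha)


print(run_probability(visits, "dry", False))
print(run_probability(visits, "rain", False))
```

With only a handful of visits, the smoothing term pulls every estimate toward 0.5, which is a crude way of expressing the low confidence a small data set deserves.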

However, that’s a very simple question with few decision points. Would the approach work for a larger data set? I bet it would. If we were able to collect data from lots of 5k events, we could perhaps predict how many of the people who register would actually show up. There are certainly more factors at play, and since we can’t control things like weather, the prediction might not be that valuable. Though if we could adjust the amount of resources on the day of the event, maybe that would help reduce costs or make the event more successful.

I think ML is a valuable tool that can help with quite a few problems, but in many cases, we just don’t know enough about the complex factors, such as emergence, that might impact us. I know our SQL Saturday organizers would love a model that helped them decide how much food and drink to order for an event. My view is that there are likely too many unusual factors, but I wonder if we could start a data gathering project to come up with a model. Perhaps we should gather marketing metrics, registration metrics, and more, with the intention of predicting the number of no-shows. Even if it wasn’t useful, it might be a fun project for people in this industry.
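A first pass at such a project could be as small as fitting a straight line between registrations and actual attendance from past events. The sketch below assumes registrations are the only input, and the event numbers are made up; `fit_line` and `predict_no_shows` are hypothetical helpers. A real model would need the other metrics mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_no_shows(registered, slope, intercept):
    """Registrations minus predicted attendance, never below zero."""
    expected_attendance = slope * registered + intercept
    return max(0.0, registered - expected_attendance)


# Hypothetical past events, invented for illustration only.
registered = [120, 200, 260, 310, 400]
attended = [85, 150, 180, 230, 275]

slope, intercept = fit_line(registered, attended)
print(predict_no_shows(350, slope, intercept))
```

The predicted no-show count could then feed the food and drink order, with the caveat that a line through five invented points says nothing about weather or any other unusual factor.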

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.6MB) podcast or subscribe to the feed at iTunes and Libsyn.
