I love baseball. In fact, for the last ten years I’ve participated in an adult league each summer, playing 12-20 games each year against other 40-year-olds. While we play the same game with the same rules as the professionals, we have a very different outlook on it: many of us enjoy the competition and the distraction, but we often view avoiding injury as a higher priority than winning.
Baseball is an interesting game because the professional season is so long, and for over a century a huge amount of statistical data has been captured about each team. Prior to the explosion in popularity of fantasy sports, people had leagues that competed against each other based solely on this statistical data. I even wrote a program with a friend on an Apple II that allowed us to set up teams of random players, read in statistics, and then compete against each other using pitching vs. batting statistics, with the results shown only as text.
I ran across an interesting post from a consultant who examined a hypothesis he had: that winning baseball teams play better at the end of the season. Using a free data set, the R language, and a Jupyter Notebook, he was able to perform an analysis that proved he was right. Whether or not you agree with him, I would guess that this analysis didn’t take that long to complete, and was fun as well.
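As a rough illustration of the kind of comparison such an analysis involves, here is a minimal sketch in Python (the consultant's post used R, and this is not his method). It splits a season's results into early and late games and compares the winning percentages. The function name `split_win_pct`, the `late_games` parameter, and the ten-game sample season are all assumptions made up for this example, not real data.

```python
from typing import Sequence, Tuple


def split_win_pct(results: Sequence[bool], late_games: int) -> Tuple[float, float]:
    """Return (early_pct, late_pct) for one season's chronological results.

    results: True for a win, False for a loss, in the order played.
    late_games: how many games at the end of the season count as "late".
    """
    if not 0 < late_games < len(results):
        raise ValueError("late_games must be between 1 and len(results) - 1")
    early, late = results[:-late_games], results[-late_games:]
    # sum() counts the True values, i.e. the wins, in each slice.
    return sum(early) / len(early), sum(late) / len(late)


# Hypothetical ten-game "season", invented purely for illustration.
season = [True, False, True, False, False, True, True, True, False, True]
early_pct, late_pct = split_win_pct(season, late_games=4)
print(f"early: {early_pct:.3f}, late: {late_pct:.3f}")
```

A real test of the hypothesis would run this kind of split over many teams and seasons and check whether the difference holds up statistically, rather than looking at a single team.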
With R coming to SQL Server and the increasing call from organizations to perform more complex analysis on our data, I think this is a great way to get started. In fact, I’m thinking that I should load up some of the datasets myself and see what kinds of things I can prove or disprove about my knowledge of the game. If nothing else, looking at baseball stats is probably a fun way to brush up on my statistics skills.