Most of us who subscribe to this site are data professionals. We work with large amounts of data for our organizations, usually stored on server-class systems with terabyte-sized, high-performance storage. Whether on premises, in the cloud, or in another data center, our employers have made an investment to provide high-quality data services for our clients. That investment is often a big decision, and setting up a new system to handle lots of data just to experiment with data analysis can seem to take ages.
The R language has been popular for data analysis for years, though the data sets examined were often limited in size, usually because of workstation limitations. One of the reasons Microsoft added R Services to the data platform was to move analysis closer to big stores of data and to make it easier for organizations to “operationalize,” or deploy, their analysis and models to a wider audience.
Deciding when to make that investment can be tricky, but the more value someone can prove from a smaller experiment, the more likely an organization is to move forward. Recently, I ran across an interesting article in which the author analyzed a billion-row dataset on a commodity laptop. In this case, the laptop was a MacBook Pro costing US$4000, but that’s a pittance compared to investing in HDFS storage, a Big Data Cluster, or even a large cloud experiment.
What caught my eye here is that the analysis tool used, OmniSciDB, was engineered to run on CPUs, not GPUs, and performed very well in analyzing the data. I haven’t found the time or set up the disk space to try to load the billion rows into a SQL Server columnstore index, but I’d be curious how that would perform on the same data. The queries run are fairly simple aggregations, and my guess is that SQL Server would perform extremely well once the index was built. If someone else wants to try it and take notes, I’d love to read about the experiment in an article on SQLServerCentral.
It has become more and more likely that, before we embark on any large project in an enterprise, we perform some sort of prototyping and development on a small system. I think that’s true whether we’re building a web app or setting up a data science experiment that might drive our business forward. I always enjoy reading about someone who has tried a large-scale analysis experiment on a workstation, not a server, and I hope we continue to see more people doing this and sharing their results in the future.