Hadoop has received a lot of media attention in the last few years. In fact, Microsoft has spent a lot of resources to build the HDInsight version of the platform and integrate it into SQL Server. I’ve read quite a bit about how to set up and query with Hadoop, but haven’t used it for a real project. In fact, relatively few people seem to be finding it to be a replacement for, or a better solution than, SQL Server. We published a great introduction to Hadoop written by David Poole a while back, and recently I ran across another nice writeup from someone I think is a very talented SQL Server professional.
Michelle Ufford (@sqlfool | b) wrote a piece asking if Hadoop is better than SQL Server. Michelle notes that Hadoop is a different platform, and it’s a great way to consume lots of data. In fact, she has a graph from EMC talking about the data explosion and how we are still at the low end of the exponential growth curve of data production. It’s a sobering thought, and I tend to agree with Michelle and EMC on the growth of data.
I had hoped Microsoft would do more with FILESTREAM and FileTable to help meet the challenges of large volumes of data, but it seems that very little has been done with those features in the last version of SQL Server. I have little hope that additional investment will come in the future. Instead, it seems Microsoft is leaning towards using Hadoop as one way to process and consume large volumes of data.
I wrote about Hadoop in 2009 when it was a young project, and I suspected it would enhance and work with, rather than supplant, the RDBMS. There are certainly other technologies out there to help with large volumes of data, but if you are working with data that exceeds what a single instance of SQL Server can handle (at a reasonable cost), you might think about learning a bit about Hadoop. It might not solve your issues, but if it can, it would be good to know something about it.