Six or seven years ago, Hadoop was the big thing. It was going to solve our big data analytics needs, providing cheap storage and query power on commodity servers. More and more companies were going to be using it. Microsoft invested in HDInsight, and SQL Server got PolyBase to query data directly from HDFS. I was seeing the Hadoop elephant everywhere. I still remember popping into a few sessions at SQLBits to try to learn a bit more about how Hadoop worked.
In the last couple of years, Hadoop has somewhat dropped off the radar as "the thing" that most companies need to become data driven and deal with large amounts of unstructured data. I found this analysis that talks about why Hadoop hasn't taken over the world.
The short answer? Real-time needs, cloud computing, and containers. Really, though, I think the complexity of Hadoop became a problem. It was too hard for most companies to deal with, and too few were willing to invest in the large infrastructure and skills required to manage such a system. I'd say the same thing about Kubernetes, but it's evolving rapidly to become easier, and it's dirt simple in the cloud. I suspect we'll see more Kubernetes deployments in the cloud than on premises.
The other issue is that Hadoop runs batch jobs, which isn't what many organizations want. They already deal with, and complain about, plenty of relational batch jobs, whether that's ETL to a warehouse, cube processing, or some other delayed process. They want to query data in place, which is becoming more commonplace all the time.
Of course, one other important point from the piece is something I believe as well: the relational database, or data warehouse, is not going away. It's still important to many organizations, and it's useful for handling lots of reporting. With the growth of the SQL Server platform, you might even do more AI/ML analysis on your data in place, without the need to move it to an HDFS platform.