How large is your analytics system? Do you have more than one machine for analytics? Do you run a cluster of machines under Hadoop and YARN to analyze your data? Are there hundreds, or even thousands, of nodes in regular use? Some of you might have what you consider to be a large system, but I bet it isn’t as large as Microsoft’s cluster.
Microsoft believes it has the biggest YARN cluster, with over 50,000 nodes in a single cluster, used to process multiple exabytes of data from the company’s various properties and systems. I certainly haven’t heard of a system this large, and I really wonder what it costs to run. After all, a 50,000-node cluster has to be a significant expense, though in the grand scheme of Microsoft’s $100 billion in revenue and $38 billion in expenses, perhaps even 100,000 machines can’t really move the numbers.
The cluster has essentially been running a private version of Azure Data Lake for years, one that Microsoft’s internal developers and analysts use to access a common pool of data. In fact, because of their scale needs and their desire to limit copying data between clusters, they have contributed a number of fixes back to the Apache YARN project to help ensure the software can scale to tens of thousands of nodes. There is some discussion of how they’ve allowed YARN to grow to larger scales, and it’s an interesting solution: it essentially allows some overbooking of resources, on the assumption that there are always spare cycles available for processing data. It’s a great test site for Azure Data Lake, and something more of us might use in the future.
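The overbooking idea resembles the opportunistic-container feature in stock Apache Hadoop, which lets YARN queue extra containers at NodeManagers and run them only when guaranteed workloads leave spare capacity. A minimal sketch of enabling it in yarn-site.xml, assuming a Hadoop 2.9+ cluster (these are the standard Hadoop properties, not anything Microsoft-specific, and the queue length of 10 is just an illustrative value):

```xml
<!-- yarn-site.xml: let the ResourceManager allocate opportunistic
     containers, which queue at NodeManagers and execute only when
     guaranteed containers leave spare cycles -->
<property>
  <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- cap how many opportunistic containers may queue per NodeManager -->
  <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
  <value>10</value>
</property>
```

Applications then opt in per container request, so latency-sensitive jobs can keep guaranteed allocations while batch work soaks up the slack.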
I doubt many of us will ever need to work on data sets that large, and I certainly wouldn’t want to be responsible for that much of a data lake, but I do think these are interesting problem domains that someone should look at. Certainly there are always large organizations and governments with ever-growing pools of data that will likely end up in a data lake of some sort. And who knows, perhaps the definition of large will continue to grow to the point where 1,000 nodes in a cluster is considered “small”, and that’s what many of our businesses implement in the future.