A Good Use for Hadoop

Netflix is an amazing company. I’ve been a customer for over a decade, starting with the well-known red envelopes that moved DVDs through the mail. Since then I’ve continued to subscribe as they became a streaming company, and I have enjoyed quite a few of their original programming efforts.

However, I’ve also been fascinated by the way in which Netflix has made use of technology to grow a unique kind of company. They own very little, having used Amazon’s AWS to host their infrastructure, and they mostly serve content they acquire through contracts with various media companies. They are famous for their Chaos Monkey approach, which makes machines fail at random to test the fault tolerance of their entire infrastructure. I read recently about the closing of their last, small data center, so outside of employees’ laptops, they keep all systems in the cloud.

I also ran across a post on the scale of their data flow, and it’s amazing. Apparently their events are generating 1.6PB of data a day. That’s incredible, and a scale at which very, very few of us will ever work. Personally, I find that interesting, though I’d prefer not to be working with a PB of changing data a day. I might feel differently if I worked at a company like Netflix that has obviously been successful in dealing with data at that scale.

The post notes that they put this data into Hadoop, where it can then be queried. I’ve often wondered what the domains are for using Hadoop. Most of the people I know who have tried it aren’t really working at large scales. I think that a well-built ETL process would allow their data to work easily in a SQL Server data warehouse, and possibly even the Azure SQL Data Warehouse. However, at 1.6PB a day, Hadoop seems like a much better choice.

I’m curious how often that data is queried. Is most of that 1.6PB actually used in reporting, or is much of it lost in the shuffle and ignored? Is there a process to aggregate some of this raw data and then delete the older values? Can 1.6PB x 365 actually be useful in analyzing your business? I’m sure some is, but wouldn’t a lot of those events lose their value over time?

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.7MB) podcast or subscribe to the feed at iTunes and LibSyn.

About way0utwest

Editor, SQLServerCentral

1 Response to A Good Use for Hadoop

  1. Hadoop really shines relative to a traditional SQL data warehouse when your objective requires you to perform an operation over every record in an enormous dataset that has a relatively complex schema.

    For example, the Enron corpus contains about 600,000 emails. Email data is a fairly complex mix of semi-structured and unstructured data. Emails have one sender, but they can be sent on behalf of another sender or group. Emails have one or many recipients, and recipients can be individuals or groups. Emails carry several different dates. The body of the email is typically unstructured text and/or HTML. Emails have zero to many attachments, which can themselves be other emails. Emails can also form a web of branching thread hierarchies.

    Organizing email data into a relational model would involve a significant impedance mismatch. Tables, facts, and dimensions all have to be considered at the time you build the warehouse. Then querying every record and relationship would require a massive server, not to mention deep pockets and lots of patience to process the data.

    On the other hand, if you forgo the relational restructuring problem and go with a NoSQL type of schema, you will likely get the data into a queryable state much faster. Combine this with something like Hadoop, and you won’t have to consider the schema of the data at the time you populate it. Every query (e.g. a MapReduce job) deals with the schema when the data is read. Any part of the schema not used by the job is basically ignored.
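
    As a rough illustration of that schema-on-read idea, here is a minimal sketch in plain Python rather than actual Hadoop code. The sample records, field names, and the functions map_sender, reduce_counts, and run_job are all hypothetical, made up for this example. Each raw record is stored as-is, and the job parses only the one field it cares about at read time.

```python
import json
from collections import Counter

# Hypothetical raw records, stored as-is with no schema agreed up front.
RAW_EMAILS = [
    '{"from": "alice@example.com", "to": ["bob@example.com"], "body": "hi"}',
    '{"from": "bob@example.com", "to": ["alice@example.com"], "attachments": ["q3.xls"]}',
    '{"from": "alice@example.com", "on_behalf_of": "dave@example.com", "to": ["carol@example.com"]}',
]


def map_sender(raw_line):
    """Map step: parse the record at read time and emit (sender, 1)."""
    record = json.loads(raw_line)
    # Only "from" matters to this job; every other field is simply ignored.
    yield record["from"], 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_sender(line))
    return reduce_counts(pairs)


sender_counts = run_job(RAW_EMAILS)
print(sender_counts)
```

    A different job, say one counting attachments, would read the same raw lines and look at a different field, without anyone having to reload or remodel the data first.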

    Now, it’s very possible to fit all 600,000 emails in the Enron corpus on one server and process them. After all, you can do NoSQL stuff in SQL Server with something like XML. However, the complexity of some of the queries might bring the processing rate down to 10 emails per second on one machine. That means every run of the query will take over 16 hours (600,000/10/60/60). This is where the power of a distributed processing environment can improve things. Running the same job on data spread across 10 machines gets the time down to about 1.7 hours, assuming the work divides evenly.
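
    The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name full_scan_hours is made up here, and it assumes perfectly linear scaling with no coordination or shuffle overhead, which a real cluster will not quite achieve.

```python
def full_scan_hours(record_count, records_per_second_per_machine, machines=1):
    """Estimate hours for one full scan, assuming ideal linear scaling."""
    seconds = record_count / (records_per_second_per_machine * machines)
    return seconds / 3600  # 60 seconds x 60 minutes


# Figures from the comment: 600,000 emails at 10 emails per second per machine.
single_machine = full_scan_hours(600_000, 10)
ten_machines = full_scan_hours(600_000, 10, machines=10)

print(f"1 machine:   {single_machine:.2f} hours")
print(f"10 machines: {ten_machines:.2f} hours")
```

    With these inputs the estimate works out to roughly 16.67 hours on one machine and 1.67 hours on ten.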

    It’s now important to note that the 600,000 emails were collected from only 158 people (over several years). Imagine the complexity and scale of trying to perform an Enron type of analysis of a few thousand employees. I’ve done engineering work for companies that have over 100,000 employees. That would be really big data.

    Email data is only one of many great Hadoop-able problems. Hadoop is also great at processing seismic data, well logs, server logs, click streams, and any data processing that requires a full scan of every record in a complex schema of an enormous dataset.

    Hope this helps.

