Netflix is an amazing company. I’ve been a customer for over a decade, starting with the well-known red envelopes that moved DVDs through the mail. I’ve continued to subscribe since they became a streaming company, and I’ve enjoyed quite a few of their original programs.
I’ve also been fascinated by the way Netflix has used technology to grow a unique kind of company. They own very little: their infrastructure is hosted on Amazon’s AWS, and they mostly serve content acquired through contracts with various media companies. They are famous for their Chaos Monkey approach, in which machines are made to fail at random to test the fault tolerance of their entire infrastructure. I read recently about the closing of their last, small data center, so outside of employees’ laptops, they keep all systems in the cloud.
Recently I ran across a post on the scale of their data flow, and it’s amazing. Apparently their events generate 1.6PB of data a day. That’s incredible, and a scale at which very, very few of us will ever work. Personally, I find that interesting, but I’d prefer not to be working with a PB of changing data a day. I might feel differently if I worked at a company like Netflix, which has obviously been successful in dealing with data at that scale.
The post notes that they put this data into Hadoop, where it can then be queried. I’ve often wondered what the right domains are for using Hadoop. Most of the people I know who have tried it aren’t really working at large scales. I think a well-built ETL process would let their data work easily in a SQL Server data warehouse, and possibly even in Azure SQL Data Warehouse. However, at 1.6PB a day, Hadoop seems like a much better choice.
I’m curious how often that data is queried. Is most of that 1.6PB actually used in reporting, or is much of it lost in the shuffle and ignored? Is there a process to aggregate some of this raw data and then delete the older values? Can 1.6PBx365 actually be useful in analyzing your business? I’m sure some of it is, but wouldn’t a lot of those events lose their value over time?
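To put 1.6PBx365 in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: I’m assuming decimal units (1PB = 10^15 bytes), a constant rate all day and all year, and no deletion or aggregation of old data. The function names are mine, not anything from the Netflix post.

```python
# Back-of-the-envelope sizing for an event stream of 1.6 PB per day.
# Assumptions (not from the original post): decimal petabytes,
# a flat ingest rate, and nothing ever deleted.

BYTES_PER_PB = 10**15      # assumption: 1 PB = 10^15 bytes
SECONDS_PER_DAY = 86_400
DAILY_PB = 1.6             # figure quoted in the post


def yearly_petabytes(daily_pb: float, days: int = 365) -> float:
    """Total petabytes accumulated over `days` if nothing is purged."""
    return daily_pb * days


def bytes_per_second(daily_pb: float) -> float:
    """Average sustained ingest rate implied by the daily volume."""
    return daily_pb * BYTES_PER_PB / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"Per year: {yearly_petabytes(DAILY_PB):,.0f} PB")
    print(f"Per second: {bytes_per_second(DAILY_PB) / 10**9:,.1f} GB/s")
```

Under those assumptions, that works out to roughly 584PB a year and about 18.5GB every second, which makes the question of aggregating and purging older events feel a lot less academic.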