I have made a career of working with SQL and databases. Usually I’ve looked for interesting companies and people, but I’ve avoided extreme situations. For me, that often is very large, or very real time environments. I once declined a job for a 13TB database on SQL Server 6.5. My suspicion is that job would have taken me away from my young children and wife far too often.
Facebook has a lot of users, and a lot of queries they run. With over 1billion daily users and hundreds of TBs of daily uploads, they really need strong databases. While they have multiple databases, and that includes SQL ones, they have struggled with analytic queries in the past. They started using Presto as a solution, an open source query engine for running analytic queries against data in different storage locations like RDBMSes or in something like Hive/HDFS. This sounds like what Polybase does for SQL Server.
The problem with any engine at Facebook’s scale is the load. While they like Presto, they needed to make it work better. They initially built a caching layer that required users to build ETL jobs to load data into SSDs attached to the Presto cluster. However, they outgrew this and ended up turning to a distributed file system called Alluxio.
The article linked above talks a bit about how this works, and allows users to query petabytes of data. Most of us have users that often don’t qualify their queries completely, so we expect that some queries that might need to scan 100GB end up reading much more until the users tune them appropriately.
The thing I found interesting in here is that some queries were taking up to 10s, which users found unacceptable. The move to Alluxio gave them a 30-50% boost, which doesn’t sound like a lot. 5-7s over 10 isn’t a great savings to me. The reduction in reads, is impressive, which is good, but I wonder to what expect there is some management and tuning needed here to ensure the cache works well.
I have no desire to work on these extreme systems, but I am glad someone does. The lessons and tricks learned here often trickle down to improve the daily performance many of us see in our smaller systems. I think that the Hyperscale work Microsoft is doing, and the Big Data Clusters, are fascinating ways of organizing SQL Server based servers, and some of that tech will likely trickle down and help us continue to improve our smaller systems’ performance over time.
Note: Podcasts are suspended for a week as I deal with the PASS Summit.