I had been meaning to post this, and since I just finished a piece that references it, I decided to post the picture. It's from Small Data SF, where the opening keynote referenced the Google BigQuery demo of a 1 PB database.
Here is the slide shown later in the talk.
What the demo didn't mention was that the query cost $5,580. The conclusion: big data is just too expensive to query often.
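To see where a bill like that comes from: BigQuery-style on-demand pricing charges by bytes scanned, not by rows returned, so the cost of a full-table query scales with the table, not the result. Here's a minimal sketch of that arithmetic; the $6.25/TiB rate is an assumption (on-demand rates vary by region and have changed over time), so treat the numbers as illustrative.

```python
# Rough sketch of scan-based (BigQuery-style on-demand) query pricing.
# You pay per byte scanned, regardless of how many rows come back.
PRICE_PER_TIB = 6.25  # USD per TiB scanned -- assumed rate, varies by region/era

def scan_cost_usd(bytes_scanned: int, price_per_tib: float = PRICE_PER_TIB) -> float:
    """Cost of a query that scans `bytes_scanned` bytes."""
    tib_scanned = bytes_scanned / 2**40
    return tib_scanned * price_per_tib

# An unbounded scan over a 1 PiB table:
print(f"1 PiB scan:  ${scan_cost_usd(2**50):,.2f}")        # -> $6,400.00
# The same question asked against a 10 GiB working set:
print(f"10 GiB scan: ${scan_cost_usd(10 * 2**30):,.2f}")   # -> $0.06
```

The gap between those two numbers is the whole point: the price of the query tracks how much data it touches, which is why working-set size, not raw table size, is what matters.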


You’re spot on. I’ve been dealing with huge analytics datasets for 15-ish years and I can’t think of a single time I needed to query HUGE data in one single query. There’s just no need.
I’ve had people say, “but I have marketing analytics data on customers going back 20 years that I want to query to train a recommender ML algo.” No you don’t. My buying habits from even 2 years ago have no bearing on my activity today from a targeted marketing perspective. Ask any marketer. I’m not saying the data is valueless, I’m saying querying all of that data _in a single query_ is irrelevant.
The analogy: 20 years ago folks would ask during interviews, “What’s the largest database you’ve worked on?” Typical answer: “1 TB.” Response: “Wow, that’s big, you must be good at tuning performance.” Hogwash. The size of the db is irrelevant; it’s the working set size that matters. And 20 years ago I would’ve rather hired someone who answered something like, “My db was 1 TB, then they hired me and I looked at all the stupid modeling decisions from the last guy and I managed to squeeze it into 10 GB. Let me tell you how I did it. Then I rolled the EE licenses back to Standard.” HIRED!!
Same thing with big data/big queries. It’s the working set size that matters. And that’s rarely as large as you think it is.
Most people don’t need a select * or an unbounded, all-my-data-for-the-last-decade query. If they do, it’s for a very rare and specific reason.
Great article and love the slide. Thanks!
Thanks, it was one I really enjoyed hearing about.