I was watching a talk from Google on how they back up their data. After all, one of the biggest assets Google has is the tremendous amount of data that they have collected over the years. There are many systems at Google, and certainly lots of data in each of them. In this case, Gmail was the system being discussed, and the data is in the exabyte range. If you don’t have an hour to watch the talk, then there’s a nice summary at High Scalability.
In the talk, there are certainly things that Google aims to do with their approach. First, they can’t lose data. That’s a priority, as it should be for all of us who manage data. They also focus on restores, not backups. In fact, if they can make restores easier by adding work and complexity to backups, that’s a trade-off worth making. I haven’t typically viewed the restore process this way, though I do think restores are ultimately the most important part of any recovery task. However, I haven’t really thought about how I could actually make restores easier by changing something at backup time. I’ve often tried to make backups quicker, or take them more often, but perhaps this is an area to re-examine. Are there things you can think of that would make restores easier? Maybe not easy enough for your cat to kick off (as discussed in the video), but easy for the average sysadmin at your company?
Google wants redundancy, which includes people. They can’t depend on any one machine, one tape, or one person. Therefore, they need to have multiple copies of data and more automation that reduces those single points of failure. Along those lines, our clients and customers don’t need to know if we have 3 copies, 7 machines, or any other configuration. Our responsibility is to ensure our customers can access data.
Why should we care what Google does with Gmail or any of their systems? Well, I only see our databases growing, with sizes going from GB and millions of rows to TB and billions of rows, or even to PBs. There are lessons that we can learn about the management of data at scale, and the ways in which our customers might perceive the availability and accessibility of their information. Google has learned they need to be more efficient with resource usage. Whether that’s disks or people, they can’t require 1000 times more resources for 1000 times more data. We should take note of that.
Perhaps the best lessons from Google are in the areas of testing and expectations. They test constantly, to ensure that they can actually recover data. While I think the SQL Server backup system is very solid, I’d be regularly testing restores to ensure that I really can recover from backup files, from external disks and tape, or even complete a restore on a backup system. The other lesson is that Google expects things to fail, so they plan for that, and aren’t surprised by failures. SQL Server gives us options here with Always On and other HA technologies if we can take advantage of them. With a single RDBMS instance, there isn’t a lot most of us can do, but we can at least be prepared to rebuild our instance elsewhere as a last resort.
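As a rough sketch of what a regular restore test might look like, here is a small Python helper that builds the T-SQL for one test cycle: verify the backup file, restore it under a scratch name, run an integrity check, and clean up. The function names, the backup path, and the scratch database name are all hypothetical, and the sketch assumes the restore runs on a separate test instance where the original file locations are free (otherwise `WITH MOVE` clauses would be needed). It only generates the statements; running them through your own tooling is left out.

```python
from typing import List


def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(value: str) -> str:
    """Build an N'...' string literal, doubling any single quote."""
    return "N'" + value.replace("'", "''") + "'"


def restore_test_script(backup_path: str, scratch_db: str) -> List[str]:
    """Return the T-SQL statements for one restore test, in execution order.

    Assumes a scratch instance, so no WITH MOVE clauses are generated.
    """
    disk = quote_literal(backup_path)
    db = quote_name(scratch_db)
    return [
        # Check that the backup set is complete and readable.
        f"RESTORE VERIFYONLY FROM DISK = {disk};",
        # Actually restore it; a verify alone does not prove a restore works.
        f"RESTORE DATABASE {db} FROM DISK = {disk} WITH REPLACE, RECOVERY;",
        # Confirm the restored database is consistent.
        f"DBCC CHECKDB ({db}) WITH NO_INFOMSGS;",
        # Drop the scratch copy so the next test starts clean.
        f"DROP DATABASE {db};",
    ]


if __name__ == "__main__":
    # Hypothetical path and database name, for illustration only.
    for statement in restore_test_script(r"D:\backups\Sales.bak", "Sales_RestoreTest"):
        print(statement)
```

Scheduling something like this against every new backup file, and alerting when any step fails, is one way to move from "the backup completed" to "the restore is known to work."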