One of the interesting things I saw in the recent GitLab outage and data loss was the fact that none of their backups were available. They use PostgreSQL, and I’m not familiar with how the modern PostgreSQL engine handles backups or what options it offers, so I’m not knocking either GitLab or PostgreSQL. It’s possible one or the other had fewer options than we do in SQL Server, with full, differential, log, and filegroup backups, all taken during live database activity.
There was a live stream and a Google Doc open during the incident, showing the response by their employees (and plenty of Hacker News comments). Kudos to GitLab for their bravery and transparency in showcasing their mistakes and choices. I’ve been in similar situations, and the war room can be chaotic and stressful. There has been no shortage of times when someone makes a mistake under pressure and we scramble to recover from the damage. I’ve made those mistakes, and I understand how they happen when you get desperate and tired. This is one reason I’ve usually insisted that when an incident is declared, I immediately send at least one person home to rest. I never know what time I’ll need to get them back.
Reading the notes, I see a number of issues. One of the responders doesn’t know where the once-a-day backups are stored (1). The location they check contains files only a few bytes in size, so the backups might not be working (2). There are no disk snapshots in their Azure space for the database servers (3), though the NFS servers get them. The snapshot process is incomplete, in that once snapshots are made, some data is removed from production and will be lost in this recovery (4). The backups to S3 don’t work (5). All of this results in restoring a backup that is six hours old. For people who commit code often, that could be a lot of data. Hopefully there weren’t too many merges and branch deletions in that window for customers.
A backup doesn’t matter. A restore matters. No matter what backup process you have, if you don’t test it, you don’t know whether you can recover. In fact, with databases (really, with any system), you need to test restores regularly because the backup process can fail. I learned this early in my career when one of our admins realized the fancy tape changer that let him change tapes only once a week was broken. The drive had stopped writing and he never noticed.
Not only is it important to monitor that the backup process runs; it’s also important to ensure the backup files exist where we expect them to. If that’s a remote location, you need monitoring there as well. It’s equally important to restore backups regularly. Ideally you’d test every one, but at minimum set up a regular rotation that tests a restore once a week to ensure your process is working.
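That kind of monitoring can be sketched as a small check script. This is a minimal example, not a production monitor; the directory path, size threshold, and age threshold are all hypothetical values you’d tune for your own environment. It catches exactly the failure modes above: no files where we expect them, files only a few bytes in size, and a backup that is older than it should be.

```python
import os
import time

# Hypothetical defaults -- adjust for your environment.
MIN_SIZE_BYTES = 1024 * 1024     # anything smaller is suspect, like GitLab's few-byte files
MAX_AGE_SECONDS = 24 * 60 * 60   # we expect at least one backup per day

def check_backups(backup_dir, min_size=MIN_SIZE_BYTES, max_age=MAX_AGE_SECONDS):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    try:
        entries = [os.path.join(backup_dir, name) for name in os.listdir(backup_dir)]
    except FileNotFoundError:
        return ["backup directory missing: " + backup_dir]

    files = [p for p in entries if os.path.isfile(p)]
    if not files:
        return ["no backup files found in " + backup_dir]

    # Is the newest backup recent enough?
    newest = max(files, key=os.path.getmtime)
    if time.time() - os.path.getmtime(newest) > max_age:
        problems.append("newest backup is stale: " + newest)

    # Are any files suspiciously small?
    for path in files:
        if os.path.getsize(path) < min_size:
            problems.append("suspiciously small backup: " + path)

    return problems
```

A script like this still only proves the files look plausible; it’s a complement to, not a substitute for, actually restoring a backup on a schedule.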
If you don’t, you risk not only data loss, as GitLab experienced, but an RGE: a resume-generating event, and something none of us wants to experience.