Kim Tripp is a great speaker, with a wealth of knowledge. This session was talking about some of the issues with VLDBs.
What’s a VLDB? Definitely the TB range, but Kim talks about the issues that come with 100s of GBs as well. She mentions one client with a 3TB table. Definitely time for VLDB planning.
She also says that in the last couple of years, almost all of her clients use some type of partitioning. Granted, Kim tends to work with larger, more expensive operations, but interesting to see that partitioning is part of many clients.
This is a DR session. When data is damaged, what do you do? The first think is think about recovery as a step by step process. Don’t freak out.
Your first questions should be who, what, and when. Who noticed it, what is wrong, and when did it happen. Key questions to know, not necessarily for blame, but for recovery purposes. Knowing who is important for security or tracing purposes. Kim says no select/insert/update/delete on tables. Always some other mechanism to access data.
If you don’t know who did something, then it becomes more of a detective case to investigate what might be wrong. Definitely more challenging if you don’t know who did something.
What do you do first? It depends. You might not be able to take the database or server offline, so what do you do? It’s an issue. You can’t usually shut down the database services, so you must think carefully about what you will do.
Recovery from a dropped table
Interesting. A nice sales/customer/employee/product table relationship, with “Sales” being dropped. That’s the easiest one to delete with FKs, but it’s the most important table.
A good backup is needed first. As Kim notes, if you don’t have a good backup, you can’t go very far in recovery.
The sales table gets dropped, and one of Kim’s insert scripts starts failing. As expected, if this for real, likely your phone is going to ring.
First thing: note the time that you first hear of something going wrong.
Second thing, Kim looks to set the database access to “restricted access” meaning only dbos are allowed in. I have never used this, but if you really lost something big, you want to do this. If this happens, that means that your application also needs to handle this and show a good message to clients.
Is it better to be back up or have all the data. That’s a good question. It probably will be different for each business, and maybe each database. What is more important? Might be just getting back up in many cases.
MAKE SURE YOU GET THE TAIL of the log if you can. You don’t want to restore without having this.
Use stopat in all restore statements. It doesn’t have an effect in RESTORE DATABASE, but allowing this means that you get in the habit of using it in all RESTORE commands. The same with the NORECOVERY option.
Her final statement has a WITH RECOVERY, RESTRICTED_USER. That’s nice to have, and let the DBA check without users/systems connecting right away.
Right away, Kim creates a snapshot. She can then easily go back to this point. This is also then used to compare to an investigation database that is a copy of what you restored.
She also resets the identity seed to a higher value in this case. She has a gap, but it’s a known gap that might be useful if you recover more data.
One of the key things that I see is Kim keeping notes of times, and what is done, as she goes through the recovery process for this demo. That’s key in disasters. Having a pad/pen or Notepad around is indispensible.
Kim has a cool script that starts running log restores with ever incrementing STOPAT values to find out where there was an issue. In each restore, then uses tablediff to grab the differences between the snapshot and the test restore to get the missing data that might not be in the database that is now live.
- contain the damage
- decide if data loss or downtime is more important
- recover the damaged data
- bring offline components online
- prevent this in the future
A good DR list to keep handy, and think about first.
Containment part 2
One important thing is to contain the issue and perhaps not take the entire db or server offline.
One thing in Enterprise Edition is to take the secondary files offline, can be useful to take an .ndf file offline. You cannot take an .mdf or .ldf offline, but you can with a .ndf, so for large dbs, separating tables into separate filegroups is a good thing.
Kim shows an online index rebuild that moves a table across multiple files and it’s pretty cool. While users are in the system, the table is moved to a new filegroup/files. This partitions the table to get better control of disasters.
Now we lose a file, and the file needs to go offline. This may or may not affect your application. The connections are killed to the db, but a good application will retry and handle this as a transient error. Users that access the partition that is offline will get a trappable, level 16 error.
Restoring a single file with the move command. That’s cool. Never had to do that myself. Not sure you could do this in SQL 2000 when I had a few disasters. Not sure I had Enterprise Edition in most of my jobs.
Very cool session, highly recommended if you have to plan for, or execute, a DR plan.