I saw a note from someone who had a database set up in a High Availability (HA) configuration for production. This person had received an 823 error on the primary server, but a failover hadn’t occurred. This wasn’t a critical error, but one that noted some anomalies in a few pages, which potentially could be fixed by the automatic page repair in SQL Server.
In this case, the individual would have liked to have had the system fail over, just in case there was a chance this would impact production. To make this happen, an alert on the error would be needed, which would then force a failover. This isn’t part of the native SQL Server configuration, and this individual was concerned. However, there are certainly cases where a failover might not be warranted, because there is some automatic remediation, such as Automatic Page Repair.
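To make that concrete, here is a rough sketch of the kind of alert-driven failover this person wanted: a SQL Server Agent alert on error 823 that runs a job to fail the Availability Group over. The AG name (ProdAG) and secondary server name (SQL02) are made up for illustration, and a planned AG failover has to be issued from the target secondary replica, which is why the job step shells out to sqlcmd. Treat this as a starting point, not production code.

```sql
-- Sketch only: assumes an Availability Group named [ProdAG], a secondary
-- replica on SQL02, and SQL Server Agent running. Names are illustrative.

-- A job whose single step issues the failover. The ALTER AVAILABILITY GROUP
-- ... FAILOVER command must run on the target secondary, hence the sqlcmd hop.
EXEC msdb.dbo.sp_add_job @job_name = N'Force AG Failover';
EXEC msdb.dbo.sp_add_jobstep
    @job_name  = N'Force AG Failover',
    @step_name = N'Fail over to SQL02',
    @subsystem = N'CmdExec',
    @command   = N'sqlcmd -S SQL02 -Q "ALTER AVAILABILITY GROUP [ProdAG] FAILOVER"';
EXEC msdb.dbo.sp_add_jobserver @job_name = N'Force AG Failover';

-- An alert that fires when error 823 is logged and runs the job above.
EXEC msdb.dbo.sp_add_alert
    @name       = N'Error 823 - disk I/O error',
    @message_id = 823,
    @severity   = 0,
    @job_name   = N'Force AG Failover';
```

Note that a planned (non-forced) failover like this requires the secondary to be in a SYNCHRONIZED state; an asynchronous replica would need FORCE_FAILOVER_ALLOW_DATA_LOSS, which is exactly the kind of decision you may not want automated.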
If you are running an HA system, I assume you want control over when and why a failover occurs. If you know there is a delay for client connections after a failover, or potentially fewer resources on the secondary node, or some other impact when you move to a secondary node, perhaps you want to be more careful about when a failover occurs. Wouldn’t you want configurable rules, even ones that might require manual setup from a DBA? What if you have a secondary system that isn’t really designed to handle the full, normal workload and is just for emergencies? Do you want to fail to a secondary node if the primary node could still be used?
There is a whole spectrum of situations where we might or might not want automated failover for our systems. In fact, if you have something like Mirroring or Log Shipping, it can be a complex process to fail back. In those cases, you really want to be sure something has broken badly enough that a failover is warranted. I’m sure there are plenty of cases where you might not even want to script a failover because you’d rather take a short outage than fail to a secondary machine only to need a failback a short time later.
Most of us worry a failover won’t happen when the primary system goes down. That’s the main concern we have, and certainly we want to test and be sure this works as we expect in an emergency. I’d also suggest it might be worth taking a few minutes to think about what happens if your system fails over when you don’t want it to. Those failovers can be more problematic, especially if they occur too often and users are dealing with an unreliable system that seems to disappear or pause as it moves from node to node on a regular basis. That might be worse for your reputation than a system that doesn’t fail over in an emergency.