I got a call a little before 8 last night from the Windows OS staff at my job. A routine patching process (Microsoft systems need so much patching you might as well make a routine of it) had failed, and they could not get the database system to start up. I could not figure it out from their description, and I soon had a sick feeling in my stomach: this server has a very public function, and a failure, even on the weekend, would be highly visible. Some of our offices are open on Saturday and use this system for routine business with the public; if it was not fixed that evening, thousands of people would be affected.

The SQL Server system was in a cluster, and the Windows folks were trying to patch the primary node (machine) of the cluster. It was supposed to have automatic fail-over to a backup system, but the fail-over process itself had failed. I suggested something they could try to get it back to a working state, then went back to my woodwork, desperately hoping I would not get another call, but knowing in my heart I would.
Sure enough, about 20 minutes later I got another call: my fix had not worked, so I had to look into it myself. I went to the basement and logged on, expecting a long night. I could not find any clue in the error messages in the log. The SQL Server system was indeed trying to fail over to the other node of the cluster, but it hung while starting there. I could not use the cluster manager to move it back to the original node; all the useful commands were grayed out.
OK. If I cannot move it with the cluster manager, what next? On the new node, the Services Manager said SQL Server was running, when it obviously was not, so I tried to stop it from that tool. I wanted SQL Server to be honestly running or honestly stopped, and I was not too fussy about which, so long as when it said it was running, it actually was running.
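For anyone who prefers a prompt to the Services tool, the same poke-and-stop can be scripted. This is only a rough sketch of the idea, not what I ran that night: the service name MSSQLSERVER is the default for an unnamed SQL Server instance and is my assumption, as is the choice to drive sc.exe and net.exe from Python.

```python
import subprocess

# Assumed service name for a default SQL Server instance; a named
# instance would look like "MSSQL$SOMENAME" instead.
SERVICE = "MSSQLSERVER"

def service_state(name):
    """Ask the Windows service control manager what it thinks the service is doing."""
    result = subprocess.run(["sc", "query", name], capture_output=True, text=True)
    for line in result.stdout.splitlines():
        if "STATE" in line:
            return line.strip()
    return "state not reported"

# What the tool claims, which, as that night showed, may not match reality.
print(service_state(SERVICE))

# Ask for a stop; /y answers "yes" to stopping dependent services.
# On a hung service this can take a while, and the cluster service then
# decides where (and whether) to bring SQL Server back up.
subprocess.run(["net", "stop", SERVICE, "/y"], check=False)

print(service_state(SERVICE))
```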
Stopping the service worked. It not only stopped the non-functional SQL Server on the alternate node of the cluster, but also made it fail back to the primary node and start properly there. This was great, but the Windows team still had to do their patching. So, after a little consultation, I used the cluster manager, now fully functional again, to stop the system on the primary node without letting it fail over. The other guys were then able to patch Windows and restart the system, so our offices could open normally this morning and the public would not have to hear: "I am sorry, our computers are down."
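That cluster-manager step, taking the clustered SQL Server offline on the primary without letting it fail over, can also be scripted, which is handy when the GUI is in a grumpy mood. Again, this is only a sketch under assumptions: the role name "SQL Server" is a placeholder, and it presumes the Failover Clustering PowerShell cmdlets (Get-ClusterGroup, Stop-ClusterGroup, Start-ClusterGroup) are available on the node; older clusters used cluster.exe for the same job.

```python
import subprocess

# Placeholder; the real clustered role name will differ from site to site.
GROUP = "SQL Server"

def ps(command):
    """Run one PowerShell command on the local node and return its output."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True,
    )
    return result.stdout

# Where does the role live right now, and what state is it in?
print(ps(f"Get-ClusterGroup -Name '{GROUP}' | Format-List Name, OwnerNode, State"))

# Take the role offline in place. Unlike killing the service underneath it,
# this tells the cluster the outage is deliberate, so it will not fail over.
ps(f"Stop-ClusterGroup -Name '{GROUP}'")

# ... patching happens here ...

# Bring the role back online on the same node once patching is done.
ps(f"Start-ClusterGroup -Name '{GROUP}'")
print(ps(f"Get-ClusterGroup -Name '{GROUP}' | Format-List Name, OwnerNode, State"))
```

The difference between the two sketches is really the whole lesson of the evening: stopping the service makes the cluster react to an apparent failure, while stopping the cluster role tells it to stand still.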
During all of this I had no idea what had caused the fail-over process to fail, completely undercutting our design for a high-availability system. But while staring at the configuration I noticed something unusual, something I had thought was impossible in a clustered system. I did not test it, though: the system was up and running, and I did not want to take a chance. I will discuss it with my colleagues and arrange a test when all the brains we can get are available and well rested.