It was just 32 minutes before my on-call shift was practically done when I got paged about a critical alert from the primary node of a RAC. It was hard to describe how I felt, but there was not much time to think about it.
Rushing down to my office and getting connected to the company's network, I found right away that tons of emails had poured into my inbox, all about the node being inaccessible.
Then came pages, conference calls, and more high-priority emails: everyone involved was trying to find out "what is going on?"
My first reaction, of course, was to check the node itself, only to find that it was actually down, which explained the scheduled batch process failures and the user panic. The on-call sysadmin confirmed the node was down shortly after.
Right afterwards, checking from the second node, along with checks from the other app servers, revealed that the databases were actually intact and still in service on node 2, so the business could carry on. My email confirming this brought relief to many in the groups.
A bigger effort will be made by the end of the day to bring node 1 back, but I know for sure that many people were driven wild this morning!