Because of multiple hardware failures on the second node of a production RAC, the databases on the RAC lost high availability, a situation no RAC user wants to be in. TMX IT therefore called in a technician from the hardware vendor today (Saturday) to replace all the defective parts and get the node ready for service.
Once the hardware was fixed and the server rebooted, the on-call DBA (guess who that was… me this time) was contacted to shut down all the Oracle RAC resources and then bring them all back, so that the databases would start from a “clean” state. Some routine health checks would follow before handing the RAC back to the applications that use its databases as their backends.
2. The symptoms
An error popped up when I tried to bring up the nodeapps on the second node, after “successfully” shutting down all the resources:
$srvctl stop database -d edb
$srvctl stop database -d vdb
$srvctl stop asm -n node1
$srvctl stop asm -n node2
$srvctl stop nodeapps -n node1
$srvctl stop nodeapps -n node2
At this point, every single time in the past, the RAC had been entirely down and I could start bringing it back. Because of this, I skipped the step of verifying that all the resources were actually “down” and moved directly to the next step of bringing the RAC back up.
$srvctl start nodeapps -n node1 (worked)
$srvctl start nodeapps -n node2
A CRS error came back: the listener …. got misdisplacement, which shocked me!
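The verification step that was skipped here could be scripted along these lines. This is only a sketch: the function name resources_not_down is made up for illustration, and it assumes the status listing has one resource per line, starting with "ora.", with ONLINE or OFFLINE as separate words.

```shell
#!/bin/sh
# Hypothetical helper (not part of Oracle's tooling): reads a CRS status
# listing on stdin and prints the name of every resource line that still
# carries the word ONLINE in any column after the name (Target or State).
resources_not_down() {
    awk '/^ora\./ {
        for (i = 2; i <= NF; i++) {
            if ($i == "ONLINE") { print $1; break }
        }
    }'
}

# Possible usage after the stop commands:
#   crs_stat -t | resources_not_down
# No output would mean nothing is still targeted or running ONLINE.
```

Any name it prints is a reason to stop and investigate before starting anything back up.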
3. Check through for the cause
As the error was related to the public listener, the listener was checked first to make sure it was up and running:
$ps -ef | grep LISTENER (both the local and management listeners were running)
I tried shutting them down and bringing them back up, but the issue persisted.
The second thing to check was the RAC status, to see whether all the resources were in the expected state: OFFLINE.
HA Resource                  Target   State
-----------                  ------   -----
ora.etcash.eint2.eint2.srv   ONLINE   OFFLINE on node2
ora.node2.ASM2.asm           ONLINE   OFFLINE on node2
ora.vdb.vint2.cs             ONLINE   OFFLINE on node2
Surprisingly, the 3 resources listed above, all on node2, were actually targeted as ONLINE, even though they had been shut down!
What, then, was wrong with node2? Was the CRS running right?
$ps -ef | grep crs
$ps -ef | grep d.bin
Instead of the two root processes and two oracle processes, only one root process was found running on node2. The root cause of the issue: the CRS was not working properly!
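The process count can be tallied per owner instead of being read off by eye. The sketch below assumes the usual Linux `ps -ef` layout, where the owner is the first column and the command path is the eighth; the function name count_crs_daemons is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `ps -ef` output on stdin and prints, for each
# owner, how many processes have a command path ending in "d.bin".
# Matching only the command column keeps the "grep d.bin" process itself
# out of the count.
count_crs_daemons() {
    awk '$8 ~ /d\.bin$/ { n[$1]++ }
         END { for (u in n) print u, n[u] }' | sort
}

# Possible usage:
#   ps -ef | count_crs_daemons
```

On a healthy node in this setup, the result would be expected to show two processes for root and two for oracle, as described above.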
To ensure the RAC started from a clean state, instead of stopping and restarting the CRS (run as root: /etc/init.d/init.crs stop and then /etc/init.d/init.crs start), the Linux admin was paged to reboot the server and then to execute as root the command:
That was where things started to get back on track.
Once the CRS had started properly, confirmed by checking the number of processes running, the crs_stat -t command reported that the databases were up and running again on node2. The procedure resumed as:
– shut down all resources
– verify all resources really are down
– start all the resources
– use crs_stat -t to verify all resources are up and running on the right nodes
– start emagent
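The stop and start steps of this procedure can be sketched as a small script. The stop order is the one used earlier in this post; the start order (nodeapps, then ASM, then databases) is an assumption, since only the nodeapps start is shown above. The names run, stop_all, start_all, DRY_RUN, DATABASES and NODES are made up for this sketch, and the verification and emagent steps are deliberately left out.

```shell
#!/bin/sh
# Database and node names as used in this post; override them as needed.
DATABASES=${DATABASES:-"edb vdb"}
NODES=${NODES:-"node1 node2"}

# Print the command (dry run, the default) or execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop in the order used in the post: databases, ASM, nodeapps.
stop_all() {
    for db in $DATABASES; do run srvctl stop database -d "$db"; done
    for n in $NODES; do run srvctl stop asm -n "$n"; done
    for n in $NODES; do run srvctl stop nodeapps -n "$n"; done
}

# Start in the reverse order (assumed): nodeapps, ASM, databases.
start_all() {
    for n in $NODES; do run srvctl start nodeapps -n "$n"; done
    for n in $NODES; do run srvctl start asm -n "$n"; done
    for db in $DATABASES; do run srvctl start database -d "$db"; done
}
```

Leaving DRY_RUN at its default only prints the commands, which makes it easy to review the sequence before running anything against a production RAC.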
A routine verification procedure showed everything in good shape, and the RAC came back into full service again!