An Oracle 10GR2 RAC Startup Failure

1. Background
Due to multiple hardware failures on the second node of a production RAC, high availability of the databases on the RAC was lost, a situation no RAC user wants to be in. TMX IT therefore called in a technician from the hardware vendor today (Saturday) to replace all the defective parts and get the node ready for service.
Once the hardware was fixed and the server rebooted, the on-call DBA (guess who that was… me this time) was contacted to shut down all the Oracle RAC resources and then bring them all back up, so that the databases would start from a “clean” state. Some routine health checks would follow before handing the RAC back to the applications that use its databases as their back ends.

2. The symptoms
An error popped up when I was trying to bring up the nodeapps on the second node, after “successfully” shutting down all the resources:
$srvctl stop database -d edb
$srvctl stop database -d vdb
$srvctl stop asm -n node1
$srvctl stop asm -n node2
$srvctl stop nodeapps -n node1
$srvctl stop nodeapps -n node2
At this point, each and every time in the past, the RAC had been entirely down and I could start bringing it back. Because of that, I skipped over the step of verifying that all the resources were actually “down” and moved directly to the next step, bringing the RAC back up.
$srvctl start nodeapps -n node1   (worked)
$srvctl start nodeapps -n node2
A CRS error about the listener (…) was displayed, which shocked me!

3. Checking for the cause
As the error was related to the public listener, the listener was checked first to make sure it was up and running:
$ps -ef |grep LISTENER   (both the local and management listeners were running)
I tried shutting them down and bringing them back up, but the issue persisted.
The second thing to check was the RAC status, to see whether all the resources were in the right state: OFFLINE.
$crsstat -t
HA Resource                          Target      State
-----------                          ------      -----
ora.etcash.eint2.eint2.srv           ONLINE      OFFLINE on node2
ora.node2.ASM2.asm                   ONLINE      OFFLINE on node2
ora.vdb.vint2.cs                     ONLINE      OFFLINE on node2

Surprisingly, these 3 resources, all on node2, were actually targeted as ONLINE, even though they had been shut down!
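Since the status listing is plain text, the check I skipped can be scripted. What follows is only a minimal sketch, not something that was run that day: it assumes the three-column layout shown above (resource name, Target, State), and the function name targets_not_offline is made up for illustration.

```shell
# Print every resource whose Target column is not OFFLINE.
# Reads status lines on stdin in the assumed form:
#   <resource> <target> <state> [on <node>]
# Header and separator lines are skipped because their first
# field does not start with "ora.".
targets_not_offline() {
    awk '$1 ~ /^ora\./ && $2 != "OFFLINE" { print $1 }'
}
```

Piping the status output through it after a shutdown, for example crsstat -t | targets_not_offline, should print nothing if every resource was really told to go offline. On that Saturday it would have printed the three node2 resources listed above.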
What was wrong with node2, then? Was the CRS running right?
$ps -ef|grep crs
$ps -ef|grep d.bin
Instead of the two root processes and two oracle processes, only one root process was found running on node2. The root cause of the issue: the CRS was not working properly!
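Counting those processes by eye is easy to get wrong under pressure, so the same check can be sketched as a small filter over ps -ef output. This is an illustration only: the function name count_dbin_by_owner is hypothetical, and the expected counts (two for root, two for oracle) are simply what I was used to seeing on this cluster, not a general rule.

```shell
# Count processes whose command line contains "d.bin", grouped by owner.
# Reads `ps -ef` output on stdin and prints "<owner> <count>" lines,
# sorted by owner. The ps header line and the grep process itself
# are ignored.
count_dbin_by_owner() {
    awk '$1 != "UID" && /d\.bin/ && !/grep/ { n[$1]++ }
         END { for (u in n) print u, n[u] }' | sort
}
```

Used as ps -ef | count_dbin_by_owner, a node in the state described above would show a single line for root and no line at all for oracle.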

4. Workaround
To ensure the RAC started from a clean state, instead of stopping and restarting the CRS (run as root: /etc/init.d/ stop and then /etc/init.d/ start), the Linux admin was paged to reboot the server and then execute, as root, the command:
#/etc/init.d/ start
That was where things started to get back on track.
Once the CRS was confirmed to have started properly, by checking the number of processes running, the crs_stat -t command again reported that the databases were up and running on node2. The procedure resumed as:
          shut down all resources
          verify all resources really are down
          start all the resources
          use crs_stat -t to verify all resources are up and running on the right nodes
          start emagent
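The first three steps above can be sketched as a script that refuses to start anything until the shutdown has been verified. Treat it as a rough outline under several assumptions: RUN and STATUS_CMD are hypothetical knobs of this sketch (not Oracle settings), the status command is assumed to print the three-column layout shown earlier, the start order is simply the reverse of the stop order, and the emagent step is left out.

```shell
# RUN defaults to "echo", so the srvctl commands are only printed;
# set RUN="" to really execute them.
RUN=${RUN-echo}
# Assumed to print lines of the form: <resource> <target> <state> [on <node>]
STATUS_CMD=${STATUS_CMD-"crsstat -t"}

stop_all() {
    $RUN srvctl stop database -d edb
    $RUN srvctl stop database -d vdb
    $RUN srvctl stop asm -n node1
    $RUN srvctl stop asm -n node2
    $RUN srvctl stop nodeapps -n node1
    $RUN srvctl stop nodeapps -n node2
}

# Reverse of the stop order (an assumption of this sketch).
start_all() {
    $RUN srvctl start nodeapps -n node1
    $RUN srvctl start nodeapps -n node2
    $RUN srvctl start asm -n node1
    $RUN srvctl start asm -n node2
    $RUN srvctl start database -d edb
    $RUN srvctl start database -d vdb
}

# Exit 0 only if at least one ora.* line was read and every such
# line has both Target and State equal to OFFLINE.
all_offline() {
    awk '$1 ~ /^ora\./ {
             seen = 1
             if ($2 != "OFFLINE" || $3 != "OFFLINE") bad = 1
         }
         END { exit (bad || !seen) }'
}

restart_rac() {
    stop_all
    if ! $STATUS_CMD | all_offline; then
        echo "some resources are not OFFLINE; not starting anything" >&2
        return 1
    fi
    start_all
}
```

Requiring at least one resource line in all_offline is deliberate: if the status command fails and prints nothing, the gate stays closed instead of passing by accident.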
A routine verification procedure showed everything in good shape, and the RAC came back into full service again!
