Revival of the PTE

 

Of the many things that kept me as busy as a beaver over the past two weeks, recovering the PTE took most of my attention almost every day. Today the PTE finally made a strong comeback and is standing steadily on its feet again.

 

And that deserves a page or two on my blog!

 

The PTE got struck down

 

On August 24 the cluster service of the PTE broke when its quorum disk (a SAN disk volume) was lost, some two weeks after the PTE’s storage was migrated from the old Hitachi SAN to another SAN, a move driven by the company’s initiative to cut its spending on storage units. Two days later three more disk volumes were lost, which left neither of the PTE’s two nodes with a database engine and wiped out all the databases together with all their data. The PTE became a breathless, empty body!

 

Where did the attack come from?

 

The direct cause of the disk failures was later identified as a failure of Symantec’s Volume Manager on the PTE. The volume manager acts as an agent between the SAN and the server’s OS (operating system, such as Windows or Linux) so that all the applications on the server (database RDBMS, email, file/directory systems, etc.) can access the SAN disk resources attached to the server as if they were local disks.

 

But why the volume manager itself failed remains unanswered. Could it be a bug? Not sure, as out of its thousands of clients, no one else had reported a similar issue before us. A human mistake? Maybe, maybe not, as there is no trace file that logged the migration process.

 

Options to redeem the PTE

 

As the core database components of the PTE had been lost for good, rebuilding was the only way to get the PTE back in service. There were, however, two possible ways to rebuild it: rebuilding the whole cluster from the OS level, or rebuilding the database part only. Each has its pros and cons, as shown in the table.

 

Build method: Completely clean rebuild
Pros:
– No registry residuals from the previous build – least chance of registry corruption;
– Clean binaries installed – run faster;
– Better disk performance after formatting the system disks.
Cons:
– 2 teams involved;
– Longer time to complete the tasks;
– Tight deadline from the client.

Build method: Database rebuild
Pros:
– Quicker solution;
– 1 team can do the rebuilding;
– Backups for a quicker recovery even if a crash happens again.
Cons:
– Extra time to remove as much of the installed code as possible, which may not always be viable;
– Chance of performance deterioration;
– Chance of registry corruption, which in turn may cause a crash again.

  

I dislike the Windows registry. It does provide a central database that keeps all the settings and configurations for the Windows OS and its applications (Word, IE, you name it), and hopefully makes things run faster. But its maintenance is horrible, particularly on computers where installs and uninstalls take place often: it can easily become very messy and make performance unacceptably slow. Linux and Unix do much better at maintaining applications, as they both manage all applications as “files”.

 

And the manager’s pick on August 26 was the second method: rebuilding the database only. He was betting on me.

 

The endeavor of rebuilding the database part

 

In summary, the following tasks were carried out successfully; any failure would have aborted the effort:

 
– Remove all previously installed components;
– Clean DNS server entry for the database instances so that new ones can be created again;
– Perform the cluster installation for 1 instance only, then test its durability and responses through a few failovers until satisfactory (can log into one node only!);
– Perform the cluster installation for the second instance, and do the same test;
– Perform error checks and connectivity checks for both instances;
– Patch the two instances to match the vendor’s certified level (need to compare with another environment);
– Perform a complete error check, failover check and connectivity check
– Restore the databases
– Configure the servers and database options, including memory, process parallelism, security, recovery model, default file locations, tempdb
– Recover the maintenance jobs by restoring the msdb database for each instance
– User connection test
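
The "any failure aborts the effort" rule in the list above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used on the PTE: the step names are shortened from the list, and `run_checklist` and the lambda checks are hypothetical placeholders standing in for what were manual DBA tasks.

```python
from typing import Callable, List, Optional, Tuple

# Each step is a (description, check) pair; check() returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_checklist(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first failure.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for name, check in steps:
        try:
            ok = check()
        except Exception:
            ok = False  # an exception counts as a failed step
        if not ok:
            return completed, name  # abort: later steps are never run
        completed.append(name)
    return completed, None


# Hypothetical placeholder checks; one failure is simulated on purpose.
steps: List[Step] = [
    ("remove old components", lambda: True),
    ("clean DNS entries", lambda: True),
    ("install instance 1 and test failover", lambda: True),
    ("install instance 2 and test failover", lambda: False),
    ("restore databases", lambda: True),
]

done, failed = run_checklist(steps)
print("completed:", done)
print("aborted at:", failed)
```

The point of the ordering is that a later step such as restoring the databases is never attempted on top of an instance that has not passed its failover test.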
 
It took me 3 days altogether to finish all the tasks mentioned above. By that time the database part was fully functioning as the back end of the application, and the PTE was getting its breath back.
 
Engaging the vendor
 
After the database recovery, the vendor’s assistance was a must to connect the database instances to the application servers and let the application fly. This part surprisingly turned out to be the most time-consuming. Due to some unforeseen personnel changes at both TMX and the vendor, the communication channels among the technical team, the clients and the vendor were stuck, and far more emails than necessary went back and forth for the team to get even trivial things clarified or fixed.
 
Quite a few times, I had to call the vendor’s developers directly to get the process moving forward. And that went on for almost ten days!
 
Finally, the PTE revives today. 
 
And I can wrap it up and go home in peace and at ease!
 
Another success story to add into my resume — haha. 

4 Responses to Revival of the PTE

  1. Eve says:

    "Finally, the PTE revives today": Congratulations!!! Take a rest and have a good weekend.

  2. 中原古风 says:

    Thanks, Eve. For the first time in months, I had a few shots on the tennis court with my boys today. ^_^

  3. 木钉 says:

    Too technical! What’s PTE short for?

  4. 中原古风 says:

    I did try to make it easier, but it seems I failed. Sorry, 木钉, PTE is an internal code and I cannot say what it stands for here. It is not "Public Trading Engine" for sure.. 🙂
