Quiz Post #8: How does RAC instance failure/membership detection happen in Clusterware/RAC?

We all know that CGS (OCSSD when 10g CRS is used, or the CM when third-party clusterware is used) provides the cluster group service, which manages node integrity and restarts nodes when node-level issues (hardware, network, OS, scheduling) occur. In other words, CGS handles node-level issues and also manages cluster group membership.

But how do communication and membership between RAC instances work, and how are instance failures detected?

To understand this, remember that the CM works at the cluster level, not the instance level. To cover the instance level, the other important aspect managed by CGS is Instance Membership Recovery (IMR), which takes place when there is a communication failure between RAC instances.

The node manager (NM) in a RAC instance provides information about nodes and their health by registering and communicating with the CM. The NM service is provided by the LMON process.

Now take a closer look at how a RAC instance registers and communicates with CGS.

Registering:- Whenever an instance is mounted, the LMON process registers its status with the NM, and at the cluster level the instance is marked as UP. A bitmap is stored in the GRD of the instance (0 means the node is dead, 1 means the node is alive). Every time a node (I really mean an instance here) joins or leaves the cluster, this bitmap is marked accordingly and the update is propagated to the other instances.
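To make the bitmap idea concrete, here is a minimal Python sketch of a membership bitmap where bit N is 1 while instance N is alive and 0 once it has left. This is only a toy model for illustration: the class name `MembershipBitmap` and its methods are made up for this post and do not correspond to any Oracle structure or API.

```python
class MembershipBitmap:
    """Toy membership bitmap: bit N is 1 if instance N is alive, 0 if dead.

    Hypothetical illustration only; not how the GRD is actually laid out.
    """

    def __init__(self, max_instances: int) -> None:
        self.max_instances = max_instances
        self.bits = 0  # all instances start as dead (0)

    def _check(self, instance_id: int) -> None:
        if not 0 <= instance_id < self.max_instances:
            raise ValueError(f"instance id {instance_id} out of range")

    def join(self, instance_id: int) -> None:
        """Mark an instance as alive (set its bit to 1)."""
        self._check(instance_id)
        self.bits |= 1 << instance_id

    def leave(self, instance_id: int) -> None:
        """Mark an instance as dead (clear its bit to 0)."""
        self._check(instance_id)
        self.bits &= ~(1 << instance_id)

    def is_alive(self, instance_id: int) -> bool:
        self._check(instance_id)
        return bool((self.bits >> instance_id) & 1)

    def as_string(self) -> str:
        """Render the bitmap with instance 0 as the leftmost character."""
        return format(self.bits, f"0{self.max_instances}b")[::-1]
```

For example, after `join(0)` and `join(2)` on a four-slot bitmap, `as_string()` returns `"1010"`; after `leave(0)` it returns `"0010"`. In the real cluster the updated map would also have to be sent to the surviving instances, which this sketch does not attempt.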

Communicating:- (talking specifically about instance death detection) CGS is responsible for checking whether all members (here, instances) are valid. To determine whether all members are alive, a voting mechanism is used. But where do the votes come from? Each instance's CKPT process updates the control file every three seconds with its status, in an operation known as the heartbeat. Each instance writes to its own block in the control file, called the checkpoint progress record, so every instance has a dedicated block there. After some time, CGS (NM) evaluates the votes (these blocks, which are completely different from the votes in the voting disk). If an instance has failed, CGS marks the bitmap in the GRD accordingly before allowing the GES/GCS reconfiguration to proceed, thereby providing I/O fencing and flushing the pending I/O of the failed instance to disk.
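The heartbeat check described above can be sketched roughly as follows. This is a simplified Python model, not Oracle's implementation: each instance records the time of its last heartbeat in its own slot, and a checker treats an instance as failed when that slot has gone stale. The three-second interval comes from the text; the `MISSED_BEATS_ALLOWED` threshold and the `HeartbeatTable` class are assumptions made up for the example.

```python
HEARTBEAT_INTERVAL = 3.0   # seconds between heartbeats, as stated in the post
MISSED_BEATS_ALLOWED = 3   # hypothetical threshold, not an Oracle parameter


class HeartbeatTable:
    """Toy stand-in for the per-instance heartbeat blocks in the control file."""

    def __init__(self) -> None:
        # instance id -> time (in seconds) of the last heartbeat written
        self.last_beat: dict[int, float] = {}

    def beat(self, instance_id: int, now: float) -> None:
        """Record a heartbeat for one instance in its own slot."""
        self.last_beat[instance_id] = now

    def failed_instances(self, now: float) -> list[int]:
        """Return ids whose last heartbeat is older than the allowed window."""
        limit = HEARTBEAT_INTERVAL * MISSED_BEATS_ALLOWED
        return sorted(
            instance_id
            for instance_id, last in self.last_beat.items()
            if now - last > limit
        )
```

If instances 1 and 2 both beat at time 0 and only instance 1 keeps beating every three seconds, `failed_instances(10.0)` reports instance 2, because its slot is 10 seconds old against a 9 second window. Passing the time in explicitly keeps the sketch deterministic; a real checker would read a clock.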

So the OCSSD process (CGS) handles the membership of the RAC instances in terms of IMR, LMON is responsible for registering with CGS, and the control file heartbeats (the votes, not the voting disk votes) determine instance health.

It might sound confusing, and yes, it is.

Reference:- http://www.amazon.co.uk/Oracle-Database-Application-Clusters-Handbook/dp/0071752625

-Thanks

Geek DBA

All about Database Administration, Tips & Tricks
