Quiz Post #8: How does RAC instance failure/membership detection happen in Clusterware/RAC?

 

We all know that CGS (OCSSD when 10g CRS is used, or CM when third-party clusterware is used) provides the Cluster Group Service, which manages node integrity and restarts nodes when node-level issues (hardware, network, OS, scheduling) occur. In other words, CGS handles node-level issues and also manages cluster group membership.

But how do communication and membership between RAC instances work, and how are instance failures detected?

To understand this, remember that CM works at the cluster level, not the instance level. To cover the instance level, the other important aspect managed by CGS is Instance Membership Recovery (IMR), which takes place when there is a communication failure between RAC instances.

The node manager (NM) in RAC instances provides information about nodes and their health by registering and communicating with the CM. The NM service is provided by the LMON process.

Now take a closer look at how a RAC instance registers and communicates with CGS.

Registering:- Whenever an instance is mounted, the LMON process registers its status with NM, and at the cluster level the instance is marked as UP. A bitmap is stored in the GRD of the instance (0 means the node is dead, 1 means the node is alive). Every time a node (I really mean an instance here) joins or leaves the cluster, this bitmap is updated accordingly and propagated to the other instances.
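To make the bitmap idea concrete, here is a minimal Python sketch of a membership bitmap where bit 1 means alive and bit 0 means dead. It is only a toy model: the class name `MembershipBitmap` and its methods are hypothetical names for illustration and do not correspond to any Oracle structure or API.

```python
class MembershipBitmap:
    """Toy model of an instance membership bitmap (1 = alive, 0 = dead).

    Hypothetical illustration only; this is not how Oracle stores the
    bitmap internally.
    """

    def __init__(self, max_instances):
        self.max_instances = max_instances
        self.bits = 0  # every instance starts out marked as dead

    def _check(self, instance_id):
        if not 0 <= instance_id < self.max_instances:
            raise ValueError(f"instance id {instance_id} out of range")

    def join(self, instance_id):
        """Mark an instance as alive when it registers."""
        self._check(instance_id)
        self.bits |= 1 << instance_id

    def leave(self, instance_id):
        """Mark an instance as dead when it leaves or fails."""
        self._check(instance_id)
        self.bits &= ~(1 << instance_id)

    def is_alive(self, instance_id):
        self._check(instance_id)
        return bool((self.bits >> instance_id) & 1)

    def alive_members(self):
        """Return the ids of all instances currently marked alive."""
        return [i for i in range(self.max_instances) if self.is_alive(i)]
```

For example, after `join(0)` and `join(2)` on a four-slot bitmap, `alive_members()` returns `[0, 2]`; after `leave(0)` it returns `[2]`. In a real cluster the updated bitmap would also have to be propagated to the other instances, which this sketch leaves out.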

Communicating:- (specifically, instance death detection) CGS is responsible for checking whether all members (here, instances) are valid. To determine whether all members are alive, a voting mechanism is used. But where do the votes come from? Each instance's CKPT process updates the control file every three seconds with its status, in an operation known as the heartbeat. Each instance has its own block in the control file, called the checkpoint progress record, and writes its heartbeat to that block. After some time, CGS (NM) tallies the votes (these blocks, which are completely different from the votes in the voting disk) before allowing the GES/GCS reconfiguration to proceed. If an instance has failed, CGS marks the bitmap in the GRD accordingly, thereby providing I/O fencing and flushing the pending I/O to disk for the failed instance.
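The heartbeat-and-vote idea can be sketched in a few lines of Python. This is a simplified model, not Oracle's implementation: the control file is represented by a plain dictionary, the function names are hypothetical, and `MISSED_BEATS_ALLOWED` is an assumed threshold chosen for illustration. Only the three-second interval comes from the description above.

```python
HEARTBEAT_INTERVAL = 3.0   # seconds between heartbeats, as described above
MISSED_BEATS_ALLOWED = 10  # hypothetical threshold, not an Oracle setting


def record_heartbeat(control_file, instance_id, now):
    """Simulate an instance writing its heartbeat to its own slot."""
    control_file[instance_id] = now


def tally_votes(control_file, now,
                interval=HEARTBEAT_INTERVAL,
                missed=MISSED_BEATS_ALLOWED):
    """Split instances into (alive, failed) based on heartbeat age.

    An instance counts as alive if its last heartbeat is no older than
    interval * missed seconds; otherwise it is treated as failed.
    """
    deadline = interval * missed
    alive, failed = [], []
    for instance_id in sorted(control_file):
        age = now - control_file[instance_id]
        if age <= deadline:
            alive.append(instance_id)
        else:
            failed.append(instance_id)
    return alive, failed
```

As a usage example, if instances 1 and 2 both write a heartbeat at time 100.0 and only instance 1 writes again at 127.0, then `tally_votes(control_file, 131.0)` places instance 1 in the alive list and instance 2 in the failed list, because instance 2's last heartbeat is 31 seconds old and the assumed deadline is 30 seconds. In the real system, that result would then drive the bitmap update and the reconfiguration.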

So the OCSSD process (CGS) handles the membership of RAC instances in terms of IMR, LMON is responsible for registering with CGS, and the control file heartbeats (votes), not the voting disk votes, determine instance health.

It might sound confusing, and yes, it is.

Reference:- http://www.amazon.co.uk/Oracle-Database-Application-Clusters-Handbook/dp/0071752625

-Thanks

Geek DBA
