We hit a known issue where ASM would not come up on the second node, and CRS and the other resources subsequently failed to start.
Review the Grid alert log and OS logs:
- $GRID_HOME/log/<nodename>/alert<nodename>.log
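For example, a quick look at the tail of the Grid alert log (replace <nodename> with the actual host name; this is just an illustration):
$ tail -100 $GRID_HOME/log/<nodename>/alert<nodename>.log
$ grep -i haip $GRID_HOME/log/<nodename>/alert<nodename>.log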
oifcfg shows:
$ oifcfg getif
eth0 192.168.2.10 global cluster_interconnect
eth1 192.168.10.2 global
usb0 169.254.95.0
eth0:2 169.254.96.0
eth0:3 169.254.95.0
From 11gR2 (11.2.0.2 onwards, I believe) there is a cluster resource called HAIP which manages high availability for the cluster interconnects. Prior to 11gR2, if the cluster interconnect went down there would be a hang or node evictions, depending on the situation. From 11gR2 onwards we can specify up to four (as far as I know) cluster interconnects for a cluster, which the clusterware manages internally with these non-routable IPs. Essentially, even if one of the physical interfaces is offline, private interconnect traffic can be routed through the other available physical interfaces. This gives a highly available architecture for private interconnect traffic.
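As a quick check (assuming the usual resource name, ora.cluster_interconnect.haip; output layout varies by version), the HAIP resource status can be viewed from the Grid home:
$ crsctl stat res ora.cluster_interconnect.haip -init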
Nice explanation from Riyaz’s Note:-
HAIP, High Availability IP, is the Oracle-based solution for load balancing and failover of private interconnect traffic. Typically, host-based solutions such as bonding (Linux) or trunking (Solaris) are used to implement high availability for private interconnect traffic, but HAIP is an Oracle solution for high availability. During the initial start of the clusterware, a non-routable IP address is plumbed on the private subnet specified. That non-routable IP is used by the clusterware and the database for private interconnect traffic.
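For instance, an additional private interface can be registered for the interconnect with oifcfg so HAIP can use it (the interface name and subnet below are just placeholders, not taken from this cluster):
$ oifcfg setif -global eth2/192.168.11.0:cluster_interconnect
$ oifcfg getif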
Now back to the issue:-
As you can see, usb0 has picked up an address in the 169.254.x.x range (a non-routable IP range internal to the OS). Clusterware gets confused by this and is unable to start the CRS resources properly.
Clusterware picked two IP addresses on the 169.254.x.x subnet on the eth0 private interface, as shown below. These two IP addresses are used by the clusterware and the RAC database for private interconnect traffic.
$ ifconfig -a
...
eth0:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 169.254.95.95 netmask ffff8000 broadcast 169.254.95.255
eth0:3: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 169.254.95.96 netmask ffff8000 broadcast 169.254.95.255
…
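A quick way to see which interfaces and subnets the clusterware can see (and to spot usb0 sitting on the same link-local range) is oifcfg iflist; the -p -n flags, where supported, also print the subnet type and netmask:
$ oifcfg iflist -p -n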
A review of the database shows that these two IP addresses are used for the private interconnect on node 1.
SQL> select * from gv$cluster_interconnects
INST_ID NAME    IP_ADDRESS      IS_ SOURCE
------- ------- --------------- --- ------
      1 eth0:3  169.254.95.95   NO
      1 eth0:2  169.254.95.96   NO
Solution:-
For now, we have established that an intermediate device such as usb0 (USB printers are also possible) enabled on the server is causing this issue. We have asked the Unix team to disable it, and have further requested that they disable it at the BIOS level so the same problem does not recur after a reboot.
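For reference, on a RHEL-style system the interface can typically be taken down and kept from coming back roughly like this; treat it as a sketch, since the exact steps depend on the distribution and the IMM firmware:
$ ifdown usb0
# then set ONBOOT=no in /etc/sysconfig/network-scripts/ifcfg-usb0 so the interface stays down after a network restart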
Hope this helps!!!!
Hi,
The usb0 device is provided by the server's IMM on the OS side, and this interface is configured to use a DHCP server. Since no link and no DHCP server are available, an address in the 169.254.0.0/16 subnet is picked up, as stated in the RFC. At this point, isn't the problem in fact that Oracle doesn't respect the terms of RFC 3927? For me, it's an Oracle bug…
Hi,
I am not too strong on OS/hardware, so what you said may well be correct. But as far as the 11gR2 RAC cluster interconnect is concerned, this range (169.254.*.*) is used to provide high availability: if anything (any device) is using that route (i.e. 169.254.*.*), it will be reached by the clusterware while checking cluster integrity, and since such devices cannot respond in the way the Oracle cluster expects, the nodes get evicted or fail to start. Basically, the cluster is confused into seeing these devices (usb) as network devices (perhaps Oracle does not handle this correctly, or it is a bug, as you said).
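As a quick sanity check (just an illustration), something like the following shows whether any unexpected interface is holding a 169.254.x.x address before the clusterware is started:
$ ifconfig -a | grep 169.254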
-Thanks
Geek DBA
Hi,
I had an error 481 today, but on my two cluster nodes the usb network device was already disabled.
Grid Infrastructure was running on node 1 but couldn’t start on node 2 after reboot of node 2.
In Oracle KB (Document ID 1383737.1) I found the solution.
Node 1 (yes Node ONE, not TWO) had no route to 169.254.0.0/16
Node 2 had the correct route.
Adding the route on Node 1 with “route add -net 169.254.0.0 netmask 255.255.0.0 dev bond1” saved my Friday 🙂
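To keep that route across reboots on a RHEL-style box (assuming that kind of setup), one option is an interface route file such as /etc/sysconfig/network-scripts/route-bond1 containing:
169.254.0.0/16 dev bond1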
Cheers