in RAC by ACE (20,920 points)

1 Answer

by ACE (20,920 points)

Cause:
If the node reboot was initiated by one of the Oracle processes but the log files do not show any error, then the culprit is one of the oprocd, cssdmonitor, or cssdagent processes. This happens when the node hung for a while or when one or more critical CRS processes could not get scheduled for CPU. Because those processes run with real-time priority, the problem is likely due to memory starvation or low free memory rather than CPU starvation: the kernel was swapping pages heavily or was busy scanning memory to identify pages to free up. An OS scheduling bug is also possible.
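
To get a rough idea of whether a Linux node is swapping heavily, the swap counters in /proc/vmstat can be sampled twice and compared. The following is only a minimal sketch, not something taken from the Oracle notes: the function names and the 5 second interval are made up for illustration, and it assumes the kernel exposes the pswpin and pswpout counters.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat content ("name value" per line) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Return pages swapped in and out between two parsed samples."""
    return {
        key: after.get(key, 0) - before.get(key, 0)
        for key in ("pswpin", "pswpout")
    }


def sample_swap_activity(interval=5.0, path="/proc/vmstat"):
    """Read the counters twice, `interval` seconds apart (hypothetical helper)."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swap_delta(before, after)
```

Calling sample_swap_activity() on the node returns the number of pages swapped in and out during the interval. Values that stay high across several samples would be consistent with the memory starvation described above, but what counts as "high" depends on the system.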

 

Solution:
1) Set diagwait to 13 if the CRS version is 11.1 or lower.
2) If the platform is AIX, tune the AIX VM parameters as suggested in note 811293.1 (RAC and Oracle Clusterware Best Practices and Starter Kit (AIX)).
3) If the platform is Linux, set up hugepages and set the kernel parameter vm.min_free_kbytes to reserve 512MB. Setting up hugepages is probably the single most important thing to do on Linux.
Note that memory_target cannot be set when using hugepages.
4) If the platform is Linux and the kernel is 2.6.18 (i.e. OEL5, Redhat 5, SLES 10) or lower, set the kernel parameter swappiness to 100.
Note that there is no need to set swappiness to 100 on Linux kernel 2.6.32 (i.e. OEL6, Redhat 6, SLES 11) or higher.
5) Disable Transparent HugePages on SLES11, RHEL6, OEL6 and UEK2 kernels.
6) Check whether a large amount of memory is allocated to the IO buffer cache. Ask the OS vendor for ways to reduce the size of the IO buffer cache or to increase the rate at which memory is reclaimed from it.
7) Increase the amount of memory.
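
For the Linux items (3 and 5), the current state of a node can be checked with a small script. This is a rough sketch under assumptions, not an Oracle-supplied tool: the function names are hypothetical, 512MB is taken as 524288 kB, and the Transparent HugePages file is assumed to be /sys/kernel/mm/transparent_hugepage/enabled (RHEL6 kernels use /sys/kernel/mm/redhat_transparent_hugepage/enabled instead).

```python
MIN_FREE_KB = 512 * 1024  # 512MB expressed in kB, as vm.min_free_kbytes expects


def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: int}; sizes are in kB."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def active_thp_mode(text):
    """Return the bracketed mode from e.g. 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def review_settings(meminfo_text, min_free_kbytes, thp_text=None):
    """Return a list of findings; thp_text=None means the kernel has no THP file."""
    findings = []
    if parse_meminfo(meminfo_text).get("HugePages_Total", 0) == 0:
        findings.append("hugepages are not configured")
    if min_free_kbytes < MIN_FREE_KB:
        findings.append("vm.min_free_kbytes is below %d" % MIN_FREE_KB)
    if thp_text is not None and active_thp_mode(thp_text) != "never":
        findings.append("Transparent HugePages are not disabled")
    return findings


def read_text(path):
    """Read a whole file, e.g. an entry under /proc or /sys."""
    with open(path) as f:
        return f.read()


def review_this_node(thp_path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Run the checks against the live node (the THP path is an assumption)."""
    try:
        thp_text = read_text(thp_path)
    except FileNotFoundError:
        thp_text = None
    return review_settings(
        read_text("/proc/meminfo"),
        int(read_text("/proc/sys/vm/min_free_kbytes")),
        thp_text,
    )
```

An empty list from review_this_node() only means these three settings look as recommended. It does not check swappiness, since that recommendation depends on the kernel version, and it does not verify that the hugepages are sized correctly for the SGA.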

...