in Troubleshooting by ACE (20,920 points)

1 Answer

by ACE (20,920 points)

Though not directly related to stability, OS Watcher and Cluster Health Monitor are invaluable tools for determining the state of the OS and the likely root cause of many problems that lead to node or instance evictions. Having the proper data available after the first occurrence of a problem shortens the time needed to determine its cause, which in turn helps prevent future outages. Most third-party data gathering tools of this type have collection intervals that are too long (5 minutes or longer), are difficult to interpret, or do not collect the proper data. OS Watcher is a very simple, lightweight tool that gathers basic OS information every 30 seconds by default. Cluster Health Monitor, though not available on all platforms, complements OS Watcher by collecting data in real time at a more granular level. It is crucial that one or both of these utilities run on all cluster nodes at all times to enable faster diagnosis and debugging of issues.
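
To illustrate why the collection interval matters, here is a minimal Python sketch that counts how many samples a collector would capture during a short-lived problem. The function name `samples_in_window` and the example timings are hypothetical and chosen only for illustration. This is not how OS Watcher or Cluster Health Monitor are implemented, and it assumes samples are taken at exact multiples of the interval.

```python
def samples_in_window(start, duration, interval):
    """Count samples taken at t = 0, interval, 2*interval, ... (seconds)
    that fall inside the half-open window [start, start + duration)."""
    end = start + duration
    # First sample time at or after the start of the window (ceiling division).
    first = -(-start // interval) * interval
    # range() is empty when first >= end, so a missed window yields 0.
    return len(range(first, end, interval))


# Hypothetical event: a 90-second resource spike beginning 100 seconds in.
spike_start, spike_duration = 100, 90

print(samples_in_window(spike_start, spike_duration, 30))   # 30-second interval
print(samples_in_window(spike_start, spike_duration, 300))  # 5-minute interval
```

Under these assumed timings, the 30-second collector records three samples during the spike (at 120, 150, and 180 seconds), while the 5-minute collector records none, so the event would leave no trace in its data.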

...