Don't forget self-monitoring
Photo: Håkan Dahlström

Betty Neuman’s recent blog post highlights an important issue that many BPPM customers have been facing: gaps in the monitored data. This frustrates customers who rely on the data for reporting, because they don’t know the data is missing until someone looks at a report, and by then it’s too late to collect it. More insidiously, it undermines confidence in the monitoring service you’re providing to your internal customers.

BPPM’s guaranteed data delivery (new with version 9.5) helps by making sure you don’t lose any data points while an agent is disconnected. Betty also mentioned a report-based answer to help spotlight gaps that have occurred. That helps a lot, and it’s a great step forward from flying blind and reacting to customer complaints. But shouldn’t there be a way to directly alert and create incident records when gaps start to appear?

One of the issues inherent to agent-based monitoring is that if the agent is unable to raise an alarm, it’s difficult to know when there’s a problem.

  • How can you be notified if the server crashes?
  • What if the network dies?
  • What if the agent crashes?

Most troubling, what if everything appears fine on the surface, but the agent has an internal problem which prevents it from raising an alarm?

There have been many attempts to close this gap.  These solutions typically involve designating a central server responsible for the health of all the agents, either by polling them, or by alerting when a persistent connection is lost.  These are variously provided by:

  • Flashing icons in the Patrol console
  • Alerts on lost agent connections in the event management views
  • Availability KM
  • Patrol Infrastructure KM
  • Self Monitoring KM

Unfortunately, there is a common scenario that escapes detection by the above methods: the agent is running, but it has health problems. Such an agent can lead to missed alerts just as surely as not having an agent at all. Many circumstances can leave an agent unable to monitor or trigger notifications:

  • Misconfiguration
  • Incorrect default account credentials
  • Default account locked
  • Firewall issues
  • Memory leaks
  • Insufficient resources to do normal tasks
  • Failures in dependencies
  • Others

Fortunately, there’s another way.

Solution

Instead of just asking the agents if they’re still alive, a better approach is to have the agents monitor their own vital signs and report to an external checker. Each agent should perform a self-assessment of its own health on a regular basis and forward the resulting report to a central server. The central server should analyze the reports and alert if a report has not arrived on schedule, or if a report indicates that its agent is not healthy. These alerts should also include probable cause analysis, because who wants to spend time troubleshooting?
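To make the idea concrete, here is a minimal Python sketch of the central checker side of this pattern. It is an illustration only, not the Advantis Agent Health KM or any Patrol API: the names `HealthReport` and `HealthChecker`, the check names, the host names, and the 300-second freshness window are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class HealthReport:
    """One self-assessment sent by an agent (hypothetical format)."""
    agent: str
    timestamp: float          # seconds since epoch when the self-assessment ran
    checks: Dict[str, bool]   # check name -> True if the check passed


class HealthChecker:
    """Central checker: alerts on missing, stale, or unhealthy reports."""

    def __init__(self, expected_agents: Iterable[str], max_age_seconds: float):
        self.max_age = max_age_seconds
        # None means "no report has ever been received from this agent".
        self.latest: Dict[str, Optional[HealthReport]] = {
            name: None for name in expected_agents
        }

    def receive(self, report: HealthReport) -> None:
        # Keep only the most recent report per agent.
        self.latest[report.agent] = report

    def evaluate(self, now: float) -> List[str]:
        alerts = []
        for agent, report in sorted(self.latest.items()):
            if report is None:
                alerts.append(f"{agent}: no self-assessment received")
            elif now - report.timestamp > self.max_age:
                age = now - report.timestamp
                alerts.append(f"{agent}: last self-assessment is {age:.0f}s old")
            else:
                # Failed check names double as a crude probable-cause hint.
                failed = sorted(n for n, ok in report.checks.items() if not ok)
                if failed:
                    alerts.append(
                        f"{agent}: unhealthy, failed checks: {', '.join(failed)}"
                    )
        return alerts


# Example: three expected agents, one healthy, one unhealthy, one silent.
checker = HealthChecker(["web01", "db01", "app01"], max_age_seconds=300)
checker.receive(HealthReport("web01", 1000.0, {"collectors_running": True}))
checker.receive(HealthReport("db01", 1000.0, {"collectors_running": False}))
alerts = checker.evaluate(now=1100.0)
for line in alerts:
    print(line)
```

The point of the design is that silence counts as a failure. The checker starts from a list of agents it expects to hear from, so an agent that has crashed, lost its network, or stalled internally is flagged even though it never sent anything.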

We developed a tool that does exactly that.  To my knowledge, the Advantis Agent Health KM is the only solution that provides full coverage for monitoring the health and the integrity of your Patrol environment.

Key Features and Benefits:

  • Monitor the health and availability of the Patrol agent
  • Identify stalled collectors causing gaps in the data
  • Ensure your agents have the resources they need
  • Rest assured that your monitors are working effectively
