Here is what I am trying to do. I need a way to catch nodes that have entered fast polling but were never marked down. Some of our nodes reboot faster than their polling interval, because we set polling intervals based on urgency (ICMP polling) and impact (SNMP polling). If a device is not classified to poll faster than its reboot cycle (think access switch here), we can miss the node going down, because it returns before the fast polling cycle ends. A device may miss its first poll and drop into fast polling for 2 minutes, then return to service before the fast poll period runs out, so it never shows as down. We want to avoid alerting on warning status alone, as that would capture every node that misses a single polling interval, not just nodes that actually rebooted.
Our ICMP intervals are 60, 90, 120 and 150 seconds (urgency 1 – 4) and our SNMP polling intervals are 5, 10, and 15 minutes (impact 1 – 3). The worst-case scenario is that a node:
1) Responds to an ICMP request and then fails
2) Polling engine waits 150 seconds to poll again and misses (maybe)
3) Starts fast polling for 120 seconds (default)
4) Node responds on the last of the fast poll periods for a total of 270 seconds down.
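To put numbers on the steps above for every urgency level, here is a minimal Python sketch. It assumes the simple model described in this post: the node fails right after a successful ICMP poll, the next regular poll misses, and the node answers on the last fast poll. The function names are made up for illustration and are not part of any monitoring product.

```python
# Values taken from the post; the fast-poll window is the stated default (120 s).
ICMP_INTERVAL_S = {1: 60, 2: 90, 3: 120, 4: 150}  # urgency -> status poll interval
FAST_POLL_WINDOW_S = 120


def worst_case_hidden_outage(urgency: int) -> int:
    """Longest outage (seconds) that can end before the node is marked down."""
    return ICMP_INTERVAL_S[urgency] + FAST_POLL_WINDOW_S


def may_go_unreported(urgency: int, outage_s: int) -> bool:
    """True if an outage of this length can end inside the fast-poll window."""
    return outage_s <= worst_case_hidden_outage(urgency)


if __name__ == "__main__":
    for urgency in sorted(ICMP_INTERVAL_S):
        print(f"urgency {urgency}: up to {worst_case_hidden_outage(urgency)} s hidden")
```

Under these assumptions, urgency 4 gives the 270 seconds from the list above, and even urgency 1 leaves a window of 180 seconds in which a reboot can complete without the node ever showing down.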
We had been alerting on node reboot, in addition to node down, but since node reboot is based on an SNMP service restart, every time we made a change to the SNMP daemon on a Linux box we got an erroneous node down notification. We noticed the problem on low-priority switches that appeared to be rebooting between polling intervals and coming back up to a pingable state faster than the status interval + fast polling period.
Any ideas?
I know that we could a) reduce the fast polling period or b) change the urgency of a node, but the latter change would also change the node's priority for escalation, so it isn't really an option. Just wondering if there are any alert configurations out there that capture this scenario. Last reboot works, but I have to wait for the node to come back up before I alert -- and that could be a multi-minute delay.
Thanks,
Josh