Yesterday we had a midnight incident with a Cisco 3750 (it seems very choosy about the time window): all end hosts connected to the switch lost network connectivity, and the switch itself became unmanageable. We finally restored service by reloading the switch. SolarWinds didn't send out a single email alert during the whole outage. The network management setup failed at several levels.
1. Email alerting had stopped working a few days earlier, and I only found out after the incident. It appeared to be a permission issue, but I still cannot explain why the SMTP failure events were not highlighted in the Web console. I spend half my time staring at that console, and I am sure I would have noticed them. It is equally inexplicable that the Alerting engine worked fine for more than a year and then suddenly became fussy about running under the LOCAL SYSTEM account.
2. Two hours before the loss of network connectivity, the switch stopped responding to SNMP polls. We found this out from the gap in CPU load data in the historical charts. However, the device was still pingable, so no node-down event was recorded until the manual reload. To me, it seems the switch was already showing signs of trouble when it stopped responding to SNMP. Could we have noticed it earlier? Is it possible to send out email alerts when an SNMP poll fails?
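To make the condition in point 2 concrete, here is a rough Python sketch of the check I have in mind: the node still answers ping, but its last successful SNMP poll is older than some threshold. This is not a SolarWinds API. The function name, the thresholds, and the timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta


def snmp_silent_but_pingable(last_snmp_ok, last_ping_ok, now,
                             snmp_timeout=timedelta(minutes=15),
                             ping_timeout=timedelta(minutes=5)):
    """Return True when ICMP still answers but SNMP has been quiet too long.

    All arguments are hypothetical inputs: the time of the last successful
    SNMP poll, the time of the last successful ping, and the current time.
    """
    ping_alive = (now - last_ping_ok) <= ping_timeout
    snmp_stale = (now - last_snmp_ok) > snmp_timeout
    return ping_alive and snmp_stale


# Illustrative timestamps only (not taken from the real incident logs):
# SNMP went quiet two hours ago, ping answered one minute ago.
now = datetime(2024, 1, 1, 0, 0)
should_alert = snmp_silent_but_pingable(
    last_snmp_ok=now - timedelta(hours=2),
    last_ping_ok=now - timedelta(minutes=1),
    now=now,
)
print(should_alert)
```

With these sample values the check fires, because a normal node-down rule based on ping alone would stay quiet in exactly this situation. The thresholds would need to be tuned against the polling interval to avoid adding yet more false alerts.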
The greatest embarrassment for a monitoring system is when a user reports that the network is down and everyone looks surprised. Even more so when I constantly have to convince my team to tolerate several false alerts so that we don't miss any real events, and then the system fails to raise a single alert during a real incident.
Any suggestions for improvement?