We recently had a production-level incident that uncovered a problem with one of the multi-action alerts that I set up.
The alert has three actions: send an email, log a NetPerfmon event, and run an imported action that sends a SOAP request to a third-party alerting system (PagerDuty), which pages our on-call technicians.
During this particular outage, the SOAP request failed, so the on-call technicians were never paged, and that resulted in hours of downtime. Management now perceives SolarWinds as a 'single point of failure' for production alerting. It had been reliable up until that point, so rather than replace it I would like to detect that an alert action has failed and take an alternative action.
Is this possible? Basically, I want to show that we can 'monitor the monitor' itself, and provide a fail-safe way of confirming that the alert actions we assume worked actually did. Perhaps I could set up a PowerShell script to monitor a specific DB table for failure codes?
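To make the idea concrete, here is a rough sketch of the "detect the failure, then take an alternative action" logic I have in mind. It is written in Python only to show the pattern, and every name in it (`run_with_fallback`, `page_via_soap`, `send_backup_email`) is a hypothetical placeholder. None of it is a SolarWinds or PagerDuty API, and the two action functions are stand-ins that only simulate a failed SOAP call and a successful backup notification.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ActionResult:
    """Outcome of one attempted alert action."""
    name: str
    ok: bool
    detail: str = ""


def run_with_fallback(
    actions: List[Tuple[str, Callable[[], None]]]
) -> List[ActionResult]:
    """Try each action in order and stop at the first one that succeeds.

    An action counts as failed if it raises an exception. Every attempt
    is recorded so that failures can be logged or reported afterwards.
    """
    results: List[ActionResult] = []
    for name, action in actions:
        try:
            action()
        except Exception as exc:
            results.append(ActionResult(name, False, str(exc)))
            continue
        results.append(ActionResult(name, True))
        break
    return results


# Hypothetical stand-ins for the real actions (assumptions, not real APIs).
sent_messages: List[str] = []


def page_via_soap() -> None:
    # Simulates the SOAP request failing, as it did during the outage.
    raise ConnectionError("SOAP endpoint unreachable")


def send_backup_email() -> None:
    # Simulates a second, independent notification path.
    sent_messages.append("backup email to on-call")


if __name__ == "__main__":
    for result in run_with_fallback(
        [("soap_page", page_via_soap), ("backup_email", send_backup_email)]
    ):
        print(result)
```

The point is that the primary action's failure is recorded instead of silently assumed away, and the next action only fires when the previous one did not succeed. What I do not know is where SolarWinds exposes the success or failure of an alert action so that a check like this could be driven from it.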
If someone can provide suggestions or examples of how they would handle this, that would be great. Thanks!