We are trying to implement some simple logic for SNMP trap / syslog alerting:
- If the same (bad fan) message is repeated from the same host (Host A), suppress subsequent alerts from that host for 6 hours.
- If the same (bad fan) message arrives from different hosts (Host A and Host B), trigger an alert for each host, but still suppress subsequent alerts from each individual host for 6 hours.
Unfortunately, I cannot find a way to make this work. Instead, the alerting does a fantastic job of suppressing the same message from ALL hosts. For example:
- Host A reports a fan failure (this alert is now suppressed for 6 hours).
- Host B reports a fan failure within those 6 hours (this alert is also suppressed; in fact, it will never trigger until the issue with Host A is fixed).
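To make the desired behavior concrete: the difference comes down to what the suppression is keyed on. Below is a minimal, product-agnostic Python sketch that replays the example above under both keys. All names (`Suppressor`, `replay`, `SUPPRESS_WINDOW`) and the timestamps are hypothetical, and it assumes a fixed 6-hour window measured from the last alert that actually fired. Keying on the message alone reproduces the current behavior; keying on (host, message) gives the behavior we want.

```python
from datetime import datetime, timedelta

# Hypothetical: the 6-hour suppression window from the requirements above.
SUPPRESS_WINDOW = timedelta(hours=6)


class Suppressor:
    """Hypothetical suppressor: drops an alert if the same key fired within the window."""

    def __init__(self, per_host: bool, window: timedelta = SUPPRESS_WINDOW):
        self.per_host = per_host
        self.window = window
        self.last_fired = {}  # key -> time the last alert for that key was raised

    def _key(self, host: str, message: str):
        # Per-host mode keys on (host, message); global mode keys on the message
        # alone, which suppresses the same message from ALL hosts.
        return (host, message) if self.per_host else message

    def should_alert(self, host: str, message: str, now: datetime) -> bool:
        key = self._key(host, message)
        last = self.last_fired.get(key)
        if last is not None and now - last < self.window:
            return False  # same key alerted recently: suppress
        self.last_fired[key] = now  # start a new suppression window for this key
        return True


def replay(per_host: bool) -> list:
    """Replay a Host A / Host B fan-failure sequence; return whether each event alerts."""
    t0 = datetime(2024, 1, 1)
    events = [
        ("HostA", "bad fan", t0),
        ("HostA", "bad fan", t0 + timedelta(hours=1)),
        ("HostB", "bad fan", t0 + timedelta(hours=2)),
        ("HostB", "bad fan", t0 + timedelta(hours=3)),
        ("HostA", "bad fan", t0 + timedelta(hours=7)),
    ]
    s = Suppressor(per_host=per_host)
    return [s.should_alert(host, msg, when) for host, msg, when in events]


if __name__ == "__main__":
    print(replay(per_host=False))  # [True, False, False, False, True]
    print(replay(per_host=True))   # [True, False, True, False, True]
```

With the message-only key, Host B's failure at hour 2 is swallowed by Host A's window; with the (host, message) key, Host B alerts immediately while each host's own repeats are still suppressed. A single rule with that key covers every device, so nothing has to be created per node.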
I have read other posts where the suggested answer was to create a separate rule for each device. That is a wonderful suggestion if your environment contains two devices; it does not scale well when your environment contains thousands of nodes.
The behavior described above occurs with both syslog messages and SNMP traps.
Has anyone found a way to make this work?