We're trying to figure out a way to capture threshold triggers AND manage the deluge of alerts that this might cause. For example, monitoring response time or CPU requires a fine balance: tuning out the noise of a standard monitoring environment without suppressing the alerts that actually matter.
The initial thought was to do some sort of alert counting: if X alerts happened within a fixed period of time, we'd trigger an alert that would have mandatory triage and escalation rules associated with it. As far as I can tell, there is no way to hold an alert trigger until a threshold violation count has been reached. (I know you can do an 'if alert is triggered for X time'.)
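To make the counting idea concrete, here's a rough sketch in Python of the behavior I have in mind. It isn't taken from any monitoring product: `ViolationCounter`, `max_violations`, and `window_seconds` are names I made up for illustration, and it assumes violations arrive in time order with timestamps in seconds.

```python
from collections import deque


class ViolationCounter:
    """Hypothetical helper: hold back escalation until `max_violations`
    threshold violations occur within `window_seconds`."""

    def __init__(self, max_violations, window_seconds):
        self.max_violations = max_violations
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one violation; return True if it should escalate."""
        self._timestamps.append(timestamp)
        # Drop violations that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_violations:
            # Reset so a single burst produces a single escalation.
            self._timestamps.clear()
            return True
        return False
```

With `ViolationCounter(3, 60)`, three violations inside 60 seconds would escalate once, while violations spread further apart would never fire. Whether the count should reset after escalating (as it does here) is a design choice I haven't settled on.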
How are other people handling this? I'm even considering using a third-party solution to capture some of this alert noise and parse it into only actionable alerts.
Thoughts?