A Challenge: CPU Usage Alerts for a Single CPU/vCPU

I had a team approach me a few days ago and ask for a pretty wild request. They wanted to know when a single CPU exceeded the pre-defined CPU_Crit threshold (a custom property that we use) or the default 90% (if there is no CPU_Crit value defined) for a period greater than 10 minutes. Apparently there are some applications out there that still aren't coded to be multi-threaded. (Gasp!)

First, this is a node based alert as we are reporting on CPU usage. Second, it is NOT a CPU based alert. Seems redundant but it's not. When an alert triggers in Orion (whether it be NPM, SAM, etc.) the alert definition (AlertDefinitions.AlertDefID) is associated with the node ID. Only one instance of an active alert can exist at the same time. That isn't usually a problem except in a case like this where the app owner wanted to know each and every time an individual CPU/vCPU exceeded their threshold. I'll save you the suspense and tell you I couldn't figure out a way to make that work. See my alert message logic in a later post to see how I tried to work around that problem.

First, we use a couple of custom properties in our implementation. N_Mute allows us to mute a node so that alerts don't trigger but still allow data to be collected. Prod_State lets us define the status of a server in the development cycle in case a development team wants to see alerts for only dev or the run teams only want production servers. As previously mentioned, we also use CPU_Crit (and CPU_Warn) to define custom CPU utilization levels. We have defaults (90% for critical) but we like to allow the teams to tweak those as required. Finally, we collect statistical data every 5 minutes from each of our nodes. This is important to note as the request was to alert after 10 minutes (or 2 polling cycles).

OK -- now the code!

The SELECT Nodes.NodeID, Nodes.Caption FROM Nodes is pre-populated when you select a nodes-based alert. Nothing to see here.

The INNER JOIN CPUMultiLoad ON Nodes.NodeID = CPUMultiLoad.NodeID allows us to get access to the individual CPU data. The Nodes table contains a value (Nodes.CPULoad) which is echoes from the CPULoad.AvgLoad. Yes, it is an average load as of the last statistics collection period. In this case we want to know the average load, but we need to check the average load of each individual CPU. (For those of you playing along at home you've probably noticed that you can't find a CPUMultiLoad table. If you are using SQL Server Management Studio expand the Views folder. CPUMultiLoad is a view, not a table. Views allow lightweight access to a specific dataset without having the directly query the table. Wikipedia has a nice article about SQL views.

The WHERE clause gets a little fancy, but not too bad. Broken down we check for nodes that aren't muted AND nodes that are in production AND (pay attention here!) data in the CPUMultiLoadview that is less than or equal to 10 minutes old. Why? Well the CPUMultiLoad view has a tonne of data in it. It holds a view (tada!) of the individual CPU performance data for all of the CPUs for the last ~7 days. (The data rolls up into CPUMultiLoad_Hourly table depending at midnight each night retaining data based on your data retention settings.) A note of caution. Make sure you include the AND DATEDIFF(mi, CPUMultiLoad.TimeStampUTC, getdate()) > 0 statement. This time period of > 0 and <= 10 minutes should match the alert trigger delay.

The final section of the WHERE clause leverages our custom properties. The first section checks to see if the CPU_Crit is empty and then check for CPU average load >= 90%. The second section checks that there is a value in the CPU_Crit and uses that value for the comparison. There is a little insurance sprinkled in there though. Did you notice it? We already have an alert for average CPU usage (looking at the entire node CPU usage) using Leon Adato's infamous The Ultimate CPU Alert. By checking that the CPULoad is < 90 (or CPU_Crit) I can avoid most of the redundant alerts as our current CPU alert checks against those values already. This alert should only trigger if the average CPU usage for the entire node is less than our defined values.

This is my first draft at this alert logic so please feel free to give me any feedback. I would love to figure out how to compare CPUCount against the number of CPUs that are over CPU_Crit so that I can further protect against duplicate alerts. The logic would be something like "Count the number of CPUMultiLoad.CPUIndex WHERE AvgLoad >= CPUCrit and ensure that the count is not >= the CPUCount".

SELECT Nodes.NodeID, Nodes.Caption, CPUMultiLoad.CPUIndex FROM Nodes

INNER JOIN CPUMultiLoad ON Nodes.NodeID = CPUMultiLoad.NodeID

WHERE

Nodes.n_mute <> 1

AND Nodes.Prod_State = 'PROD'

AND DATEDIFF(mi, CPUMultiLoad.TimeStampUTC, getdate()) <= 10

AND DATEDIFF(mi, CPUMultiLoad.TimeStampUTC, getdate()) > 0

AND

(

(nodes.CPU_Crit is null

AND CPUMultiLoad.AvgLoad >= 90

AND Nodes.CPULoad < 90)