CPU alerts are a yawner. Grab the CPULoad, check it against a threshold (maybe even a per-node custom threshold), cut the alert, move on, right?
Here's the problem: If you are working with sophisticated Operations or server staff, you probably already know that they hate CPU alerts because they are
- always vague
- frequently invalid
- way too frequent because they are tuned too low, or
- never triggered when you need them because they are tuned too high.
At the heart of the issue is the fact that high CPU, by itself, tells you nothing of use. So the CPU is high? So what? If I've got a box that is constantly running hot but it is keeping up with the work, that's called "correctly sized".
What you really need to know about CPU comes down to three things:
- How many processors are in the box
- How many jobs are in the Processor Queue
- What's the current CPU load
If you've got more jobs in the queue than you have CPUs and you also have high CPU, then you have the makings of a meaningful, actionable issue.
Let's add a little icing on the cake: When the condition above occurs, I want to know what the top 10 processes are at that moment, so I can get an idea of the likely culprits.
Interested? Let's get to work!
For this to work, you need NPM and SAM. You will be assigning one Perfmon counter to all your servers, and doing a little bit of SQL voodoo in the alert.
The Perfmon Counter
In SAM, set up a new template. In it, you want to add a perfmon counter monitor named “Win_Processor_Queue_Len” that points to
- Counter: “Processor Queue Length”,
- Instance: (blank)
- Category: “System”
After appropriate testing, adjustments, etc, you will eventually roll this template out to all your Windows systems.
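Before rolling it out everywhere, you may want to confirm the counter data is actually landing in the Orion database. A quick spot-check like the sketch below (run against the same APM_AlertsAndReportsData table the alert trigger uses later) should start returning rows once the template has polled a few times:
SELECT TOP 10
  APM_AlertsAndReportsData.NodeId,
  APM_AlertsAndReportsData.ComponentName,
  APM_AlertsAndReportsData.StatisticData
FROM APM_AlertsAndReportsData
WHERE APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
If you see nothing after a couple of polling cycles, go back and re-check the counter name, instance, and category before worrying about the alert itself.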
The Alert Trigger
Your alert trigger is going to require some hardcore SQL. So you are setting up a Custom SQL Alert, with “Nodes” as the target table.
Along with the top part of the query that is automatically provided, you will add the following:
INNER JOIN APM_AlertsAndReportsData
  ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2
  ON Nodes.NodeID = c2.NodeID
WHERE
  APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
  AND APM_AlertsAndReportsData.StatisticData > c2.CPUCount
  AND Nodes.CPULoad > 90
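For reference, and as an easy way to test in SQL Server Management Studio which nodes would fire right now, the fully assembled query ends up looking something like the sketch below. The opening SELECT ... FROM Nodes is only a stand-in for the part the alert manager generates for you automatically; your version may name different columns.
-- The first two lines stand in for the auto-generated portion of the custom SQL alert
SELECT Nodes.NodeID, Nodes.Caption
FROM Nodes
INNER JOIN APM_AlertsAndReportsData
  ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2
  ON Nodes.NodeID = c2.NodeID
WHERE APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
  AND APM_AlertsAndReportsData.StatisticData > c2.CPUCount
  AND Nodes.CPULoad > 90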
What this is doing is
- pulling the count of CPUs for this node from the CPUMultiLoad table
- pulling the current statistic for the Win_Processor_Queue_Len perfmon counter
- checking that the number of items in the queue is greater than the number of CPUs
- and finally, checking that the CPULoad is over 90%
If the last two conditions are both true, you will get an alert.
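If the alert fires more (or less) often than you expect, it's worth sanity-checking the CPU counts the trigger is comparing against. This is just the inner subquery from above, flattened so you can eyeball the number per node:
SELECT CPUMultiLoad.NodeID,
       COUNT(DISTINCT CPUMultiLoad.CPUIndex) AS CPUCount
FROM CPUMultiLoad
GROUP BY CPUMultiLoad.NodeID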
If you stop here, you have a nifty alert that will tell you when something meaningful (and bad) is going on with your server. But let’s kick it up a notch.
Trigger Action
Your alert action is going to have two key steps:
- Run the “SolarWinds.APM.RealTimeProcessPoller.exe” utility to get the top 10 processes
- After a 60-second delay, send your message
Get the Processes
The “SolarWinds.APM.RealTimeProcessPoller.exe” utility comes as part of SolarWinds SAM.
NOTE: If you installed SolarWinds somewhere other than the default location (C:\Program Files (x86)), then you will need to provide the full path to \SolarWinds\Orion\APM\SolarWinds.APM.RealTimeProcessPoller.exe
Otherwise, your command will look like this:
- SolarWinds.APM.RealTimeProcessPoller.exe -n=${NodeID} -alert=${AlertDefID} -timeout=60
The only thing you may want to adjust is the -timeout value, if you find alerts coming back with no process information (i.e., it's taking longer than 60 seconds for the servers to respond).
Send Your Message
At its most basic, your alert message needs to look like this:
CPU on Node ${NodeName} is at ${CPULoad} at ${AlertTriggerTime}.
Top 10 processes at the time of the alert are:
${Notes}
NOTE: The ${Notes} field is populated with the top 10 processes as part of the previous action.
However, if you want to dress it up, you can include more information using more SQL voodoo:
CPU on Node ${NodeName} is at ${CPULoad} at ${AlertTriggerTime}.
There are ${SQL:Select APM_AlertsAndReportsData.StatisticData from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'} items in the process queue and only ${SQL:Select COUNT(c1.CPUIndex) from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad where CPUMultiLoad.nodeid = ${NodeID} ) c1 } CPUs to process them.
Top 10 processes at the time of the alert are:
${Notes}
If there is no process list in the alert, it's because it took longer than 2 minutes to collect it off the server. We felt that delivering the alert fast was more important.
What that big ${SQL:...} block in the middle does is pull the current Win_Processor_Queue_Len statistic, along with the count of CPUs for this node from the CPUMultiLoad table. The result would read:
There are 10 items in the process queue and only 4 CPUs to process them.
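If you'd like to preview those two numbers for a particular server before the alert ever fires, you can run the same two lookups by hand. The 123 below is just a placeholder; substitute a real NodeID wherever the alert would have filled in ${NodeID}:
-- items waiting in the processor queue for the node (123 is a placeholder NodeID)
SELECT APM_AlertsAndReportsData.StatisticData
FROM APM_AlertsAndReportsData
WHERE APM_AlertsAndReportsData.NodeId = 123
  AND APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
-- CPUs that node has available to work the queue
SELECT COUNT(c1.CPUIndex)
FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
      FROM CPUMultiLoad
      WHERE CPUMultiLoad.NodeID = 123) c1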
After setting up the message, make sure you go to the “Alert Escalation” tab and set the “Delay the execution of this action” to at least 1 minute.
Summary
So there you have it: a CPU alert that not only tells you when something meaningful and actionable is happening, but also gives you (or your support staff) some initial information to get started finding and resolving the problem.
As anecdotal proof of how valuable this is, within 24 hours of rolling out this alert at my company, we found three different applications that were chronically misbehaving across the enterprise. Two of those let us prove an issue to the vendor (who didn't believe us) and get a bug fix under way.