The Ultimate CPU Alert

September 10, 2013, 5:49 am

CPU alerts are a yawner. Grab the CPULoad, check it against a threshold (maybe even a per-node custom threshold, as explained here: TIPS & TRICKS: Stop The Madness: How to set alert thresholds per-device), cut the alert, move on, right?

Here's the problem: If you are working with sophisticated Operations or server staff, you probably already know that they hate CPU alerts because they are

always vague
frequently invalid
way too frequent because they are tuned too low OR
never triggered when you need them because they are tuned too high.

At the heart of the issue is the fact that high CPU, by itself, tells you nothing of use. So the CPU is high? So what? If I've got a box that is constantly running hot but it is keeping up with the work, that's called "correctly sized".

What you really need to know about CPU are 3 things:

How many processors are in the box
How many jobs are in the Processor Queue
What's the current CPU load

If you've got more jobs in the queue than you have CPUs and you also have high CPU, then you have the makings of a meaningful, actionable issue.
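
The rule above boils down to a two-part test. Here is a minimal sketch in Python; the function name `should_alert` and the default cut-off of 90% are illustrative assumptions, not anything SolarWinds ships:

```python
def should_alert(cpu_count: int, queue_length: float, cpu_load: float,
                 load_threshold: float = 90.0) -> bool:
    """Return True only when the box is both busy and falling behind.

    cpu_count      -- number of processors in the box
    queue_length   -- current Processor Queue Length
    cpu_load       -- current CPU load, in percent
    load_threshold -- hypothetical cut-off for "high CPU"
    """
    queue_backed_up = queue_length > cpu_count
    cpu_is_high = cpu_load > load_threshold
    return queue_backed_up and cpu_is_high


# A hot box that keeps up with its work is "correctly sized": no alert.
print(should_alert(cpu_count=4, queue_length=2, cpu_load=97))   # False
# A hot box with more queued jobs than CPUs is worth a look.
print(should_alert(cpu_count=4, queue_length=10, cpu_load=97))  # True
```

Neither signal is enough on its own: a long queue on an idle box and a busy box with an empty queue both return False.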

Let's add a little icing on the cake: When the condition above occurs, I want to know what the top 10 processes are at that moment, so I can get an idea of the likely culprits.

Interested? Let's get to work!

For this to work, you need NPM and SAM. You will be assigning one Perfmon counter to all your servers, and doing a little bit of SQL voodoo in the alert.

The Perfmon Counter:

In SAM, set up a new template. In it, you want to add a perfmon counter monitor named “Win_Processor_Queue_Len” that points to

Counter: “Processor Queue Length”,
Instance: (blank)
Category: “System”

After appropriate testing, adjustments, etc, you will eventually roll this template out to all your Windows systems.

The Alert Trigger

Your alert trigger is going to require some hardcore SQL. So you are setting up a Custom SQL Alert, with “Nodes” as the target table.

Along with the top part of the query that is automatically provided, you will add the following:

inner join APM_AlertsAndReportsData
    on (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
inner join (select c1.NodeID, COUNT(c1.CPUIndex) as CPUCount
            from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  from CPUMultiLoad) c1
            group by c1.NodeID) c2 on Nodes.NodeID = c2.NodeID
where
    APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
    AND APM_AlertsAndReportsData.StatisticData > c2.CPUCount
    AND Nodes.CPULoad > 90

What this is doing is

Pulling the count of CPUs for this node from the CPUMultiLoad table
Pulling the current statistic for the Win_Processor_Queue_Len perfmon counter
Checking that the number of processes in the queue is greater than the number of CPUs
And finally checking that the CPULoad is over 90%

If the conditions in items 3 and 4 are both true, you will get an alert.
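
To see the trigger behave before pointing it at a live system, here is a rough sketch that runs the same joins against toy tables in an in-memory SQLite database. Several things here are assumptions: the tables carry only the columns the query touches, the sample rows are made up, the `SELECT ... FROM Nodes` line is a stand-in for the top part the alert editor provides, and SQLite is standing in for SQL Server.

```python
import sqlite3

# Toy stand-ins for the three tables the trigger touches. Only the columns
# used by the query are modelled; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Nodes (NodeID INTEGER, Caption TEXT, CPULoad INTEGER);
    CREATE TABLE CPUMultiLoad (NodeID INTEGER, CPUIndex INTEGER);
    CREATE TABLE APM_AlertsAndReportsData
        (NodeId INTEGER, ComponentName TEXT, StatisticData REAL);

    INSERT INTO Nodes VALUES (1, 'busy-and-stuck', 97), (2, 'busy-but-fine', 97);
    -- Repeated rows per CPU index, which is why the query needs DISTINCT.
    INSERT INTO CPUMultiLoad VALUES
        (1, 0), (1, 1), (1, 0), (1, 1), (2, 0), (2, 1), (2, 0), (2, 1);
    INSERT INTO APM_AlertsAndReportsData VALUES
        (1, 'Win_Processor_Queue_Len', 10),
        (2, 'Win_Processor_Queue_Len', 1);
""")

TRIGGER_SQL = """
SELECT Nodes.NodeID, Nodes.Caption  -- assumed stand-in for the provided top part
FROM Nodes
INNER JOIN APM_AlertsAndReportsData
    ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
  AND APM_AlertsAndReportsData.StatisticData > c2.CPUCount
  AND Nodes.CPULoad > 90
"""

# Both nodes run at 97% CPU and have 2 CPUs, but only node 1 has more
# queued jobs (10) than CPUs, so only node 1 is returned.
alerting = conn.execute(TRIGGER_SQL).fetchall()
print(alerting)
```

Without the DISTINCT in the inner select, the repeated CPUMultiLoad rows would inflate the CPU count and hide a backed-up queue.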

If you stop here, you have a nifty alert that will tell you when something meaningful (and bad) is going on with your server. But let’s kick it up a notch.

Trigger Action

Your alert action is going to have two key steps:

Run the “Solarwinds.APM.RealTimeProcessPoller.exe” utility to get the top 10 processes
After a 60 second delay, send your message

Get the Processes

The “Solarwinds.APM.RealTimeProcessPoller.exe” comes as part of SolarWinds SAM.

NOTE: If you installed SolarWinds somewhere other than the default location (C:\program files (x86)) then you will need to provide the full path to \SolarWinds\Orion\APM\Solarwinds.APM.RealTimeProcessPoller.exe

Otherwise, your command will look like this:

SolarWinds.APM.RealTimeProcessPoller.exe -n=${NodeID} -alert=${AlertDefID} -timeout=60

The only thing you may want to adjust is the -timeout value, if you find alerts are coming back with no process information (i.e., the servers are taking longer to respond).

Send Your Message

At its most basic, your alert message needs to look like this:

CPU on Node ${NodeName} is at ${CPULoad} at ${AlertTriggerTime}.

Top 10 processes at the time of the alert are:

${Notes}

NOTE: The ${Notes} field is populated with the top 10 processes as part of the previous action.

However, if you want to dress it up, you can include more information using more SQL voodoo:

CPU on Node ${NodeName} is at ${CPULoad} at ${AlertTriggerTime}.

There are ${SQL:Select APM_AlertsAndReportsData.StatisticData from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'} items in the process queue and only ${SQL:Select COUNT(c1.CPUIndex) from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad where CPUMultiLoad.nodeid = ${NodeID} ) c1 } CPUs to process them.

Top 10 processes at the time of the alert are:

${Notes}

If there is no list of processes, it's because it took longer than 2 minutes to collect off the server. We felt that delivering the alert fast was more important.

What that big ${SQL… block in the middle does is pull the current Win_Processor_Queue_Len statistic, along with the count of CPUs for this node from the CPUMultiLoad table. The result would read:

There are 10 items in the process queue and only 4 CPUs to process them.

After setting up the message, make sure you go to the “Alert Escalation” tab and set the “Delay the execution of this action” to at least 1 minute.

Summary

So there you have it. A CPU alert that not only tells you when something meaningful and actionable is happening, but it gives you (or your support staff) some initial information to get you started finding and resolving the problem.

As anecdotal proof of how valuable this is, within 24 hours of rolling out this alert at my company, we found 3 different applications which were chronically mis-behaving across the enterprise. 2 resulted in our being able to prove an issue to the vendor (who didn’t believe us) and get a bug-fix under way.

EDIT 2014-10-31:

As discovered by jbiggley in this post: Custom SQL Alerts - Do reset conditions also need to be custom?, the reset trigger is problematic for this alert (as with all custom SQL alerts). You can't just select "reset when the condition is no longer true". The solution, as elaborated by RichardLetts here: Warning about custom SQL alerts (reset trigger), is that the reset trigger needs to be:

inner join APM_AlertsAndReportsData
    on (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
inner join (select c1.NodeID, COUNT(c1.CPUIndex) as CPUCount
            from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  from CPUMultiLoad) c1
            group by c1.NodeID) c2 on Nodes.NodeID = c2.NodeID
where
    (APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len' AND APM_AlertsAndReportsData.StatisticData <= c2.CPUCount)
    OR Nodes.CPULoad <= 90

The key change here is that you want to reset when EITHER the number of processes in the queue is no greater than the number of CPUs, OR the CPU load is at or under the threshold.
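
One way to sanity-check a reset condition like this is to confirm that it is the exact logical opposite of the trigger, so a node is always in one state or the other. Here is a small sketch that checks this over a grid of values; the function names `trigger` and `reset` are hypothetical, and the threshold of 90 comes from the queries above:

```python
def trigger(cpu_count, queue_length, cpu_load, threshold=90):
    # Fire when the queue is longer than the CPU count AND the CPU is hot.
    return queue_length > cpu_count and cpu_load > threshold


def reset(cpu_count, queue_length, cpu_load, threshold=90):
    # Clear when the queue has drained OR the CPU has cooled off.
    return queue_length <= cpu_count or cpu_load <= threshold


# Walk a grid that includes the boundary values (queue == CPUs, load == 90)
# and collect any case where both or neither condition holds.
mismatches = [
    (cpus, queue, load)
    for cpus in (1, 2, 4, 8)
    for queue in range(0, 12)
    for load in (0, 89, 90, 91, 100)
    if trigger(cpus, queue, load) == reset(cpus, queue, load)
]
print(mismatches)  # an empty list means reset is the exact complement of trigger
```

The boundary cases are where a greater-than / less-than slip would show up, which is why the grid includes a queue length equal to the CPU count and a load of exactly 90.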

EDIT 2015-02-23

Hat-Tip to garyuk who caught my greater-than / less-than confusion in the reset logic above. It's fixed now.

↧

Disable switch stack monitoring for single switches

August 23, 2016, 12:55 pm

Hello,

I have some 2960S switches plus additional Cisco stack-capable models that I am using as single switches. Right now I am getting alerts that their stack ports are having issues even though they are single-member switches. Is there a way within SolarWinds to disable the stack monitoring functionality on these devices specifically?

I do not need stack monitoring on devices that I am not using as a stack.

-Thanks

↧

NetPath doesn't work

January 9, 2017, 5:55 am

All,

I am having issues with NetPath, in that it's not doing anything.

"First Poll Not Yet Complete"

It has been like that for quite some time.

I am monitoring an internal device across the network.

The probe meets the minimum specification for NetPath plugin to be installed and it is.

I have uninstalled the agent, rebooted and reinstalled the agent, all this worked fine and everything is installed as you would expect.

I have since deleted and re-created the path but it is still returning no data.

↧

Privileged users unable to scroll within MAP

January 5, 2017, 12:36 pm

Hello,

In our environment, when privileged accounts are used with Internet Explorer to access SolarWinds, those accounts are unable to scroll within the Global Network map. We're able to zoom in, but never scroll within the map. With our regular user accounts, we're able to use the map and scroll as normal.

Any suggestions?

Thanks,

Wess

↧

NPM 11.5.2 to 12.0.1 - Now syslog / trap services won't start

December 16, 2016, 11:54 am

All prerequisites met successfully, the upgrade went well until the very end when this message appeared:

Checking Orion Service Manager, all services are running except Syslog - which is stuck in "stopping" and trap, which is stopped and won't start.

New firewall ports --- are they TCP or UDP? The NPM 12.0 Release Notes *do not specify*. Could this be a problem with the site and services simply not starting?

Thanks,

Bill

↧

Router pair response time reporting

January 7, 2017, 1:52 pm

I would like to create a report that builds a matrix of response times between our WAN routers. Report would show mesh response times between any 2 routers of the 20 we have. Is this possible?

↧

Wireless Heat Maps - 2.4Ghz and 5Ghz connections

October 7, 2016, 1:53 am

Hi guys wondering if anyone could shed some light on this.

We've finally upgraded to the latest version of SolarWinds (12.0.1) and are now able to use the heat map feature, and so far it seems like it's going to be a lot of help. I'm just wondering how it handles the different frequencies. We have Cisco 2802i APs, which support both 2.4GHz and 5GHz connections. The question is: how does the heat map interpret this? Or doesn't it?

I'm guessing what I am seeing is the 2.4GHz connection but it would be nice to know exactly what I am seeing and if the 5GHz feature has been or is going to be implemented.

Thanks!

↧

Change Window Monitoring

January 9, 2017, 8:39 am

We have a regular change window every week, and I would like to stop the monitoring of systems during the change window. I know that I can go in and "unmanage" the affected devices. Is there a better way, maybe an automated way, to stop monitoring every week during the change window?

↧

Events not generating emails after upgrading to NPM 12.0.1

December 13, 2016, 4:56 am

Hi all. I've looked at different threads regarding emails failing out of NPM in general, but the traditional fixes aren't working for me. I've verified the SMTP server, I've sent test emails, and I get emails fine from NCM (which I installed the same day I upgraded NPM) - and the alerts appear to be firing fine on the alerts page/events page, but we've traditionally gotten emails for these events. I'm just not sure where to look to specifically troubleshoot emails with one portion of NPM and not Orion in general. All of the alerts that were there before are still turned on, but anything that shows up specifically on the "events" page is not generating an email like it was before. I've looked multiple times and cannot figure out where this is breaking. Any suggestions? Thanks!

↧

Getting email alert with the node name specified (NPM)

January 9, 2017, 9:23 am

Hi All,

I was configuring the alerts for different nodes on my network using NPM. I realized we cannot see the name of the Node on which the alert has been triggered.

Is there any way that we can specify the variable for the node name in the email action, so that it would be easy for us to figure out which device triggered the alert without getting into NPM?

Thanks in advance,

Sathvik

↧

Report to find nodes associated with Windows Credential

April 21, 2014, 7:23 am

We have an NPM instance where in 'Manage Windows Credentials' there's a credential name assigned to 15 nodes. We're trying to determine what 15 nodes that credential is assigned to. I don't see in web or report writer, a template that could be readily modified to extract this info without having to create a customized SQL query.

I looked in SQL Management Studio and believe I found the table entry, but not sure how to extract it.

Here it is in the table:

Any suggestions other than my creating a custom query?

Thanks

↧

Discovery settings changing after running config wizard?

September 28, 2016, 9:51 am

Has anyone else run into this? The configuration wizard ran during an upgrade to NTA 4.2, which seems to have knocked out some of the settings I had in my discoveries. I love the "Define Monitoring Settings" feature in the Sonar Wizard; it's what keeps my nightly discovery from gobbling up all those useless WAN Miniport interfaces.

For some reason, after the config wizard ran, the filters in the picture above were wiped out and all of the checkboxes for status, port mode, and hardware were selected. Additionally the IIS, SQL, and Exchange monitors under the Applications section were changed to be included in the discovery. Needless to say the next morning our polling engines were overrun, and it took a few hours to manually delete the thousands of newly imported WAN Miniport interfaces and the hundreds of AppInsight monitors. Maybe this was a fluke, but it seems like this shouldn't happen?

↧

orion universal device poller object reference not set to an instance of an object. NPM 12.01

December 21, 2016, 4:31 am

After upgrading to NPM 12.0.1 I started getting an error from the Orion UDP application. I had a similar problem with version 11.5.3, but it got fixed by a hotfix:

Orion Universal Device Poller: Unspecified Error - SolarWinds Worldwide, LLC. Help and Support

I already tried the manual database backup, restarting all the services, and a full server reboot. The only thing that solves the problem is a full server reboot, but this is a solution I can only apply on Sunday mornings.

Any idea what service or process needs to be restarted in order to fix this without a server reboot?

Only error in UDP log:

2016-12-21 07:39:58,927 [1] ERROR Program - Main Form Unhandled exception.
System.NullReferenceException: Object reference not set to an instance of an object.
   at CustomPollerManager.TreeController.InitMibParentNodes()
   at CustomPollerManager.BrowseMIBForm.InitMibParentNodes()
   at CustomPollerManager.BrowseMIBForm..ctor(NetworkObjectType netObjType, String oid)
   at CustomPollerManager.CustomPollerUltraScenes.DefineCustomPollerScene.btnBrowseMib_Click(Object sender, EventArgs e)
   at System.Windows.Forms.Control.OnClick(EventArgs e)
   at Infragistics.Win.Misc.UltraButtonBase.OnClick(EventArgs e)
   at Infragistics.Win.Misc.UltraButton.OnMouseUp(MouseEventArgs e)
   at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
   at System.Windows.Forms.Control.WndProc(Message& m)
   at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
   at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
   at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)

↧

NPM 'niggles' and other random stuff I am encountering.

January 6, 2017, 5:47 am

Good afternoon.

So I have installed all apps on a new box and built a new, shiny SQL clustered environment to ensure that things stay up. I am looking to get funding for failover with SolarWinds too.

The thing is, I have the following problems:

- New NPM 12 is REALLY slow. I mean almost like old man trying to write a birthday card slow.

- I don't think it's the spec of the machine, or what the machine has on it (in terms of applications and nodes). It has five SolarWinds applications on it:

- NPM

- NCM

- IPAM

- WPM

- SAM

It is a VM with the following spec running in VMware 5.5, but an upgrade is on the way (these are Gen 9 HPs). I checked with SolarWinds and, for our licensing, this would be a perfectly fine spec. Bearing in mind, the SQL servers are exactly the same spec of machine.

- 16gb ram

- 4 cores (3ghz)

- Windows Server 2k12

- These are attached to fast storage (it is on the same storage as the SQL databases)

I type in the IP address (xxx.xxx.xxx.xxx) and it literally just freezes. It sits there 'waiting'. I have tried IE, Firefox and Chrome (I even wanted to get Opera out) and it just sits there waiting for the bus. The only address that does work is xxx.xxx.xxx.xxx/Orion/admin. I have checked our older SolarWinds and have copied the exact same IIS settings to make sure it's not that (though we were using old software, hence the new build and not an upgrade). I have ensured that all services and SolarWinds services are running, done a reboot, rolled back a snapshot and re-installed, and it is still giving the same results.

There are only three nodes added to this server (SQL and SolarWinds itself). Am I missing something inside IIS (I am not an expert in IIS) that stops me from connecting to this via the IP address and not from an account itself?

↧

NPM reports Device already in OrionDB, but it is not

January 10, 2017, 6:07 am

We are running NPM 11.5.2. We are attempting to import 10 Windows devices into NPM. The discovery is successful. When we go to import the devices, NPM says these devices already exist in the OrionDB. We have checked the Nodes table and the devices definitely do not exist in the DB by name OR IP address. They are also only single NIC devices. We ran the configuration wizard as well as Database Maintenance and reran the discovery and import process with the same results. We again verified that the devices do not exist in Managed Nodes and the Nodes table of the DB by caption name OR IP address. We ARE able to manually add the devices to Orion without any problem.

Has anyone seen this issue before and might know what is causing the problem?

Scott

↧

Dependency Question

January 10, 2017, 6:08 am

I'm using SAM to verify HTTP status for a customer's web server (set the poll for every 60 seconds). I am also polling 8.8.8.8 (Google Free DNS) via ICMP (default poll rate of 120 seconds). I set up a dependency for the HTTP poll and set the Google DNS as the parent. I was getting a ton of notifications that HTTP was down on www.customer.com (but came back up immediately after). Couple questions re: polling:

1. I know that SW polls the child and then checks the parent status. What if the parent status is Warning? I think my internet connection was bouncing and google would be warning and then the HTTP check would fail and send out a notification.

2. Can I set a node to be down after only 3 missed polls? If I lose my ISP google goes to warning, waits 10 seconds, stays in warning, repeat, repeat... during this time if SAM polls my server it will see it as down. Wouldn't it be helpful to mark a node as down after only a couple missed polls?

↧

Network Discovery Alert Variables

November 1, 2016, 10:47 am

Getting emails from scheduled discovery jobs is very helpful; not knowing which discovery they come from isn't. I feel it's possible, but I need some SQL help to sort it out.

Info:

There are 2 tables we need to tie together:

DiscoveryProfiles

DiscoveryJobs

The discovery notification comes from the discovery jobs table. If you look at the table you can see a row for every job. That row contains variable "ProfileID".

I need to tie the "ProfileID" variable from the "DiscoveryJobs" table to the "ProfileID" in the "DiscoveryProfiles" table so I can pull out the "Name" of that discovery.

I would like to have the "Name" of the profile inserted into the email so when I get 6+ a night I know which job it came from. This would also be helpful in the discovery failed emails so you would know which discovery failed to run.
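
The join described above can be sketched against toy tables in an in-memory SQLite database. The table names and the "ProfileID" and "Name" columns come from the description; the `JobID` column, the sample rows, and the helper `profile_name_for_job` are hypothetical. How the result would be inserted into the notification email is not covered here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DiscoveryProfiles (ProfileID INTEGER, Name TEXT);
    -- JobID is a hypothetical key; use whatever identifies a job row for you.
    CREATE TABLE DiscoveryJobs (JobID INTEGER, ProfileID INTEGER);

    INSERT INTO DiscoveryProfiles VALUES (1, 'Nightly WAN'), (2, 'Weekly servers');
    INSERT INTO DiscoveryJobs VALUES (101, 2), (102, 1);
""")

# Tie each job row to its profile through the shared ProfileID column.
JOB_NAME_SQL = """
SELECT p.Name
FROM DiscoveryJobs j
INNER JOIN DiscoveryProfiles p ON p.ProfileID = j.ProfileID
WHERE j.JobID = ?
"""


def profile_name_for_job(job_id):
    """Return the discovery profile name for a job, or None if no match."""
    row = conn.execute(JOB_NAME_SQL, (job_id,)).fetchone()
    return row[0] if row else None


print(profile_name_for_job(101))  # Weekly servers
```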

↧

Using variables in Advanced SQL Report

January 9, 2017, 5:13 pm

A while back I wrote a complex SQL report for Report Writer that used something similar to the ${Variable} pass through to the SWIS when the report was run. I want to say it used the ${PollerID} variable to dynamically assign the custom poller ID within the query eliminating the need to dig up the custom poller ID that was used after creation.

When I try using it now, I get an error in either the Web reporting tool or Report Writer. Did anything change I'm not aware of?

↧

Top Syslog Senders

January 10, 2017, 9:06 am

↧

Top SNMP-Trap Senders

January 10, 2017, 9:11 am

↧