If you've been playing along at home you will likely have implemented adatole's alert and the slightly modified version (by yours truly!) for Linux. While I don't have a problem with either alert for small and medium-sized environments, things can get a little hairy when you start scaling to large environments.
What is large? We have nearly 11,000 nodes and almost 145 million entries in the CPUMultiLoad view. (You can find out how many rows you have by running SELECT COUNT(NodeID) FROM CPUMultiLoad WITH (NOLOCK) against your DB.) While examining our database via Database Performance Analyzer (you have DPA, don't you!?) we noticed that the Alert Status query associated with our CPU alerts kept showing up as a source of blocking. Blocking queries generate wait time for other queries while they consume resources to complete. In this case the blocking appeared to be caused by a query that was taking a long time to execute. I put on my SQL query detective's hat and went to work.
The core of the alert query is a pair of INNER JOINs to get the number of CPUs for each node so that it can be compared to the Win_Processor_Queue_Len value that is captured by the Server and Application Monitor component. The top of the query looks like this:
SELECT DISTINCT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
FROM Nodes
INNER JOIN APM_AlertsAndReportsData ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
INNER JOIN
(SELECT c1.NodeID, COUNT(c1.CPUIndex) as CPUCount
FROM
( SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex FROM CPUMultiLoad ) c1
GROUP BY c1.NodeID
) c2
ON Nodes.NodeID = c2.NodeID
That innermost SELECT statement is where we run through the entire CPUMultiLoad view so that we can then select the NodeID and the count of CPUIndex values. Given that this view (CPUMultiLoad is a view, not a table) has 144 million rows in our DB, and since you don't really change the number of CPUs on a server all that often, there might be a better way to perform this query. Here's what we did:
INNER JOIN APM_AlertsAndReportsData
ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
INNER JOIN
(
SELECT DISTINCT CPUMultiLoad.NodeID, MAX(CPUMultiLoad.CPUIndex)+1 AS CPUCount
FROM CPUMultiLoad
WHERE DATEDIFF(hh,TimeStampUTC,GETDATE()) <=4
GROUP BY CPUMultiLoad.NodeID
) c1
ON Nodes.NodeID = c1.NodeID
Instead of grabbing the data from CPUMultiLoad and then selecting from it again, we removed one level of nested SELECT and selected the NodeID and MAX(CPUMultiLoad.CPUIndex) directly. Of course, CPUIndex values start with zero and, since we want a count for comparison against the number of processes running against those CPUs, we added 1 and called it CPUCount. In order to trim the number of rows read by this SELECT statement (remembering that we have to INNER JOIN the select results, which we called c1, with the Nodes table) we added a WHERE DATEDIFF clause. We take the time stamp in the CPUMultiLoad view and compare the difference, in hours, against the current date and then return only those rows where the difference is less than or equal to 4. Why 4? Our Orion environment is set to Eastern time and UTC is 4 hours different from Eastern time. (I might do 6 hours -- just to be safe for daylight saving time, etc. -- but you get the idea!)
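One possible refinement of that time filter, offered as a sketch rather than something we have benchmarked: GETDATE() returns the server's local time, while TimeStampUTC is presumably stored in UTC (an assumption based on the column name). Comparing the column directly against a UTC cutoff keeps the window at a true four hours regardless of time zone or daylight saving time, and it avoids wrapping the column in a function, which gives the optimizer a chance to use an index on the time stamp if one exists on the underlying tables.

```sql
-- Sketch only: the same derived table, with the time filter rewritten.
-- Assumes TimeStampUTC holds UTC values, as the column name suggests.
-- DISTINCT is left out because GROUP BY already returns one row per NodeID.
SELECT CPUMultiLoad.NodeID,
       MAX(CPUMultiLoad.CPUIndex) + 1 AS CPUCount
FROM CPUMultiLoad
WHERE CPUMultiLoad.TimeStampUTC >= DATEADD(hour, -4, GETUTCDATE())
GROUP BY CPUMultiLoad.NodeID;
```

The 4 here is simply the look-back window in hours; widen it if your polling interval means some nodes might not have a recent row.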
When we ran the two queries back-to-back we found that the updated query returned results 250% faster! I'm not a SQL whiz by any stretch, but I definitely think this is a great step in the right direction.
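If you want to repeat the comparison in your own environment, a minimal sketch using SQL Server's built-in statistics output might look like the following. It times only the two derived tables, not the full alert query, and the results will depend on your data volume and on what is already in the buffer cache, so run each variant more than once before drawing conclusions.

```sql
-- Sketch for comparing the two derived tables in SSMS.
SET STATISTICS TIME ON;  -- reports parse/compile and execution times
SET STATISTICS IO ON;    -- reports logical and physical reads

-- Variant 1: original approach (DISTINCT over the whole view, then COUNT)
SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
      FROM CPUMultiLoad) c1
GROUP BY c1.NodeID;

-- Variant 2: MAX(CPUIndex) + 1 over recent rows only
SELECT CPUMultiLoad.NodeID, MAX(CPUMultiLoad.CPUIndex) + 1 AS CPUCount
FROM CPUMultiLoad
WHERE DATEDIFF(hh, TimeStampUTC, GETDATE()) <= 4
GROUP BY CPUMultiLoad.NodeID;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

The timings and read counts appear on the Messages tab in SSMS, one set per statement.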
How would you improve the query?
For the record, this is the entire query for the alert:
SELECT DISTINCT Nodes.NodeID AS NetObjectID,
Nodes.Caption AS Name
FROM Nodes
INNER JOIN APM_AlertsAndReportsData
ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
INNER JOIN
(
SELECT DISTINCT CPUMultiLoad.NodeID,
MAX(CPUMultiLoad.CPUIndex)+1 AS CPUCount
FROM CPUMultiLoad
WHERE DATEDIFF(hh,TimeStampUTC,GETDATE()) <=4
GROUP BY CPUMultiLoad.NodeID
) c1
ON Nodes.NodeID = c1.NodeID
WHERE Nodes.n_mute <> 1
AND Nodes.Prod_State = 'PROD'
AND APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
AND APM_AlertsAndReportsData.StatisticData <= c1.CPUCount
AND ( (nodes.CPU_Crit is null
AND nodes.CPULoad < 90)
OR (nodes.CPU_Crit is not null
AND nodes.CPULoad < nodes.CPU_Crit) )