Channel: THWACK: All Content - Network Performance Monitor

The Ultimate CPU Alert for Large Environments

If you've been playing along at home, you will likely have implemented adatole's The Ultimate CPU Alert and the slightly modified version for Linux (by yours truly!), The Ultimate CPU Alert ... for Linux!  While I don't have a problem with either alert in small and medium-sized environments, things can get a little hairy when you start scaling to large environments.

 

What is large?  We have nearly 11,000 nodes and almost 145 million entries in the CPUMultiLoad view.  (You can find out how many rows you have by running SELECT COUNT(NodeID) FROM CPUMultiLoad WITH (NOLOCK) against your DB.)  While examining our database via Database Performance Analyzer (you have DPA, don't you!?), we noticed that the Alert Status query associated with our CPU alerts kept showing up as a source of blocking.  Blocking queries generate wait time for other queries while they consume resources to complete.  In this case the blocking appeared to be caused by a query that was taking a long time to execute.  I put on my SQL query detective's hat and went to work.
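If you don't have DPA handy, SQL Server's own dynamic management views can give a rough, point-in-time look at the same kind of blocking. This is only a sketch and not what we ran. It assumes your login has the VIEW SERVER STATE permission, and it only shows requests that are blocked at the moment you run it.

```sql
-- List requests that are currently waiting on another session.
SELECT r.session_id,
       r.blocking_session_id,          -- session holding the resource
       r.wait_type,
       r.wait_time AS wait_time_ms,    -- time spent waiting so far, in milliseconds
       t.text      AS sql_text         -- text of the blocked statement's batch
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
```

Run it a few times while the alert engine is evaluating; if the CPU alert's query text keeps showing up, you are probably looking at the same problem we were.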

 

The core of The Ultimate CPU Alert is a pair of INNER JOINs that get the number of CPUs for each node so that it can be compared to the Win_Processor_Queue_Len value that is captured by the Server and Application Monitor component.  The top of the query looks like this:

 

SELECT DISTINCT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
   FROM Nodes
   INNER JOIN APM_AlertsAndReportsData ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
   INNER JOIN
      (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
       FROM
          (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex FROM CPUMultiLoad) c1
       GROUP BY c1.NodeID
      ) c2
   ON Nodes.NodeID = c2.NodeID

 

That innermost SELECT statement is where we run through the entire CPUMultiLoad view (it is a view, not a table) to collect every distinct NodeID and CPUIndex pair, so that the SELECT wrapped around it can return each NodeID with a count of its CPUIndex values.  Given that this view has 144 million rows in our DB, and since you don't really change the number of CPUs on a server all that often, there might be a better way to perform this query.  Here's what we did:

 

   INNER JOIN APM_AlertsAndReportsData
   ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
   INNER JOIN
      (
      SELECT DISTINCT CPUMultiLoad.NodeID, MAX(CPUMultiLoad.CPUIndex)+1 AS CPUCount
      FROM CPUMultiLoad
      WHERE DATEDIFF(hh,TimeStampUTC,GETDATE()) <=4
      GROUP BY CPUMultiLoad.NodeID
      ) c1
   ON Nodes.NodeID = c1.NodeID

 

Instead of grabbing the data from CPUMultiLoad and then selecting from it again, we removed the nested SELECT (both INNER JOINs are still there) and selected the NodeID and MAX(CPUMultiLoad.CPUIndex) directly.  Of course, CPUIndex values start with zero and, since we want a count for comparison against the number of processes running against those CPUs, we added 1 and called it CPUCount.  To trim the number of rows returned from this SELECT statement (remembering that we have to INNER JOIN the select results, which we called c1, with the Nodes table) we added a WHERE DATEDIFF clause.  We take the time stamp in the CPUMultiLoad view, compute the difference in hours against the current date and time from GETDATE(), and return only those rows where the difference is less than or equal to 4.  Why 4?  Our Orion environment is set to Eastern time and UTC is 4 hours off from Eastern time.  (I might do 6 hours, just to be safe for daylight saving time, etc., but you get the idea!)
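To see why MAX(CPUIndex)+1 can stand in for the distinct count, here is a small self-contained sketch. The table variable @CPUMultiLoad and its sample rows are made up for illustration; they only mimic the three columns used in this post. The two approaches agree only when CPUIndex values are zero-based and have no gaps, which is an assumption about how the polled data looks.

```sql
-- Hypothetical stand-in for the CPUMultiLoad view: one row per node, CPU, and poll.
DECLARE @CPUMultiLoad TABLE (NodeID int, CPUIndex int, TimeStampUTC datetime);

INSERT INTO @CPUMultiLoad (NodeID, CPUIndex, TimeStampUTC)
VALUES (1, 0, '2015-06-01T10:00:00'), (1, 1, '2015-06-01T10:00:00'),  -- node 1, first poll
       (1, 0, '2015-06-01T10:10:00'), (1, 1, '2015-06-01T10:10:00'),  -- node 1, second poll
       (2, 0, '2015-06-01T10:00:00'), (2, 1, '2015-06-01T10:00:00'),
       (2, 2, '2015-06-01T10:00:00'), (2, 3, '2015-06-01T10:00:00');  -- node 2, four CPUs

-- Original approach: de-duplicate (NodeID, CPUIndex) pairs, then count them per node.
SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
FROM (SELECT DISTINCT NodeID, CPUIndex FROM @CPUMultiLoad) c1
GROUP BY c1.NodeID;

-- Revised approach: highest zero-based index plus one, in a single pass.
SELECT NodeID, MAX(CPUIndex) + 1 AS CPUCount
FROM @CPUMultiLoad
GROUP BY NodeID;
```

Both statements return a CPUCount of 2 for node 1 and 4 for node 2. If a node ever reported, say, only indexes 0 and 2, the first would return 2 and the second 3, so it is worth spot-checking a few nodes before trusting the shortcut.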

 

When we ran the two queries back-to-back, we found that the updated query returned results 250% faster!  I'm not a SQL whiz by any stretch, but I definitely think this is a great step in the right direction.

 

How would you improve the query?
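One possible answer, offered as a sketch rather than something we have tested against Orion: compare the time stamp to a UTC cutoff directly. GETDATE() returns the server's local time, while TimeStampUTC is (going by its name, which is an assumption) stored in UTC, so the size of the DATEDIFF window shifts with the server's UTC offset. Wrapping the column in DATEDIFF also means the function has to be evaluated for every row. Leaving the column bare and using GETUTCDATE() sidesteps both, and may let SQL Server use an index on the time stamp if the underlying tables have one. The DISTINCT can go too, since GROUP BY already returns one row per NodeID. The table variable and sample rows below are hypothetical stand-ins for the real view.

```sql
-- Hypothetical stand-in for the CPUMultiLoad view.
DECLARE @CPUMultiLoad TABLE (NodeID int, CPUIndex int, TimeStampUTC datetime);
DECLARE @NowUtc datetime = GETUTCDATE();

INSERT INTO @CPUMultiLoad (NodeID, CPUIndex, TimeStampUTC)
VALUES (1, 0, DATEADD(minute, -10, @NowUtc)),  -- polled 10 minutes ago
       (1, 1, DATEADD(minute, -10, @NowUtc)),
       (2, 0, DATEADD(hour,   -30, @NowUtc));  -- stale row, outside the window

-- Keep only rows from the last 4 hours, measured in UTC on both sides.
SELECT NodeID, MAX(CPUIndex) + 1 AS CPUCount
FROM @CPUMultiLoad
WHERE TimeStampUTC >= DATEADD(hour, -4, @NowUtc)
GROUP BY NodeID;
```

With this sample data only node 1 comes back, with a CPUCount of 2. In the real alert you would swap @CPUMultiLoad for CPUMultiLoad and @NowUtc for GETUTCDATE(). Keep in mind that, as with the DATEDIFF version, a node with no rows inside the window drops out of the INNER JOIN entirely.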

 

For the record, this is the entire query for The Ultimate CPU Alert:

 

SELECT DISTINCT Nodes.NodeID AS NetObjectID,
      Nodes.Caption AS Name
   FROM Nodes
   INNER JOIN APM_AlertsAndReportsData
   ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
   INNER JOIN
      (
      SELECT DISTINCT CPUMultiLoad.NodeID,
            MAX(CPUMultiLoad.CPUIndex)+1 AS CPUCount
      FROM CPUMultiLoad
      WHERE DATEDIFF(hh,TimeStampUTC,GETDATE()) <=4
      GROUP BY CPUMultiLoad.NodeID
      ) c1
   ON Nodes.NodeID = c1.NodeID
   WHERE Nodes.n_mute <> 1
      AND Nodes.Prod_State = 'PROD'
      AND APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
      AND APM_AlertsAndReportsData.StatisticData <= c1.CPUCount
      AND ( (nodes.CPU_Crit is null
            AND nodes.CPULoad < 90)
         OR (nodes.CPU_Crit is not null
            AND nodes.CPULoad < nodes.CPU_Crit) )

