Incorrect Bandwidth reporting in NPM

September 27, 2010, 12:47 pm

We have had issues with incorrect bandwidth utilization reports in NPM for several years now. We notice the issue primarily when interfaces report more bandwidth utilization than is physically possible, like 65Mbps+ on a DS3 and so forth. When tickets were opened on this issue, Solarwinds reported back that our routers were sending the false information. I never completely believed this answer, but didn't have a good way to refute it. Recently we were bringing up a new OC3, and to test it we ran WAN Killer at 75Mbps in both directions across the circuit. The traffic on the interface was a solid 75Mbps in both directions, as there was no other traffic on the link. The Solarwinds graph for this OC3 interface showed traffic rates from 65Mbps to 110Mbps.

This was obviously not a correct representation of the traffic on the link, so I set up PRTG (another SNMP tool) to poll the same router. The PRTG graph was a flat line at 75Mbps, as expected, which proves that the router is correctly reporting the bandwidth utilization on the link.

I've had an open ticket with Solarwinds on this issue for several weeks now, but I'm still waiting on any kind of fix or explanation for the problem.
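
For anyone comparing pollers, the rate an SNMP tool graphs is derived from the difference between two octet-counter samples. The following is a minimal sketch of that arithmetic, assuming 64-bit octet counters such as ifHCInOctets. The function name and the sample values are hypothetical, and nothing here is claimed to be how NPM or PRTG calculates rates internally.

```python
def rate_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Average bits per second between two octet-counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 2 ** counter_bits
    # Modular subtraction tolerates a single counter wrap between samples.
    delta = (curr_octets - prev_octets) % modulus
    return delta * 8 / interval_s


# Hypothetical samples taken 60 seconds apart on a link carrying a steady 75 Mbps.
octets_per_minute = 75_000_000 * 60 // 8
print(rate_bps(1_000, 1_000 + octets_per_minute, 60) / 1e6)
```

Note that the result depends on the divisor as much as on the counters: if the real time between samples differs from the interval used in the division, the computed rate drifts away from the true value. Whether that played any part in the graphs described here is not established by this post.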

I'm curious if any other Solarwinds NPM users have experienced this problem.

↧

Solarwinds is still not stable

July 5, 2018, 6:38 am

The other thread is closed, so I figured I would start a new one. I usually get more help here than by actually contacting support.

So, same issues as before, but instead of the server not responding after 36 hours or so, it took maybe a week. It is the SAME set of issues.

1. Server stopped sending alerts out sometime around 11AM on the 4th.

2. Logged on to the server and opened Orion Service Manager; both the module engine and the administration service were going back and forth between running and stopping.

3. Orion could not connect to SQL.

4. I have some alerts that are going out, but I'm not sure if they are legit or not.

5. After the reboot I noticed that a good chunk of my nodes' interfaces are 'unknown'. This looks like it fixes itself, but again, something else is going on.

I have applied the 'hotfix' that you all pushed out to try to fix this.

I have done the change from streaming to buffered.

I have done the registry change for the ports.

The only thing I have not done is revert the snapshots back to June 14th, prior to the update, so Solarwinds is stable again.

At this point I am going to schedule a task in VMware to reboot the server every night. That is pretty much the only way I will know Solarwinds will actually work.

Thoughts? serena aLTeReGo

↧

Orion Platform 2018.2 Improvements - Chapter One

April 23, 2018, 1:14 pm

The time has come again for another exciting rundown of some of the improvements and enhancements coming your way in the next major installment of the Orion Platform. For those who may not be familiar, Orion is the foundational component upon which product modules such as Network Performance Monitor (NPM), Server & Application Monitor (SAM), and many others are built. Platform capabilities are available to, or can be leveraged by, modules which run atop the Orion Platform. In most cases, those enhancements, such as PerfStack, are available regardless of which Orion module(s) you are running. In others, it may be something which individual modules can extend for their own purposes, such as the Orion Agent, which has been the basis for delivering amazing new capabilities, from NetPath and QoE in NPM to Application Dependency Mapping (ADM) and IaaS monitoring in SAM.

UPS Monitoring

Several years ago I created a Universal Device Poller (UnDP) for monitoring APC SmartUPS devices, and to this day it remains one of the most popular UnDP downloads for NPM, if not the most popular. Universal Device Pollers are an incredibly powerful feature of NPM, allowing you to monitor virtually anything about a device which is managed via SNMP. However, there comes a time when certain functionality becomes so ubiquitous that it makes sense to promote it to native functionality of the monitoring solution and not require users to create it themselves. So in this 2018.2 release of the Orion Platform included with NPM 12.3, that's precisely what we set out to do, while also making some improvements along the way.

If you haven't already done so, you'll want to start by adding your APC UPS equipment to Orion. You can do so individually using the 'Add Node Wizard' [Settings > All Settings > Add Node], or in bulk using Sonar Discovery [Settings > All Settings > Discover Network]. If you are adding the devices using the 'Add Node Wizard', you will notice a new option listed for your APC UPS equipment entitled 'UPS'. Checking the box next to this option will enable UPS polling for this device.

[Screenshots: List Resources; Power Control Unit Status Resource; UPS Firmware Version]

Once you've successfully completed the 'Add Node Wizard' and navigated to the 'Node Details' view of your newly added UPS device, you will notice a newly added resource entitled 'Power Control Unit Status'. This resource reflects the most important information about your UPS device, including things such as its overall status, time on battery, and the battery's current charge capacity. This information can, as you would expect, be utilized in Alerts to notify you of things such as when the UPS is on battery, if a battery needs replacing, or if the battery is reaching an unsafe operating temperature. You may also notice that the 'Software Version' field in the 'Node Details' resource now accurately reflects the firmware version installed and running on the UPS.

Currently, this new capability is limited exclusively to APC (American Power Conversion) SmartUPS Uninterruptible Power Supplies (UPS) containing Network Management (AKA Web/SNMP) cards. This feature does not support APC's unmanaged BackUPS series, nor does it yet support other UPS vendors, such as Eaton, Tripp Lite, or CyberPower. At least for now, we recommend using the Universal Device Poller to monitor similar metrics for UPS vendors other than APC. We will, however, be keeping a close eye on the NPM feature request forum to gauge interest in native support for other UPS vendors.

Linux/Unix Load Average

In a similar vein to UPS monitoring discussed above, we learned from speaking with our customers over the years, as well as from those participating in the Orion Improvement Program, that monitoring Load Average on Linux and Unix systems ranks among the most popular uses of the Universal Device Poller. In our enduring pursuit to deliver unexpected simplicity to our customers, we realized that collecting these important metrics natively was something which was long overdue.

Beginning in Orion Platform 2018.2, and included with NPM 12.3, Load Average is collected automatically for any node which supports it. This is typically any Linux-based operating system, but can also extend to FreeBSD, AIX, and other Unix-like OSes. The Load Average metrics are collected for nodes monitored via the Orion Agent, as well as those managed agentlessly via SNMP. There are really no additional steps required if you added your nodes using the default selection. Since Load Average has a direct correlation to CPU utilization, it's intuitively tied to the existing 'CPU & Memory' option shown under 'List Resources'. When selected, Load Average statistics are collected automatically if the node being monitored supports them.

[Screenshots: List Resources - CPU & Memory; Load Average Resource]

On the 'Node Details' view of your Linux servers, you will notice a snazzy new resource entitled 'Load Average' which displays the one-minute, five-minute, and fifteen-minute load averages of the machine being monitored. Because Load Average metrics are tightly coupled to the number of CPU cores, we extended Orion's alerting to allow you to combine Load Average statistics with CPU count within your Alert Trigger so you can be notified when your system is under strain.
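
To illustrate why the core count matters, here is a minimal sketch of normalizing a load average per CPU core. The threshold of 1.0 per core and the function names are assumptions made for illustration; they are not Orion's alert logic or defaults.

```python
def load_per_core(load_average, cpu_count):
    """Load average divided by the number of CPU cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def under_strain(load_average, cpu_count, threshold=1.0):
    """True when the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_average, cpu_count) > threshold


# The same load average of 6.0 means different things on 8 cores and on 4 cores.
print(under_strain(6.0, 8))
print(under_strain(6.0, 4))
```

With these assumed values, a load of 6.0 is comfortable on an 8-core machine (0.75 per core) but indicates queued work on a 4-core machine (1.5 per core).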

Load Average has also been added to the default PerfStack metrics for the node, meaning if you click on the 'Performance Analysis' button on the 'Management' resource of the 'Node Details' view for a Linux server, you'll be taken to PerfStack where these Load Average statistics are automatically prepopulated. Similarly, if you're already working in PerfStack, you can drag the node itself onto the chart area, and the Load Average statistics, as well as the other default metrics for the node, will populate the PerfStack dashboard.

Group Availability

Ever since bshopp introduced us to Orion Groups back in NPM 10.1, we've heard from many of you that the manner in which availability is calculated for these groups just didn't jibe with how you think about availability in your environment, nor did it provide a valuable measurement for use in your SLAs. Sadly, Group Availability in Orion is calculated binarily. Put simply, the group is either 100% 'Up' or it's 100% 'Down' regardless of the number of members contained within the group. What this usually meant was, so long as at least one member in the group was 'Up', the availability of the group was 100%. That remained true even if there were 99 other things 'Down' in that group at that time. I know, it sounds odd when you say it aloud or even when you're writing it down, but that's how it's been for years and somehow we've managed to muddle through. Well, in this release of the Orion Platform, no longer will you be forced to just muddle through. Today we heed your cries!

Rather than turn the world on its end, causing lots of confusion and alert storms in our wake, we left the legacy Group Availability metric in place, untouched. I know that will come as a big relief to those of you who have grown dependent upon this method of calculating availability and have built reports and alerts around this metric. What we chose to do instead is introduce a new Group metric entitled 'Group Members Availability', which, as one would expect, properly and accurately calculates the availability of the group based on its members. This includes nested groups as well.
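
The difference between the two metrics can be sketched as follows. The post does not give the exact formula behind 'Group Members Availability', so this sketch assumes a simple average over members at a single point in time, and it models the legacy metric with the "at least one member is up" rule described above. The function names are hypothetical, and nested groups are not modeled.

```python
def legacy_group_availability(member_up):
    """Binary metric: 100% if at least one member is up, otherwise 0%."""
    return 100.0 if any(member_up) else 0.0


def members_availability(member_up):
    """Assumed member-based metric: percentage of members that are up."""
    if not member_up:
        raise ValueError("group has no members")
    return 100.0 * sum(member_up) / len(member_up)


# One member up and 99 members down, as in the example above.
members = [True] + [False] * 99
print(legacy_group_availability(members))
print(members_availability(members))
```

Under these assumptions the legacy metric reports 100% for the group, while the member-based metric reports 1%.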

This new 'Group Members Availability' metric appears automatically on the 'Group Details' view upon group creation. We will also start calculating this new metric upon upgrade to Orion Platform 2018.2 if you already have existing groups. So there's really nothing you need to do. We even include a new out-of-the-box report we refer to as 'Members Based Group Availability Report - Last Month' which serves as an example for how easily this metric can be added to your own reports compared to some of the complex SQL queries some had attempted to use in the past. You can even leverage this new Group Members Availability metric in your alerting conditions with no fuss!

And More!

There's still plenty more we've managed to jam pack into this release of the Orion Platform that we're particularly excited about and would love to get your feedback on. Stay tuned to learn about some of the mapping improvements jblankjblank has whipped up and the many usability enhancements serena has crammed into this release, such as sexy new hovers, a new PerfStack widget, and additional improvements that we've made to ensure your next upgrade experience is great!

↧

Muted Nodes Resource

December 20, 2017, 1:17 pm

Muting of Nodes, Interfaces, and Applications is a great option added to Orion, but I noticed that there wasn't an easy way to see these in a report. With that in mind, I added a custom report in my environment. There are three variations of the report: for those running Network Performance Monitor, Server & Application Monitor, or both.

You can also easily see muted elements from the new Managed Entities view, but I want them quicker.

Managed Entities is restricted to pseudo-admins, and some of my users don't have access to that page, nor do I want them to have access. They could run the report, but I don't want them to have to run a report whenever they need a list of elements that are muted.

Since the report is based around SWQL, I can leverage the same to build a custom query resource on the Enterprise Dashboard.

For those new to the suite, I wanted to give you the step-by-step skinny on how I did this.

Start by clicking on the pencil icon on the top-left of the page to Customize the page.

Next, we've got to add a new widget.

Search in the Available Widgets for the Custom Query.

Now drag that Widget to a new location on your page.

In the Add Widgets bar, click "Done Adding Widgets."

In the Customize Page bar, click on "Done Editing."

The resource is currently empty, so let's edit it.

Here's the meat and potatoes for the widget.

These are the settings that I'm using:

Title: Muted Alerts

Subtitle: Current or Scheduled Muted Alerts

Custom SWQL Query:

SELECT DISTINCT
    CASE
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%' AND [EntityUri] NOT LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/%'
            THEN [N].[Caption]
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Interfaces/InterfaceID=%'
            THEN [I].[FullName]
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Applications/ApplicationID=%'
            THEN [AA].[FullyQualifiedName]
        ELSE 'SomethingElse'
    END AS [Element],
    CASE
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%' AND [EntityUri] NOT LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/%'
            THEN [N].[DetailsUrl]
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Interfaces/InterfaceID=%'
            THEN [I].[DetailsUrl]
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Applications/ApplicationID=%'
            THEN [AA].[DetailsUrl]
        ELSE 'SomethingElse'
    END AS [_LinkFor_Element],
    [AE].AccountID AS [By],
    ToLocal([SuppressFrom]) AS [Start],
    ToLocal([SuppressUntil]) AS [End]
FROM Orion.AlertSuppression AS [AlertSup]
LEFT OUTER JOIN Orion.Nodes AS [N]
    ON [AlertSup].[EntityUri] = [N].[Uri]
LEFT OUTER JOIN Orion.NPM.Interfaces AS [I]
    ON [AlertSup].[EntityUri] = [I].[Uri]
LEFT OUTER JOIN Orion.APM.Application AS [AA]
    ON [AlertSup].[EntityUri] = [AA].[Uri]
LEFT OUTER JOIN Orion.AuditingEvents AS [AE]
    ON [AE].AuditEventMessage LIKE CONCAT('%',
        CASE
            WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%' AND [EntityUri] NOT LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/%'
                THEN [N].[NodeName]
            WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Interfaces/InterfaceID=%'
                THEN [I].[InterfaceCaption]
            WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Applications/ApplicationID=%'
                THEN [AA].[Name]
            ELSE 'Wrong'
        END, '%')
    AND [EntityUri] LIKE CONCAT('%=', [AE].NetObjectID)
INNER JOIN Orion.AuditingActionTypes AS [AT]
    ON [AE].ActionTypeID = [AT].ActionTypeID
WHERE [AT].ActionType IN ('Orion.AlertSuppressionAdded', 'Orion.AlertSuppressionChanged')
ORDER BY [SuppressFrom]

Search SWQL Query:

SELECT DISTINCT
    CASE
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%' AND [EntityUri] NOT LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/%'
            THEN [N].[Caption]
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Interfaces/InterfaceID=%'
            THEN [I].[FullName]
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Applications/ApplicationID=%'
            THEN [AA].[FullyQualifiedName]
        ELSE 'SomethingElse'
    END AS [Element],
    CASE
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%' AND [EntityUri] NOT LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/%'
            THEN [N].[DetailsUrl]
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Interfaces/InterfaceID=%'
            THEN [I].[DetailsUrl]
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Applications/ApplicationID=%'
            THEN [AA].[DetailsUrl]
        ELSE 'SomethingElse'
    END AS [_LinkFor_Element],
    [AE].AccountID AS [By],
    ToLocal([SuppressFrom]) AS [Start],
    ToLocal([SuppressUntil]) AS [End]
FROM Orion.AlertSuppression AS [AlertSup]
LEFT OUTER JOIN Orion.Nodes AS [N]
    ON [AlertSup].[EntityUri] = [N].[Uri]
LEFT OUTER JOIN Orion.NPM.Interfaces AS [I]
    ON [AlertSup].[EntityUri] = [I].[Uri]
LEFT OUTER JOIN Orion.APM.Application AS [AA]
    ON [AlertSup].[EntityUri] = [AA].[Uri]
LEFT OUTER JOIN Orion.AuditingEvents AS [AE]
    ON [AE].AuditEventMessage LIKE CONCAT('%',
        CASE
            WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%' AND [EntityUri] NOT LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/%'
                THEN [N].[NodeName]
            WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Interfaces/InterfaceID=%'
                THEN [I].[InterfaceCaption]
            WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Applications/ApplicationID=%'
                THEN [AA].[Name]
            ELSE 'Wrong'
        END, '%')
    AND [EntityUri] LIKE CONCAT('%=', [AE].NetObjectID)
INNER JOIN Orion.AuditingActionTypes AS [AT]
    ON [AE].ActionTypeID = [AT].ActionTypeID
WHERE [AT].ActionType IN ('Orion.AlertSuppressionAdded', 'Orion.AlertSuppressionChanged')
    AND CASE
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%' AND [EntityUri] NOT LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/%'
            THEN [N].[Caption]
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Interfaces/InterfaceID=%'
            THEN [I].[FullName]
        WHEN [EntityUri] LIKE 'swis://%/Orion/Orion.Nodes/NodeID=%/Applications/ApplicationID=%'
            THEN [AA].[FullyQualifiedName]
        ELSE 'SomethingElse'
    END LIKE '%${SEARCH_STRING}%'
ORDER BY ToLocal([SuppressFrom])

Number Of Rows Per Page: 5 (choose this number as you like)

Submit the changes and you are set.
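
For those wondering how the CASE branches in the queries tell nodes, interfaces, and applications apart, the matching on EntityUri can be sketched outside of SWQL. This is only an approximation of the LIKE patterns using regular expressions; the host name in the sample URIs and the function name are hypothetical.

```python
import re

# Each pattern mirrors one WHEN branch of the CASE expression.
# The node pattern rejects any trailing path segment, like the NOT LIKE '.../NodeID=%/%' test.
_PATTERNS = [
    ("node", re.compile(r"^swis://.*/Orion/Orion\.Nodes/NodeID=[^/]*$")),
    ("interface", re.compile(r"^swis://.*/Orion/Orion\.Nodes/NodeID=.*/Interfaces/InterfaceID=.*$")),
    ("application", re.compile(r"^swis://.*/Orion/Orion\.Nodes/NodeID=.*/Applications/ApplicationID=.*$")),
]


def classify(entity_uri):
    """Return which kind of element a muted EntityUri refers to."""
    for kind, pattern in _PATTERNS:
        if pattern.match(entity_uri):
            return kind
    return "SomethingElse"  # same fallback label as the ELSE branch


base = "swis://orion.example.local/Orion/Orion.Nodes/NodeID=12"  # hypothetical host
print(classify(base))
print(classify(base + "/Interfaces/InterfaceID=34"))
print(classify(base + "/Applications/ApplicationID=56"))
print(classify(base + "/Volumes/VolumeID=78"))
```

Anything that is not a node, interface, or application URI falls through to 'SomethingElse', which is why that label can show up in the widget for other muted entity types.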

Now you have quick access to all the elements that you have muted... oh and you get to see who requested the alerts being muted.

↧

Help to create the report (in SQL )for applications polling failed.

July 5, 2018, 3:23 pm

Hi,

Please help me create a report (in SQL) for applications whose polling has failed, similar to "HELP TO CREATE THE COMMUNITY STRING FAILED REPORT?"

Scenario: Sometimes applications don't ping (they show an unknown state), but after we run a test (Edit Application > Test), they come back up.

↧

Interface Downtime on NOC screen

July 5, 2018, 2:53 pm

I would like to get this on our NOC somehow. I don't see any way to add this as a custom resource/object or anything to the NOC screen.
Any ideas?

↧

Can we track Orion services like who has restarted and when?

July 5, 2018, 3:40 pm

Hi, can we track Orion services, such as who restarted them and when?

↧

Control M Jobs failed Dashboard or count status?

July 4, 2018, 7:09 am

Hello Team,

I am wondering about a Control M jobs dashboard in Solarwinds. Is it possible to see the job details or the failed-job count in Solarwinds?

We are monitoring Control M jobs via trap messages and converting critical jobs into ServiceNow alerts. Is there any possibility of getting the Control M failed-job count in Solarwinds through trap messages?

↧

Orion DB migration from SQL 2008 to SQL 2016

May 21, 2018, 12:46 pm

Hi,

We would like to upgrade our existing DB in SQL 2008 to SQL 2016. Below is the plan which we created.

1. Take backup of existing DB in SQL 2008

2. Create a new server with Windows 2016 OS 64 bit

3. Install MS SQL 2016 SP1 Standard in the new server

4. Copy the backup (from the SQL 2008 DB) into the new SQL 2016 database

5. Run the Configuration Wizard on the primary polling engine and additional polling engines

6. Check the application functionality

Can someone please review this? We mainly need to know about restoring the backup from SQL 2008 to SQL 2016.

Are there any issues foreseen in this plan?

↧

Using Your Custom HTML Resource To Properly Display SWQL Query Results

June 29, 2018, 9:39 am

↧

Node Renaming Query

July 6, 2018, 2:44 am

So, I've seen that bulk rename of nodes isn't available but there are some kludges / workarounds that I have seen. Not keen on using any of those right now as it means diving into the database.

My query is relatively simple, I guess.

I have a node that is discovered as, say ABC_123. If I edit that node and rename it to: Prefix-ABC_123 will that rename remain after a re-discover operation or will it default back?

↧

Node Notes resource for Summary Views

July 5, 2018, 2:59 am

Hi folks,

I am trying to create a resource that will display Node Notes for Unmanaged nodes.

So far I've tried doing this with a Custom Table and dynamic query builder but I can't find the query for Node Notes.

The only other option, it seems, is to create a custom SWQL/SQL query, but I am looking for some help with joining data from the NodesData and NodeNotes tables.

I wish to have the following included in the table:

Caption (hyperlinked), StatusLED, Unmanaged From, Unmanaged Until, Node Notes

Any help with this query would be greatly appreciated.

↧

What is the #1 networking problem you need to solve in the next 30 days?

January 29, 2015, 3:49 pm

Please expand on “Other” and why by adding a comment below.

↧

Are your Orion server and SQL database server in the same Active Directory domain?

June 13, 2017, 11:49 am

↧

report schedule each minute

July 6, 2018, 5:24 am

Hi,

I'm desperately looking for any way to schedule a report to run every minute with the Report Scheduler.

Is there any hack to achieve this?

↧

NPM 12.3 Orion 2018.2 Upgrade Feedback

June 4, 2018, 5:42 pm

What has your upgrade to NPM 12.3 on Orion Platform 2018.2 looked like? We on the product manager team would like to hear about it all: the good, the bad, and the ugly! As a starting point, here is a quick getting-started blog post on upgrading to the 2018.2 Orion Platform: Preparing for the Upgrade to 2018.2

↧

SQL Database Questions

July 6, 2018, 8:24 am

I have a ticket open with support but also wanted to see if anyone here could help. One of the DBAs on my team ran an indexing tool to check the health of the SQL database used by Orion and found that some indices are missing, which is impacting the performance of Solarwinds, and that there are also some deadlocks in the database. Does anyone know if Solarwinds has recommendations on how to fix this? We don't want to go in and fix things and then find out it's the wrong way to fix them. We are running the latest version of SQL and it is at SP2. Thanks - Dave

↧

Linux Memory Utilization Monitors and You

February 17, 2012, 4:48 pm

Our client is wondering why the values in Solarwinds do not reflect the values found on their servers:

top - 17:58:42 up  1:44,  1 user,  load average: 0.03, 0.06, 0.06
Tasks:  94 total,   1 running,  93 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.7%us,  0.2%sy,  0.0%ni, 94.8%id,  1.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8174656k total,  1725996k used,  6448660k free,    39772k buffers
Swap:  8388600k total,        0k used,  8388600k free,   285544k cached

= ~21% Utilization

$ free -m
             total       used       free     shared    buffers     cached
Mem:          7983       1684       6298          0         39        278
-/+ buffers/cache:       1366       6616
Swap:         8191          0       8191

= ~21% Utilization

Solarwinds = 17% utilization

Figuring that this was just a case of SNMP sending slightly different data I tried a basic snmpwalk against memory:

$ snmpwalk -v 2c -c xxxxxxxxxx localhost Memory
UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
UCD-SNMP-MIB::memErrorName.0 = STRING: swap
UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 8388600
UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 8388600
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6446020
UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 14834620
UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000
UCD-SNMP-MIB::memShared.0 = INTEGER: 0
UCD-SNMP-MIB::memBuffer.0 = INTEGER: 42552
UCD-SNMP-MIB::memCached.0 = INTEGER: 285616
UCD-SNMP-MIB::memSwapError.0 = INTEGER: 0
UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING:

1-(memAvailReal/memTotalReal) = ~21%

Even when I manually enter the OIDs I receive the same basic results.

$ snmpwalk -v 2c -c xxxxxxx localhost .1.3.6.1.4.1.2021.4.5.0
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
$ snmpwalk -v 2c -c xxxxxxx localhost .1.3.6.1.4.1.2021.4.6.0
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6400580

= ~21%

I'm having a hard time explaining to our client why Solarwinds is reporting a 4% lower utilization than they are seeing on the server itself. 4% could be the difference between an alert being generated or not, so you can see where the dilemma is coming from.
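
The gap can be reproduced arithmetically from the first snmpwalk output above. The sketch below is only an assumption about where a 17% figure could come from: it subtracts memBuffer and memCached from used memory, as the "-/+ buffers/cache" row of free does. Nothing in this post confirms that Solarwinds calculates memory this way, and the function name is hypothetical.

```python
def used_pct(total_kb, avail_kb, buffer_kb=0, cached_kb=0):
    """Percent of real memory in use, optionally excluding buffers and cache."""
    used_kb = total_kb - avail_kb - buffer_kb - cached_kb
    return 100.0 * used_kb / total_kb


# Values taken from the first snmpwalk output (all in kB).
mem_total_real = 8174656
mem_avail_real = 6446020
mem_buffer = 42552
mem_cached = 285616

# Buffers and cache counted as used: 1 - (memAvailReal / memTotalReal).
print(round(used_pct(mem_total_real, mem_avail_real), 1))
# Buffers and cache treated as reclaimable and excluded from "used".
print(round(used_pct(mem_total_real, mem_avail_real, mem_buffer, mem_cached), 1))
```

The first figure comes out at about 21% and the second at about 17%, which also lines up with the 1366 MB used in the "-/+ buffers/cache" row of free (1366 / 7983).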

We have seen similar situations on Linux disk monitors, but in that case we are able to see how the values are being pulled more or less directly from SNMP. When we can fall back on Solarwinds using the SNMP reported data we are able to explain why utilization levels in Solarwinds do not reflect those on the server itself. In this case we are really at a loss for an explanation.

Is Solarwinds using a different OID? If so, is there a way to change the OID that is being used to the ones I just showed above without resorting to a UnDP (Universal Device Poller) or something? Can someone provide me with the formula that is being used to calculate Memory Used on the CPU Load & Memory Utilization module?

Thanks in advance,

Bob

↧