Thursday, just before the end of the week and right after a team meeting, I received the call. We all know it; it starts with "What the #$#@ is going on with the network!?" (for some reason it always gets blamed on the network, never the servers). The issue reported was: users on the east side of the building were having slow logins, slow email (whatever that means), application timeouts, and an inability to log into other applications. I logged into SolarWinds and checked:
- Network utilization: all looks fine; the switch in question isn't even in the top five for any list.
- Virtualization Manager: no new issues.
- SAM: no server errors.
- vSphere: no alerts.
- Pinged the servers that were reported as giving issues: pinged fine.
- Pinged workstations and printers in the department having issues: pinged fine.
- VoIP phones: working at 100%.
The network admin logged into the switch and noticed there was a moderate level of discards on one of the two links from the switch. He shut off the link reporting discards, and that seemed to fix the problem. The number of discards was not high enough to stand out from our other switches.
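To put a number on "not high enough to stand out", one option is to compare discards as a percentage of packets over an interval instead of as raw counts. This is only a minimal sketch, assuming two samples of an interface's discard and packet counters taken some time apart; the function name and the sample values are made up for illustration and are not readings from our switches:

```python
def discard_rate(discards_before, discards_after, packets_before, packets_after):
    """Return discards as a percentage of packets seen between two counter samples.

    Assumes the counters only increase between samples (no counter wrap or reset).
    """
    discards = discards_after - discards_before
    packets = packets_after - packets_before
    if packets <= 0:
        # No traffic in the interval (or a counter reset), so no meaningful rate.
        return 0.0
    return 100.0 * discards / packets


# Hypothetical counter samples, for illustration only.
rate = discard_rate(1200, 1500, 4_000_000, 4_600_000)
print(f"{rate:.3f}% of packets discarded in the interval")
```

Running the same calculation per uplink on each switch would make it easier to see whether one link is actually worse than its peers, since raw counts depend on how long the counters have been accumulating.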
The issue with that "fix" is that it shouldn't have made THAT much of a difference, as each switch has two 1 Gb fiber links to the main building switch. So if one was having issues, traffic could just go over the other with little impact.
Our HIPAA guy has been messing with IPsec, and it happens to be in the wing of the building where the issues were occurring. I know little about IPsec, but from what I have read it can cause a bottleneck. He says he hasn't enabled it on anything yet, though...
I will have to answer to my boss on Monday as to why this wasn't found sooner. I am not sure what to tell him.
I put together a quick diagram; switch 2 is the one where the issues were reported. Any thoughts as to what I may have missed when looking for the bottleneck?