We recently started getting errors from SAM/NPM, where it reports a bunch of nodes as "down" because it can't ping them. It looks like some sort of network congestion, on a network that had been perfectly healthy for years until last week: high response times on multiple nodes, a high number of errors, and some applications going down as well. I'd appreciate any tips and best practices on how to troubleshoot this.