This article first appeared in Network Computing.
As data rates increase toward 100G, troubleshooting network issues is becoming increasingly mission-critical. To identify and remediate service-impacting issues and lower MTTR (Mean-Time-To-Resolution), ITOps teams need to monitor a wide range of metrics and data sources, including packet data, in real time.
Network infrastructure troubleshooting is a multiple-layer process – from the vague “something’s wrong” to the root cause analysis of a specific problem. The more disciplined the process and the better the understanding of the correlation between network behavior and issues impacting end-users, the faster problems can be resolved or handed off to appropriate teams for remediation.
The perennial challenge with this process is that user complaints are usually vague. Users (whether employees, customers, or even algorithms that are sensitive to networking conditions) typically experience one of three things: “I can’t connect,” “The network is too slow,” or “My voice/video call quality is bad.” Since each of these can be caused by multiple underlying issues, IT teams often struggle to narrow things down. For example, a slow network could be caused by network, application, or protocol latency, each of which might show itself through any one of a number of different metrics. To the frustrated end-user, it all looks the same, and much can be lost in translation.
To find the root cause and speed up issue resolution, IT teams need both the right tools for assessing network metrics and a clear view of the correlation between user experience, measurable network behavior, and underlying network issues. To illustrate, let’s walk through the troubleshooting process.
Step One: Collect the relevant metrics
Organizations rely on many sources and types of network data to provide context to end-user complaints. The fundamental need is to set up the network monitoring infrastructure so that IT has access to packet data, flow data, events and telemetry data, and server KPIs. This gives teams the insight they need to identify the root cause in various scenarios. Particular metrics are relevant to specific issues. For “the network is slow,” the correlating metrics would be one-way latency, round-trip time, Z-Win, DNS or HTTP latency, throughput (Gbps), packets per second (PPS), connections per second (CPS), or concurrent connections (CC). For “quality is poor,” look at jitter, sequence errors, retransmissions, and fragmentation. When “connectivity” is the problem, examine ICMP, HTTP, and SYN/ACK errors.
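One way to make this complaint-to-metric mapping concrete is a small lookup table that a triage script could consult. The following is only a minimal Python sketch: the category keys ("slow", "poor quality", "connectivity"), the table name, and the function name are illustrative assumptions rather than part of any particular monitoring product, and the metric names are taken directly from the list above.

```python
from typing import Dict, List

# Hypothetical lookup table: category keys are illustrative labels for the
# three complaint types; metric names come from the text above.
METRICS_BY_COMPLAINT: Dict[str, List[str]] = {
    "slow": [
        "one-way latency",
        "round-trip time",
        "Z-Win",
        "DNS latency",
        "HTTP latency",
        "throughput (Gbps)",
        "packets per second (PPS)",
        "connections per second (CPS)",
        "concurrent connections (CC)",
    ],
    "poor quality": [
        "jitter",
        "sequence errors",
        "retransmissions",
        "fragmentation",
    ],
    "connectivity": [
        "ICMP errors",
        "HTTP errors",
        "SYN/ACK errors",
    ],
}


def metrics_for(complaint: str) -> List[str]:
    """Return the metrics to examine first for a complaint category.

    The lookup ignores case and surrounding whitespace. An unknown
    category raises KeyError so that a typo is not silently ignored.
    """
    key = complaint.strip().lower()
    if key not in METRICS_BY_COMPLAINT:
        raise KeyError(f"unknown complaint category: {complaint!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(METRICS_BY_COMPLAINT[key])


if __name__ == "__main__":
    for category in METRICS_BY_COMPLAINT:
        print(f"{category}: {', '.join(metrics_for(category))}")
```

A table like this is only a starting point for triage. In practice a single complaint may warrant checking metrics from more than one category.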
Muhammad Haseeb is Director of Product Management at cPacket Networks.