Network Performance Monitoring: What to Watch and Why


Most network problems don't announce themselves. A switch port degrading over weeks doesn't send a memo. A WAN circuit operating at 94% utilization during business hours doesn't file a ticket. A misconfigured QoS policy silently throttling your VoIP traffic doesn't show up in your help desk queue until users are already complaining about dropped calls. By the time a network issue is visible to end users, it has typically been developing — and measurable — for hours, days, or longer.

The difference between reactive and proactive network monitoring is the difference between your IT team spending their days responding to crises and spending their days preventing them. This guide covers the metrics that matter, the data collection technologies that surface them, how to configure alerting thresholds that are actually useful, and how IT Center’s 24/7 NOC monitoring stack fits into a complete monitoring architecture.

The Four Core Performance Metrics

Every network performance conversation starts with four fundamental metrics. Everything else is context built on top of these.

Latency is the round-trip time (RTT) for a packet to travel from one point to another and back. It's measured in milliseconds. For internal LAN traffic, latency should be sub-millisecond. For traffic crossing your WAN to a cloud data center, 5–15ms is typical on a well-provisioned circuit. For internet traffic to a remote SaaS application, 20–50ms is normal depending on geographic distance. What matters is consistency and baseline deviation — a link that normally shows 12ms round-trip time that suddenly shows 85ms has a problem, even if 85ms is technically within some published SLA.
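The baseline-deviation idea can be sketched in a few lines of Python. This is only an illustration: the function name `latency_status`, the sample history, and the choice of the median as the baseline are assumptions, and the 2x and 3x ratios are example starting points rather than fixed rules.

```python
from statistics import median

def latency_status(history_ms, current_ms, warn_ratio=2.0, crit_ratio=3.0):
    """Compare a new RTT sample against the median of recent history."""
    baseline = median(history_ms)      # median resists one-off spikes
    ratio = current_ms / baseline
    if ratio > crit_ratio:
        return "critical"
    if ratio > warn_ratio:
        return "warning"
    return "ok"

# Hypothetical RTT history (ms) for a link that normally sits near 12ms
history = [11, 12, 12, 13, 12, 14, 12]

print(latency_status(history, 13))   # close to baseline
print(latency_status(history, 30))   # more than 2x baseline
print(latency_status(history, 85))   # more than 3x baseline
```

The point of comparing against the link's own history is that the same 85ms reading could be normal on one path and a clear fault on another.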

Latency has an outsized impact on real-time applications. VoIP calls become difficult to understand when one-way delay exceeds 150ms. Video conferencing becomes disruptive above 100ms. Remote desktop sessions become frustrating above 80ms. The latency thresholds that matter are not generic industry benchmarks — they're the thresholds at which your specific applications become noticeably worse for your users.

Jitter is the variation in latency over time: specifically, the inconsistency of the delay experienced by successive packets on the same path. A connection with 20ms average latency but ±15ms jitter is far more problematic for VoIP and video than a connection with 35ms average latency and ±2ms jitter. Real-time applications depend on predictable delivery intervals; jitter destroys that predictability. For VoIP, jitter is typically acceptable under 10ms, and audible artifacts begin once it rises above 20–30ms.
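To make the comparison concrete, here is a minimal sketch that measures jitter as the mean absolute difference between successive latency samples. This is a simplified measure chosen for illustration (monitoring tools and RTP stacks often use a smoothed estimator instead), and the two sample lists are hypothetical.

```python
from statistics import mean

def mean_jitter_ms(latencies_ms):
    """Mean absolute difference between successive latency samples."""
    if len(latencies_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return mean(diffs)

# Hypothetical samples in ms: path_a is faster on average but erratic,
# path_b is slower on average but steady.
path_a = [20, 35, 8, 30, 7, 20]
path_b = [35, 36, 34, 35, 37, 33]

for name, samples in (("path_a", path_a), ("path_b", path_b)):
    print(name, "avg:", mean(samples), "jitter:", mean_jitter_ms(samples))
```

With these samples, path_a averages 20ms with 20ms of jitter, while path_b averages 35ms with 2ms of jitter, so path_b is the better choice for voice despite its higher average latency.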

Packet loss is the percentage of transmitted packets that never arrive at their destination. At the protocol level, TCP handles packet loss through retransmission: the sender detects the missing acknowledgment and resends the packet. But retransmission takes time, and TCP also treats loss as a sign of congestion and reduces its sending rate, so even 1% packet loss compounds into significant throughput degradation. UDP-based applications like VoIP and video streaming typically don't retransmit; a lost packet is simply gone, manifesting as a click, dropout, or frozen frame. Even 0.5% sustained packet loss is noticeable in VoIP quality. Above 3–5%, most voice calls become unusable.
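One way to get a feel for how loss limits TCP is the widely cited Mathis approximation, which bounds a single flow's throughput at roughly (MSS / RTT) × (C / √p), where p is the loss rate and C is about 1.22. The sketch below applies it as a rough model only: it assumes classic loss-based congestion control, and the MSS, RTT, and loss values are illustrative, not measurements.

```python
import math

def mathis_throughput_mbps(mss_bytes, rtt_ms, loss_rate):
    """Approximate ceiling on one TCP flow's throughput under random loss."""
    if loss_rate <= 0:
        raise ValueError("loss_rate must be greater than 0 for this model")
    c = math.sqrt(3 / 2)                      # constant from the approximation
    bytes_per_sec = (mss_bytes / (rtt_ms / 1000)) * (c / math.sqrt(loss_rate))
    return bytes_per_sec * 8 / 1_000_000

# Illustrative values: 1460-byte MSS over a 50ms round trip
for loss in (0.0001, 0.001, 0.01):
    mbps = mathis_throughput_mbps(1460, 50, loss)
    print(f"loss {loss:.2%}: about {mbps:.1f} Mbps per flow")
```

Because throughput scales with 1/√p in this model, going from 0.01% loss to 1% loss cuts the per-flow ceiling by a factor of ten, regardless of how much bandwidth the circuit has.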

Bandwidth utilization measures what fraction of a link's capacity is actually being used. A 1Gbps uplink running at 950Mbps is not a stable configuration — at that utilization level, any burst of traffic triggers queuing delays and packet drops. A general guideline is to plan for alerts at 70–75% sustained utilization on critical links, with investigation triggered well before you approach saturation. Bandwidth utilization data is also the starting point for capacity planning: if your primary WAN circuit is growing 8% per month, you have a specific timeline before you need to upgrade it.
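The capacity-planning timeline is simple compound-growth arithmetic. As a sketch, assuming growth compounds monthly at a steady rate (the function name and the 45% starting utilization are hypothetical):

```python
import math

def months_until(current_util, target_util, monthly_growth):
    """Months of compound growth before current_util reaches target_util."""
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    if current_util >= target_util:
        return 0.0
    return math.log(target_util / current_util) / math.log(1 + monthly_growth)

# Hypothetical circuit: 45% utilized today, growing 8% per month,
# with a 75% sustained-utilization alert line as the planning target.
print(round(months_until(0.45, 0.75, 0.08), 1), "months")
```

In this example the circuit reaches the 75% line in a little under seven months, which is the window available to budget for and order an upgrade.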

SNMP Polling: Device-Level Visibility

Simple Network Management Protocol (SNMP) is the foundational technology for polling performance data from network devices — routers, switches, firewalls, access points, UPS units, and more. Nearly every enterprise-grade network device supports SNMP. A monitoring system sends periodic SNMP queries to each device, and the device responds with the current values of its Management Information Base (MIB) variables: interface counters, CPU utilization, memory usage, error counts, temperature sensors, and dozens of other data points depending on the device.

SNMPv3 is the current standard and should be the only version deployed in new configurations. SNMPv1 and SNMPv2c transmit community strings (effectively passwords) in cleartext and are trivially intercepted. SNMPv3 adds per-user authentication and, when configured at the authPriv security level, encryption. If your monitoring system is still querying devices with SNMPv2c community strings, that's a security gap that should be addressed in your next maintenance window.

The polling interval determines how granular your performance data is. A five-minute polling interval is the traditional default and is adequate for trend analysis and capacity planning. For environments where rapid detection of interface errors or CPU spikes is critical, one-minute polling provides more timely visibility — at the cost of higher device CPU load and more data storage. For most SMB environments, a two-minute polling interval balances granularity with overhead.

Key SNMP metrics to monitor on every network device:

  • Interface utilization (in and out) — in bytes/sec and as percentage of interface capacity
  • Interface error counters — CRC errors, runts, giants, input/output errors accumulating over time
  • Device CPU utilization — sustained high CPU on a router or switch indicates a routing or processing problem
  • Device memory utilization — memory exhaustion on a network device causes degraded forwarding performance
  • Interface operational status — link up/down state changes, with alerting on unexpected transitions
  • BGP/OSPF neighbor state (where applicable) — routing protocol adjacency loss is often the first indicator of a WAN or upstream problem
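Interface utilization is not read directly from the device; it is derived from the difference between two polls of an octet counter (for example, the 64-bit `ifHCInOctets` and `ifHCOutOctets` objects in IF-MIB). The sketch below shows that calculation on already-collected counter values. The function name and sample numbers are illustrative, and it does not perform any SNMP queries itself.

```python
def utilization_pct(prev_octets, curr_octets, interval_s, if_speed_bps,
                    counter_bits=64):
    """Percent utilization in one direction between two counter polls."""
    delta = curr_octets - prev_octets
    if delta < 0:
        # Counter wrapped past its maximum between polls. A device reboot
        # also resets counters, so a real collector should check uptime too.
        delta += 2 ** counter_bits
    bits_per_sec = delta * 8 / interval_s
    return 100 * bits_per_sec / if_speed_bps

# Hypothetical polls 120 seconds apart on a 1Gbps interface
print(utilization_pct(0, 9_000_000_000, 120, 1_000_000_000))

# A 32-bit counter (ifInOctets) that wrapped between polls
print(utilization_pct(2**32 - 1000, 500, 1, 12_000, counter_bits=32))
```

The wrap case is the reason 64-bit counters are preferred on fast links: a 32-bit octet counter on a busy 1Gbps interface can wrap in well under a minute, which is shorter than most polling intervals.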

NetFlow and sFlow: Understanding Traffic Composition

SNMP tells you how much traffic is flowing through an interface. NetFlow and sFlow tell you what that traffic is. This distinction matters enormously for troubleshooting and security.

NetFlow (originally a Cisco technology; NetFlow v9 is the basis for the IETF's IPFIX standard) exports records of every IP flow that transits a device: source IP, destination IP, source port, destination port, protocol, byte count, and packet count. A NetFlow collector aggregates these records and allows you to query them: "Show me the top 10 talkers on this interface for the last hour" or "Show me all traffic to external IP 203.0.113.45 in the last 24 hours." NetFlow is invaluable for identifying bandwidth hogs, detecting anomalous external connections, verifying that QoS policies are actually classifying traffic correctly, and performing post-incident traffic analysis.
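A "top talkers" query is, at its core, a group-and-sum over flow records. The following sketch assumes the records have already been decoded by a collector into dictionaries; the field names (`src_ip`, `dst_ip`, `dst_port`, `bytes`) and the sample flows are hypothetical and do not reflect any particular collector's schema.

```python
from collections import Counter

def top_talkers(flows, n=10):
    """Sum bytes per source IP and return the n largest as (ip, bytes) pairs."""
    totals = Counter()
    for flow in flows:
        totals[flow["src_ip"]] += flow["bytes"]
    return totals.most_common(n)

# Hypothetical decoded flow records
flows = [
    {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.45", "dst_port": 443, "bytes": 1_200_000},
    {"src_ip": "10.0.0.8", "dst_ip": "198.51.100.7", "dst_port": 443, "bytes": 300_000},
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.7", "dst_port": 53, "bytes": 50_000},
    {"src_ip": "10.0.0.9", "dst_ip": "203.0.113.45", "dst_port": 443, "bytes": 700_000},
]

for ip, total in top_talkers(flows, n=2):
    print(ip, total)
```

Filtering the same list on `dst_ip` before summing gives the "all traffic to one external IP" style of query.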

sFlow is a packet sampling technology that achieves similar visibility at lower device overhead by exporting a statistical sample of packets (typically 1 in 1,000 or 1 in 512) rather than recording every flow. sFlow is common on switches and non-Cisco networking hardware. The trade-off is that sFlow data is probabilistic rather than exact — it provides excellent visibility into traffic patterns and top talkers, but should not be used for precise flow-by-flow accounting where exactness is required.
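The probabilistic nature of sampled data can be quantified. The sketch below scales sampled frame sizes up by the sampling rate and applies a rule of thumb commonly cited in sFlow sampling guidance, in which the 95% confidence error for a traffic class is at most about 196 × √(1/c) percent, where c is the number of samples seen for that class. Treat both function names and the figures as illustrative assumptions.

```python
import math

def estimate_total_bytes(sampled_frame_bytes, sampling_rate):
    """Scale sampled frame sizes up to an estimate of total bytes."""
    return sum(sampled_frame_bytes) * sampling_rate

def max_error_pct(sample_count):
    """Rule-of-thumb upper bound on relative error at 95% confidence."""
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return 196 * math.sqrt(1 / sample_count)

# Hypothetical: four 1,500-byte frames sampled at 1 in 1,000
print(estimate_total_bytes([1500, 1500, 1500, 1500], 1000), "bytes (estimate)")

for c in (100, 10_000):
    print(c, "samples -> error up to about", round(max_error_pct(c), 2), "%")
```

This is why sampled data is reliable for top talkers, which accumulate many samples quickly, and unreliable for small or short-lived flows, which may be sampled only a handful of times or not at all.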

NetFlow/sFlow analysis is also a first-line security tool. Sudden appearance of large outbound flows to unfamiliar IP ranges, lateral movement between internal hosts that don't normally communicate, or consistent beaconing to a specific external destination at regular intervals are all patterns that show up clearly in flow data — and that are invisible to SNMP-only monitoring.
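Beaconing stands out in flow data because the gaps between connections are unusually regular. A minimal sketch of that check, assuming you already have the start times of flows from one internal host to one external destination, is to compare the spread of the gaps to their average (the coefficient of variation). The function name, the 0.1 cutoff, and the sample timestamps are assumptions chosen for illustration; real detection needs tuning and additional context.

```python
from statistics import mean, pstdev

def looks_like_beaconing(timestamps_s, max_cv=0.1, min_events=5):
    """True if connection start times are spaced at near-constant intervals."""
    if len(timestamps_s) < min_events:
        return False
    ts = sorted(timestamps_s)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    avg = mean(gaps)
    if avg == 0:
        return False
    # Low variation relative to the mean gap suggests a timer, not a person
    return pstdev(gaps) / avg <= max_cv

# Hypothetical flow start times in seconds
regular = [0, 300, 601, 899, 1200, 1500]     # roughly every five minutes
irregular = [0, 40, 700, 710, 2000, 2100]    # human-looking browsing

print(looks_like_beaconing(regular))
print(looks_like_beaconing(irregular))
```

Legitimate software such as update checkers and monitoring agents also connects on timers, so a match is a lead to investigate rather than proof of compromise.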

Security note: More than 60% of network security incidents involve patterns that appear in NetFlow data before any endpoint security tool triggers an alert. Flow analysis is not just a performance tool — it's a critical layer of your security visibility stack.

Syslog Aggregation: The Event Layer

Every network device continuously generates log messages describing its operational state: authentication events, configuration changes, interface state transitions, routing protocol events, firewall rule matches, DHCP lease activity, and error conditions. Syslog is the standard protocol for shipping these messages from devices to a central collection point.

Without centralized Syslog aggregation, log data lives fragmented across individual devices with limited storage and no cross-device correlation capability. With a central Syslog server or SIEM, you can search all device logs simultaneously, correlate events across devices (a firewall block event followed by a VPN authentication attempt from the same source IP, for example), and maintain an audit trail of every configuration change on every device.
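The firewall-then-VPN correlation mentioned above can be sketched as follows, assuming the log messages have already been parsed into dictionaries. The event types (`fw_block`, `vpn_auth`), field names, device names, and the five-minute window are all hypothetical; a SIEM would express the same logic as a correlation rule.

```python
from datetime import datetime, timedelta

def correlate(events, window=timedelta(minutes=5)):
    """Find firewall blocks followed by a VPN auth attempt from the same IP."""
    last_block = {}   # src_ip -> time of the most recent firewall block
    matches = []
    for ev in sorted(events, key=lambda e: e["time"]):
        if ev["type"] == "fw_block":
            last_block[ev["src_ip"]] = ev["time"]
        elif ev["type"] == "vpn_auth":
            blocked_at = last_block.get(ev["src_ip"])
            if blocked_at is not None and ev["time"] - blocked_at <= window:
                matches.append((ev["src_ip"], blocked_at, ev["time"]))
    return matches

# Hypothetical parsed events from two different devices
day = datetime(2024, 1, 15)
events = [
    {"time": day.replace(hour=9, minute=0), "device": "fw01", "type": "fw_block", "src_ip": "198.51.100.23"},
    {"time": day.replace(hour=9, minute=2, second=30), "device": "vpn01", "type": "vpn_auth", "src_ip": "198.51.100.23"},
    {"time": day.replace(hour=9, minute=3), "device": "vpn01", "type": "vpn_auth", "src_ip": "203.0.113.9"},
    {"time": day.replace(hour=9, minute=0), "device": "fw01", "type": "fw_block", "src_ip": "192.0.2.50"},
    {"time": day.replace(hour=9, minute=30), "device": "vpn01", "type": "vpn_auth", "src_ip": "192.0.2.50"},
]

for src_ip, blocked_at, auth_at in correlate(events):
    print(src_ip, "blocked at", blocked_at.time(), "then VPN attempt at", auth_at.time())
```

Only the first source IP matches here: the second never hit the firewall, and the third waited longer than the window. Neither device's local log could show this pattern on its own.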

For compliance-driven environments — businesses subject to PCI DSS, HIPAA, or similar frameworks — centralized log retention is not optional. PCI DSS requires at least 12 months of log retention with 3 months immediately available for analysis. Meeting that requirement through fragmented per-device logs is impractical; centralized Syslog collection makes it straightforward.

Alerting Thresholds: Calibrating for Signal, Not Noise

A monitoring system with poorly calibrated alerts is worse than no monitoring system. Alert fatigue — the state where the team ignores alerts because they're so frequent and so often false positives — is one of the most common and most dangerous failure modes in network operations. When every threshold breach generates a page or a ticket, real events are lost in the noise.

Effective alerting thresholds share several characteristics:

  • Baseline-relative, not absolute. An alert set at "CPU over 80%" will fire every time your router processes a large routing table update — which may be perfectly normal. An alert set at "CPU over 80% for more than 10 consecutive minutes" or "CPU over baseline by more than 40 percentage points" is far more meaningful.
  • Sustained, not instantaneous. Nearly every metric spikes briefly for legitimate reasons. Alerting on a single data point produces false positives. Alerting when a threshold is exceeded for multiple consecutive polling intervals eliminates the noise while still detecting real problems.
  • Severity-tiered. Not every threshold breach warrants a 3 AM page. A WAN circuit at 75% utilization is a trend to investigate tomorrow. A WAN circuit at 98% utilization with packet loss is a call-now event. A core switch that has gone unreachable is wake-someone-up severity. Your alerting system should match notification urgency to actual business impact.
  • Time-aware. A bandwidth utilization spike during business hours may be concerning. The same spike at 2 AM is more concerning, because it falls outside the pattern of normal business activity and may indicate a security event or unauthorized usage.

Metric                       Warning Threshold        Critical Threshold
---------------------------  -----------------------  --------------------------
WAN circuit utilization      70% sustained 10 min     90% sustained 5 min
Latency (LAN)                >5ms average             >20ms or 3x baseline
Latency (WAN/Internet)       >2x baseline             >3x baseline or >150ms
Packet loss                  >0.5% sustained          >2% sustained
Jitter (VoIP environments)   >10ms                    >20ms
Device unreachable           1 failed poll            3 consecutive failed polls
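The "sustained" and "severity-tiered" ideas combine naturally in code: each tier has a threshold and a duration, and a tier fires only when every recent sample inside that duration breaches the threshold. Here is a minimal sketch using the WAN utilization thresholds (70% for 10 minutes, 90% for 5 minutes) with a two-minute polling interval. The function name, rule format, and sample data are assumptions for illustration, not the configuration syntax of any monitoring product.

```python
def classify(samples_pct, poll_interval_min, rules):
    """Return the most severe tier breached for its full sustain period."""
    for name, threshold, sustain_min in rules:     # most severe tier first
        # Number of consecutive polls needed to cover the sustain period
        needed = max(1, -(-sustain_min // poll_interval_min))  # ceiling division
        recent = samples_pct[-needed:]
        if len(recent) == needed and all(s >= threshold for s in recent):
            return name
    return "ok"

# (tier, threshold %, minutes sustained)
WAN_RULES = [("critical", 90, 5), ("warning", 70, 10)]

print(classify([50, 72, 75, 71, 78, 74], 2, WAN_RULES))   # steady load above 70%
print(classify([60, 95, 40, 96], 2, WAN_RULES))           # brief spikes only
print(classify([80, 85, 91, 95, 97], 2, WAN_RULES))       # sustained above 90%
```

The second case is the important one: two readings above 90% would page someone under a single-sample rule, but here they are correctly treated as transient.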

IT Center 24/7 NOC: Inside-Out and Outside-In Monitoring

Standard monitoring tools tell you what’s happening inside your network. The IT Center NOC pairs internal SNMP, NetFlow, and syslog telemetry with an outside-in perspective — monitoring how your network and internet-facing services appear from external vantage points and correlating both views to give a complete operational picture.

External probes continuously test your internet circuit performance — latency, packet loss, and throughput — from multiple geographically distributed measurement points. This external perspective is critical because it separates problems that are internal to your network from problems that originate with your ISP. When users report that “the internet is slow,” the question is always whether the problem is your LAN, your WAN router, your ISP’s network, or a BGP routing issue somewhere upstream. Our NOC stack answers that question definitively, within minutes of the problem starting, without waiting for your ISP’s support queue.

The same monitoring layer covers service availability for your critical external dependencies — your hosted applications, cloud platforms, and SaaS services — and correlates availability events with your internal performance metrics. When your cloud-hosted ERP becomes unreachable, our engineers determine whether the problem is your internet circuit, a DNS resolution failure, the hosting provider’s infrastructure, or a BGP routing incident, and alert your team with that context already in hand.

Proactive vs. Reactive: The Business Case

The argument for proactive monitoring is simple arithmetic. The average cost of unplanned network downtime for a small business (counting lost productivity, potentially missed revenue, and the emergency labor to diagnose and fix the problem) is often estimated at several thousand dollars per hour. A proactive monitoring platform that catches a degrading WAN circuit before it fails, catches an interface accumulating CRC errors before it causes packet loss, or catches a switch approaching thermal limits before it shuts down can prevent that cost before it is incurred.

The businesses that invest in proactive network monitoring also tend to have better conversations with their ISPs and vendors. When you can present a timestamped graph showing that your circuit's latency increased from 12ms to 87ms starting at 2:14 PM on a specific date and has been elevated ever since, you get faster resolution than the business that calls and says "our internet has been slow for a few days." Data is leverage in those conversations.

Get Proactive Network Monitoring for Your Business

IT Center’s 24/7 NOC and managed network monitoring services give Southern California businesses the visibility they need to stop reacting to network problems and start preventing them. Ask us about our monitoring coverage and how it integrates with our network infrastructure services.


