Latency! And how to identify and troubleshoot it

jdryan
Moderator

Latency can occur in a network at any moment. Because end-user pools vary widely, the impact can also vary in scale, so keeping a close eye on it is vital.

Since a typical deployment contains devices from various vendors, the troubleshooting and isolation procedures in this article are written to be vendor-neutral.

 

When slowness of traffic is reported, the first step is isolation:

  • Is the latency on traffic bound towards the internet?
  • Is the latency on local site traffic?

 

1> If the latency is seen:

When accessing data or an application hosted remotely on a cloud server or at another site
When connecting to the internet, and web pages load slowly

This would be external to the network.

Here, step-by-step elimination is needed to find out at what point the latency appears.

In this scenario it would generally appear at or above the core switch level, moving towards the internet connection [the ISP's CPE].

The bandwidth provided by the ISP can also contribute to this, if the user base depends heavily on remotely hosted data.
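A quick first pass for this elimination is a hop-by-hop trace from an affected host towards a stable public destination (1.1.1.1 is used here purely as an illustration); the hop at which the round-trip times jump, and stay high on the subsequent hops, is roughly where the latency begins.

tracert -d 1.1.1.1               (Windows)
mtr -rwc 20 1.1.1.1              (Linux/macOS, if mtr is installed; prints per-hop loss and latency over 20 cycles)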

 

Gauging this can be done using:

  • A remotely hosted iperf server [there are a few that are available publicly: Click here]
    This can help with bandwidth assessment; a sample client run is sketched below.
  • https://speed.cloudflare.com/ : this can help gauge the ISP connection for latency,
    as it gives a detailed statistical output of:
    Packet loss measurement
    Latency measurements
    Jitter
    Download and upload measurements,
    including uplink and downlink speeds on the connection.
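As a minimal sketch of the iperf option (the server name is a placeholder, to be taken from the list linked above; the port and duration are typical defaults and may differ per server):

iperf3 -c <public-iperf-server> -p 5201 -t 20         (upload towards the server)
iperf3 -c <public-iperf-server> -p 5201 -t 20 -R      (-R reverses the direction, i.e. download towards the client)

If the measured throughput is consistently far below the subscribed ISP bandwidth while the local network tests clean, the bottleneck is likely on the ISP side.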

 

2> If the latency is seen:

When accessing data or an application hosted locally on site
When connecting to the application or servers, the data loads slowly.

Note:

For "When connecting to the application or servers, the data loads slowly", the given conditions are:
That the server or set of servers is not burdened with too many concurrent users
And that the servers in the setup have the specifications to handle the given user load and function well under that load.

This would be internal to the network.

Here the isolation would need to be followed in the below manner:

  • Are the user and the resource being connected to located in the same subnet or across subnets?
  • If in the same subnet:
    • Check whether there is any trouble on the endpoints.
    • Check whether any traffic suppression is going on, such as excessive broadcast and more.
  • If across subnets:
    • First check for traffic suppression and/or redirection.
      Redirection can at times happen when working across north-south bound traffic.
  • During what operation is the latency seen? To name a few:
    During a file transfer
    During a remote access/RDP session
    During video conferencing / VoIP calls

 

This would help us identify the type and nature of the traffic being impacted; a quick ping comparison, sketched below, also helps with the same-subnet versus cross-subnet question.
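As a minimal sketch (the addresses are placeholders for your own subnet layout), compare from an affected host:

ping 10.10.20.25          (a host or server in the same subnet)
ping 10.10.20.1           (the default gateway)
ping 10.10.30.25          (a host or server in another subnet)

If the latency only shows up on the cross-subnet ping, the focus shifts to the L3 path (gateway, core, firewall or any redirection); if it is already present within the same subnet, look at the endpoints and the access layer first.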

  • If during a data transfer / file transfer:

In this case, check whether any excessive resource usage and/or frame loss is being seen during the operation.
If the above comes back clean and the latency is still seen, check the traffic flow pattern.
Traffic suppression and/or redirection can cause retransmits to be seen; these are usually the tell-tale indicators here (see the capture sketch below).
Once this pattern is seen and established, appropriate measures need to be employed.
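One way to spot those retransmits is a packet capture on the affected host or on a SPAN/mirror port; a minimal sketch with tshark, the command-line counterpart of Wireshark (the interface name and capture file are assumptions):

tshark -i eth0 -Y "tcp.analysis.retransmission || tcp.analysis.duplicate_ack"
tshark -r capture.pcap -q -z io,stat,1,"tcp.analysis.retransmission"        (counts retransmissions per second in a saved capture)

A steady stream of retransmissions and duplicate ACKs during the transfer points to loss or reordering somewhere along the path rather than at the endpoints.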

 

  • If during a remote access or RDP session:
    • This is an example of an application-specific concern.

If there are multiple users/RDP sessions in progress, there is a chance you will see disconnections/timeouts/interruptions/drops etc.

On the LAN devices you will see drop counters increment on the interface,
more specifically on the interface queues.
This is usually seen on Q0, as this is where all default traffic goes;
RDP traffic will by default be processed on / traverse Q0.

In this case, as it comes down to a specific queue:

You can try setting a minimum guaranteed bandwidth for the interface queue; this should re-allocate buffer bandwidth to the necessary queues.
This option is provided by most vendors: here's ours

If that does not help, then QoS re-marking would be the other effective option for prioritizing this traffic over the network (a sketch follows below).
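The exact re-marking syntax is vendor-specific; as a vendor-neutral sketch, on a Linux-based gateway RDP traffic (TCP 3389) could be re-marked into a higher class like this (AF31 is only an example and should match the QoS scheme already in use):

iptables -t mangle -A FORWARD -p tcp --dport 3389 -j DSCP --set-dscp-class af31
iptables -t mangle -A FORWARD -p tcp --sport 3389 -j DSCP --set-dscp-class af31

On a managed switch the equivalent is typically an ACL/class-map matching TCP port 3389, tied to a policy that sets the DSCP value, applied where the RDP traffic enters the network.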

 

  • If during video conferencing / VoIP calls:
    • This is an example of a traffic-specific concern.

Here it leans more towards the network side: with VoIP networks, QoS is usually already in place.

Hence, it is comparatively rare for a VoIP network to run into this.

However, with video conferencing applications like Teams and Zoom this can happen.

Here, traditional QoS settings may work to an extent, but not completely, as this is application traffic.

Hence we explore two options to make effective use of the underlying LAN's QoS/CoS features.

1> The easy approach: have the application mark the DSCP on the exiting traffic.
Zoom has this option, and with admin privilege it can be enabled for the organization.
click here

Once this traffic is marked, have the network trust it: this will ensure the traffic receives proper priority when being forwarded across the network.

2> Network-based approach:

Here the application will not have a feature to mark the traffic.
However, there will be a set of TCP/UDP ports that the application uses for each task.
Teams falls into this category, as it uses certain ranges of TCP/UDP ports for
audio, video and content sharing.
click here
Based on this, on the network where all traffic converges (usually at the core),
have an L3 QoS DSCP policy set that marks the respective traffic based on these ports,
and have the rest of the network devices trust the data stream coming from the core (a sketch follows below).
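As a minimal sketch, again on a Linux-based gateway standing in for the vendor-specific core-switch syntax (the UDP port ranges shown are the commonly documented Teams client ranges and must be verified against the documentation linked above; the DSCP classes follow the usual audio/video/sharing convention):

iptables -t mangle -A FORWARD -p udp --sport 50000:50019 -j DSCP --set-dscp-class ef       (audio)
iptables -t mangle -A FORWARD -p udp --sport 50020:50039 -j DSCP --set-dscp-class af41     (video)
iptables -t mangle -A FORWARD -p udp --sport 50040:50059 -j DSCP --set-dscp-class af21     (content sharing)

Mirror rules for the return direction (matching on --dport) would be added the same way; on an actual core switch this becomes an ACL/class-map and policy pair applied at the convergence point.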

 

Other important bits that can help identify latency:
After the above isolation is done to understand whether the latency seen is external or internal in nature, below are a few ways to find out whether latency is turning up in the network.

  • Ping stats:

ping 1.1.1.1 -n 10

Pinging 1.1.1.1 with 32 bytes of data:
Reply from 1.1.1.1: bytes=32 time=46ms TTL=56
Reply from 1.1.1.1: bytes=32 time=47ms TTL=56
Reply from 1.1.1.1: bytes=32 time=46ms TTL=56
Reply from 1.1.1.1: bytes=32 time=47ms TTL=56
Reply from 1.1.1.1: bytes=32 time=44ms TTL=56
Reply from 1.1.1.1: bytes=32 time=45ms TTL=56
Reply from 1.1.1.1: bytes=32 time=44ms TTL=56
Reply from 1.1.1.1: bytes=32 time=52ms TTL=56
Reply from 1.1.1.1: bytes=32 time=143ms TTL=56
Reply from 1.1.1.1: bytes=32 time=50ms TTL=56

Ping statistics for 1.1.1.1:
    Packets: Sent = 10, Received = 10, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 44ms, Maximum = 143ms, Average = 56ms

 

Here, in a sample space of ten pings:

the highest response time seen is the outlier at 143 ms,
the lowest response time seen is 44 ms,
and the sample space averages out at 56 ms.

The things to look out for are the lowest/minimum response time and the average response time.
These should not be high: if either of them is high, it does indicate there is latency.

The maximum is selected from the highest of all data points, so a single outlier (like the 143 ms reply above) can inflate it,
while the minimum is selected from the lowest of all data points.
Hence the lowest, together with the average, is the better measure.

NOTE:
It is suggested that pings for latency detection be done from host to host, and not from or to a switch/AP/router, since switches/APs/routers process pings differently (ICMP destined to the device itself is handled by its control plane, usually at lower priority).
A ping to such a device can still be used to check whether it is responding or not.

 

  • Iperf test:

This helps gauge the bandwidth seen by the end client through the network.
It also gives a fair understanding of whether traffic traversing a specific direction faces any choke points.

Once this is done, checking the path link by link also helps, as the bandwidth between two endpoints can be limited by the bandwidth of the interconnecting links as well as by the devices themselves. A sample host-to-host run is sketched below.
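A minimal host-to-host sketch (the server address is a placeholder; the stream count and duration are just reasonable starting values):

iperf3 -s                                        (on the far-end host, starts the server on port 5201)
iperf3 -c 10.10.20.50 -t 30 -P 4                 (on the near-end host, TCP throughput with 4 parallel streams)
iperf3 -c 10.10.20.50 -t 30 -P 4 -R              (same test in the reverse direction)
iperf3 -c 10.10.20.50 -u -b 50M                  (UDP test; the report also includes jitter and packet loss)

Running the same pair of tests across different segments of the path helps pin down which link or device is the choke point.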

 

  • Network assessment:

This would encompass checking for redirection-based issues, retransmits,
traffic suppression, and impact due to broadcast storms and loops [physical and logical], etc.
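For the broadcast-storm part, a quick gauge is to count broadcast frames over a short capture window; a minimal tshark sketch (the interface name and window length are assumptions):

tshark -i eth0 -a duration:30 -q -z io,stat,5,"eth.dst == ff:ff:ff:ff:ff:ff"

This prints the number of broadcast frames seen in 5-second buckets over 30 seconds; a sustained, very high rate on an access port is a hint towards a storm or a loop.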

 

Abbreviations:
ISP > Internet Service Provider
CPE > Customer Premises Equipment
