Network Performance Troubleshooting: Fix Delays and Packet Loss

Network administrators inevitably spend a significant portion of their time on network performance troubleshooting, particularly when dealing with sluggish servers and other network endpoints. However, a network that feels slow to users does not necessarily indicate a bandwidth problem. To solve network performance problems, we start with TCP error recovery (TCP retransmission and duplicate ACKs) and flow control. Then we explain how to find the source of slow network traffic. Finally, we analyze the data flow on each component of the network. Together, these topics help readers identify, diagnose, and troubleshoot slow network traffic effectively.

TCP error recovery function:

TCP’s error recovery features are among the best tools for locating, diagnosing, and fixing network delays. Latency can be measured one way or as a round trip. High latency is the number one enemy of network administrators. In this section, we discuss how high latency and packet loss show up in TCP as out-of-order sequence and acknowledgment numbers.

TCP Retransmission:

Retransmitting packets is the most basic error recovery function of TCP, and its purpose is to guard against packet loss.

There are many possible reasons for packet loss, including application failure, overloaded routing equipment, or temporary service downtime. Packets move very quickly and packet loss is usually temporary, so it is particularly important for TCP to be able to detect and recover from it.

The main mechanism for deciding whether a packet needs to be retransmitted is the retransmission timer, which maintains the retransmission timeout (RTO) value. When a packet is transmitted using TCP, the retransmission timer starts; it stops when the ACK for that packet is received. The time from when a packet is sent to when its ACK is received is called the round trip time (RTT). Several RTT measurements are averaged to determine the final RTO value. Until that value is determined, the sender relies on a default timeout to decide whether each transmitted packet has been lost. The following figure illustrates the TCP retransmission process.

[Figure: TCP retransmission process]
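The step from RTT measurements to an RTO value can be sketched in code. The text above only says that several RTTs are averaged; the sketch below assumes the smoothed estimator standardized in RFC 6298, which real stacks follow with their own tuning. The class name `RtoEstimator`, the sample values, and the omission of the clock-granularity term are simplifications made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RtoEstimator:
    """Simplified RTO estimator using the RFC 6298 update rules (assumption)."""
    alpha: float = 1 / 8   # gain applied to the smoothed RTT
    beta: float = 1 / 4    # gain applied to the RTT variance
    min_rto: float = 1.0   # RFC 6298 lower bound; many stacks use a smaller value
    srtt: Optional[float] = None
    rttvar: Optional[float] = None

    def update(self, rtt_sample: float) -> float:
        """Feed one RTT measurement in seconds and return the new RTO."""
        if self.srtt is None:
            # The first measurement seeds both estimates.
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            # The variance is updated first, using the previous smoothed RTT.
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt_sample)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt_sample
        return max(self.min_rto, self.srtt + 4 * self.rttvar)


# Hypothetical RTT samples in seconds, not taken from the capture in this article.
estimator = RtoEstimator(min_rto=0.2)
for sample in (0.100, 0.120, 0.090):
    print(f"RTT {sample:.3f}s -> RTO {estimator.update(sample):.3f}s")
```

The weighted average reacts slowly to a single unusual sample, while the variance term widens the timeout when RTTs fluctuate, so a jittery path gets a more forgiving RTO than a stable one.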

If the sender transmits a packet and the retransmission timer expires before the receiver’s ACK arrives, the sender assumes that the original packet was lost and retransmits it. After the retransmission, the RTO value is doubled; if no ACK arrives before the doubled RTO expires, the packet is retransmitted again and the RTO is doubled once more. This continues, with the RTO doubling for each retransmission, until an ACK is received or the sender reaches its configured maximum number of retransmissions.

The maximum number of retransmissions depends on the configuration of the sending operating system. By default, Windows hosts retransmit up to 5 times, and most Linux systems default to a maximum of 15. Both values are configurable.
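As a rough illustration of the doubling behavior, the sketch below lists the timeout a sender would wait before each retransmission. The function name `backoff_schedule` and the 0.2-second starting value are assumptions chosen for the example; real stacks also cap the RTO at an upper bound, which is left out here.

```python
def backoff_schedule(initial_rto: float, max_retransmissions: int) -> list:
    """Return the timeout waited before each retransmission, doubling every time."""
    timeouts = []
    rto = initial_rto
    for _ in range(max_retransmissions):
        timeouts.append(rto)
        rto *= 2  # exponential backoff after every unanswered transmission
    return timeouts


# Hypothetical sender: 0.2 s initial RTO, giving up after 5 retransmissions.
print(backoff_schedule(0.2, 5))  # [0.2, 0.4, 0.8, 1.6, 3.2]
```

Because the waits add up quickly, a connection that exhausts its retransmissions can stall for far longer than the initial RTO suggests.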
An example is shown below:

The first packet sent during TCP retransmission is shown in the following figure:

This is a TCP PSH/ACK message①, containing 648 bytes of data②, sent from 10.3.30.1 to 10.3.71.7. This is a typical data message.

Normally, a TCP ACK is received soon after the first packet is sent. In this case, however, the second packet is a retransmission. You can see this in the Packet List panel: the Info column clearly states “TCP Retransmission”, and the packet is shown in red text on a black background. The following figure is an example of a retransmission in the Packet List panel:

You can also check in the Packet Details and Packet Bytes panels to determine whether it is a retransmitted message, as shown in the following figure:

Note that this message is identical to the original message (except for the IP identifier and checksum field). To verify this, compare the Packet Bytes① of the two messages.

In the Packet Details panel, notice that the retransmitted packet has some additional information under SEQ/ACK Analysis②. This information is added by Wireshark and is not contained in the packet itself. SEQ/ACK Analysis tells us that this is indeed a retransmitted packet and that the RTO value is 0.206 seconds. This RTO is measured as the time elapsed since packet 1.

Checking the remaining packets will give similar results, with the only differences being the IP identifier and checksum, as well as the RTO value. To visualize the time interval between packets, look at the Time column in the Packet List panel, as shown in the figure below. Here you can see the doubling relationship of the RTO value.
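The same check can be done numerically once the Time column values are exported. The sketch below is only an illustration: `gap_ratios` is a hypothetical helper, and the timestamps are made-up values chosen to show clean doubling, not the ones from this capture.

```python
def gap_ratios(timestamps):
    """Return the gaps between consecutive packets and the ratio of each gap to the previous one."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    ratios = [later / earlier for earlier, later in zip(gaps, gaps[1:])]
    return gaps, ratios


# Made-up arrival times in seconds for an original packet and four retransmissions.
times = [0.0, 0.25, 0.75, 1.75, 3.75]
gaps, ratios = gap_ratios(times)
print(gaps)    # [0.25, 0.5, 1.0, 2.0]
print(ratios)  # [2.0, 2.0, 2.0]
```

Ratios close to 2 between consecutive gaps are what the doubling RTO should produce; in a real capture they will only be approximately 2.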

TCP duplicate ACK and fast retransmit:

Duplicate ACKs are a type of TCP message sent when the receiver receives out-of-order messages. TCP uses the sequence number and acknowledgment number in the message header to effectively ensure that data is received and reassembled in the order it was sent.

When a TCP connection is established, one of the most important pieces of information exchanged during the handshake is the initial sequence number (ISN). Once both parties have set their ISNs, the sequence number of each subsequent packet increases by the size of the data payload carried in the packet before it.

Assume that a host with an ISN of 5000 sends a 500-byte message to the receiver. Once the message is received, the receiver replies with a TCP ACK message with an ACK number of 5500, based on the following formula:

Sequence Number In + Bytes of Data Received = Acknowledgment Number Out
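The formula can be written as a one-line function. This is a minimal sketch: the name `expected_ack` is chosen for illustration, and the modulo reflects the fact that TCP sequence numbers are 32-bit values that wrap around.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence and acknowledgment numbers are 32-bit


def expected_ack(seq_number: int, payload_len: int) -> int:
    """Acknowledgment number a receiver returns for a segment received in order."""
    return (seq_number + payload_len) % SEQ_SPACE


print(expected_ack(5000, 500))  # 5500, matching the example in the text
```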

According to this calculation, the acknowledgment number returned to the sender is actually the next sequence number that the receiver expects to receive. An example is shown in the following figure:

The data receiver uses sequence numbers to check for packet loss. By tracking the sequence numbers it receives, the receiver can tell when they arrive out of order. When the receiver sees an unexpected sequence number, it assumes that a packet was lost in transit. To reassemble the data correctly, the receiver needs the lost packet, so it resends an ACK carrying the sequence number it expected, prompting the sender to retransmit that packet.

When the sending host receives three duplicate ACKs from the receiver, it assumes that the packet was indeed lost in transit and immediately sends a fast retransmit. Once a fast retransmit is triggered, all other packets waiting to be transmitted are queued until the fast retransmit packet has been sent. The process is shown in the following figure:
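The receiver-side behavior can be sketched as a small simulation. This is a simplified model under stated assumptions: `receiver_acks` is a hypothetical function, segments are `(sequence number, length)` pairs, out-of-order data is buffered, and sequence number wraparound and selective acknowledgments are ignored.

```python
def receiver_acks(expected_seq, segments):
    """Return the ACK number sent for each arriving (seq, length) segment."""
    buffered = {}  # out-of-order segments held until the gap is filled
    acks = []
    for seq, length in segments:
        if seq == expected_seq:
            expected_seq += length
            # Drain any buffered segments that are now contiguous.
            while expected_seq in buffered:
                expected_seq += buffered.pop(expected_seq)
        elif seq > expected_seq:
            buffered[seq] = length  # a gap exists: keep the data, ACK stays put
        acks.append(expected_seq)   # always ACK the next byte still expected
    return acks


# Hypothetical flow: the 100-byte segment at 1000 is lost and arrives last.
arrivals = [(1100, 100), (1200, 100), (1300, 100), (1000, 100)]
print(receiver_acks(1000, arrivals))  # [1000, 1000, 1000, 1400]
```

Each out-of-order arrival repeats the same acknowledgment number, and once the missing segment shows up the ACK jumps past everything that was buffered in the meantime.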

The first message in this example is as follows:

This is a TCP ACK, sent from the data receiver (172.31.136.85) to the sender (195.81.202.68)①, acknowledging the data sent in the previous packet.
The acknowledgment number in this packet is 1310973186②, which should be the sequence number of the next packet received, as shown in the following figure:

Unfortunately, the sequence number of the next packet that arrives at the receiver is 1310984130①, which is not the expected value. This means that a packet was lost in transit. The receiver notices that the packets are out of order and sends a duplicate ACK as the third packet, as shown in the following figure:

This can be confirmed as a duplicate ACK in one of two ways:

In the Info column of the Packet List panel, the packet is displayed in red text on a black background.

In the Packet Details panel, under SEQ/ACK Analysis. Expand this section and you will find that the packet is identified as a duplicate ACK.

The process repeats for the next few packets, as shown in the figure below:

The fourth message in this file is another data block with the wrong sequence number ① sent by the sender. Therefore, the receiver sends a second duplicate ACK ②. The receiver receives another out-of-order message ③. This triggers the third and final duplicate ACK ④.

Once the sender receives the third duplicate ACK from the receiver, it suspends all other transmissions and retransmits the lost packet. The following figure shows the fast retransmission process:

The retransmitted packet can also be identified in the Info column of the Packet List panel, where it is displayed in red text on a black background. The SEQ/ACK Analysis section of this packet tells us that this may be a fast retransmit frame. (The information that identifies the packet as a fast retransmit is not contained in the packet itself; it is added by Wireshark.) The last packet is the ACK received for the fast retransmit.
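The sender-side rule can be sketched as a duplicate ACK counter. This is a simplified illustration: `fast_retransmit_points` is a hypothetical function, and it compares acknowledgment numbers only, whereas a real TCP stack also checks conditions such as the ACK carrying no data.

```python
DUP_ACK_THRESHOLD = 3  # fast retransmit fires on the third duplicate ACK


def fast_retransmit_points(acks):
    """Return (index, ack_number) pairs where a sender would fast retransmit."""
    triggers = []
    last_ack = None
    dup_count = 0
    for index, ack in enumerate(acks):
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                # The segment starting at this ACK number is presumed lost.
                triggers.append((index, ack))
        else:
            last_ack = ack
            dup_count = 0
    return triggers


# The original ACK from this example followed by three duplicates.
acks_seen = [1310973186] * 4
print(fast_retransmit_points(acks_seen))  # [(3, 1310973186)]
```

The first ACK in the list is the original acknowledgment; only the three repeats after it count as duplicates, which is why the trigger lands on the fourth ACK.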

Summary

This guide covers network performance troubleshooting with a focus on TCP error recovery and flow control as a way to diagnose slow networks, and it stresses that sluggish performance is not always a bandwidth problem. It begins with TCP error recovery, showing how TCP retransmissions and duplicate ACKs help diagnose and fix network delays. It explains the role of the retransmission timer and how the round trip time (RTT) determines the retransmission timeout (RTO) value. It also covers how duplicate ACKs and fast retransmit work together to keep data complete and in order. With these mechanisms in mind, network administrators can more effectively identify, diagnose, and troubleshoot network performance issues.