Selective Packet Loss Troubleshooting on International Lines

Background

First, a word on what I mean by selective (or targeted) packet loss. The term is my own label for a class of problem scenarios, not a professional one. An MTU problem is one example: packets that exceed a fixed size limit cannot be transmitted normally.

This article introduces one such special packet loss scenario. After a new international line went live, an application developer found that he could not connect to the server. Suspecting network packet loss, he escalated the problem for investigation.

Case study from SharkFest 2011 “Packet Trace Whispering”

Problem Information

The basic information of the packet trace file is as follows:

The trace file was captured with tcpdump on Linux. It contains only 133 packets, each truncated to 54 bytes. The file data size is 28k bytes, the capture spans a relatively long 741.68 seconds, and the average rate is only 312 bps.

The conversation statistics show the IP addresses, which appear to have been anonymized, and 4 TCP streams.

The expert information shows a few Error entries at the protocol-dissection level, Warning entries such as TCP ACKed unseen segment, and the relatively common (suspected) retransmission and DUP ACK entries. The counts are small, so the packets themselves need a closer look.

Problem Analysis

Expand the packet trace file to see the following packet details:

The conversation statistics show only 4 TCP streams. A quick filter and browse shows that TCP Streams 0-2 are basically normal, with no packet loss or retransmission.

Alternatively, click the black arrow to jump directly to the problem. The TCP retransmissions and DUP ACKs are clearly visible, and they are in TCP Stream 3.

To analyze TCP Stream 3, first we need to look at the TCP three-way handshake information.

  1. The server port is 22, and we can then see that both ends are running SSH protocol version 2.0;
  2. The IRTT is 0.243327 seconds, about 243 ms, which shows that the client really is far from the server, across the international line;
  3. The MSS of both the client and the server is 1460;
  4. The client supports SACK, but the server does not;
  5. The TTL is 122 from the client and 52 from the server, which indicates that the capture point is somewhere in the middle of the path.
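The TTL reasoning in the last item can be sketched as a small calculation. This is only an illustration: it assumes the hosts started from one of the common initial TTL values (64, 128 or 255), which the trace itself cannot confirm, and the function name `estimate_hops` is made up for this example.

```python
# Common initial TTL values (an assumption; hosts can be configured otherwise).
COMMON_INITIAL_TTLS = (64, 128, 255)


def estimate_hops(observed_ttl):
    """Return (assumed_initial_ttl, hops_travelled) for an observed TTL."""
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            # Each router on the path decrements the TTL by one.
            return initial, initial - observed_ttl
    raise ValueError("TTL must be between 0 and 255")


client_initial, client_hops = estimate_hops(122)
server_initial, server_hops = estimate_hops(52)
print(f"client: assumed initial TTL {client_initial}, about {client_hops} hops away")
print(f"server: assumed initial TTL {server_initial}, about {server_hops} hops away")
```

Under these assumptions both sides have already crossed several routers before reaching the capture point, which is consistent with a capture taken mid-path rather than on either host.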

Jumping to the TCP retransmissions, the main findings are as follows:

1. After receiving data segment No.112 from the client, the server acknowledges it with ACK No.113. From then on, from No.114 to the end of the trace, there are only one-way packets from the server;

2. In No.114 – No.120, the 7 consecutive data segments sent by the server were never acknowledged by the client. This looks like packet loss: the client did not receive these packets;

3. Because the server received no acknowledgment from the client, a timeout retransmission occurred. TCP retransmitted 7 times with exponential backoff (No.121-No.125, No.130, No.133), at intervals of about 2.5s, 5s, 10s, 20s, 40s, 64s and 64s. Another rare phenomenon is that TCP aggregated the retransmission here: instead of retransmitting the previous 7 data segments separately, it sent them all at once in a single TCP segment of 928 bytes;

a. This can be verified by the TCP Seq Num, 2401 – 3329.

b. It can also be verified by the TCP Len: 88+212+68+308+100+100+52=928, which is 982 – 54 (14 bytes of Ethernet II header + 20 bytes of IP header + 20 bytes of TCP header).

c. This timeout retransmission behavior is quite unusual. It is not clear whether it comes from a particular timeout retransmission algorithm or is a special behavior of a certain kernel version. If anyone knows, please let me know. Thank you.

4. Of course, data is not retransmitted indefinitely. Once a certain number of retransmissions is reached with still no acknowledgment, TCP concludes that something is wrong with the network or the peer host, forcibly closes the connection, and notifies the upper-layer application that the communication was terminated abnormally;

5. There are also TCP ACKed unseen segment warnings. Comparing ACK Num 1313 with 1312, it appears that a FIN sent by the client was acknowledged. The subsequent DUP ACKs show the same thing. The packet trace file apparently did not capture some of the client's packets.
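The arithmetic in item 3 can be checked with a few lines of Python. This is a minimal sketch using only the numbers quoted above. The 64-second cap is inferred from the observed intervals, not taken from any known kernel setting, and the helper names are hypothetical.

```python
# Payload lengths of the seven unacknowledged segments No.114-No.120.
segment_lengths = [88, 212, 68, 308, 100, 100, 52]

# Header sizes quoted in the analysis: Ethernet II, IP and TCP, no options.
ETH_HEADER, IP_HEADER, TCP_HEADER = 14, 20, 20


def payload_from_frame(frame_len):
    """TCP payload length of a frame, given the fixed header sizes above."""
    return frame_len - (ETH_HEADER + IP_HEADER + TCP_HEADER)


def backoff_schedule(first_rto, retries, cap):
    """Double the timeout after every retry, never exceeding the cap."""
    schedule = []
    rto = first_rto
    for _ in range(retries):
        schedule.append(min(rto, cap))
        rto *= 2
    return schedule


print(sum(segment_lengths))          # total payload of the seven segments
print(3329 - 2401)                   # sequence-number span of the retransmission
print(payload_from_frame(982))       # payload carried by the 982-byte frame
print(backoff_schedule(2.5, 7, 64))  # compare with the observed intervals
```

All three payload figures come out as 928, and doubling from 2.5 seconds with a 64-second ceiling reproduces the seven observed intervals.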

So what actually caused the packet loss? Comparing the packet fields one by one reveals the following:

1. The DSCP value of server-side data segment No.111 is the default 0, and the DSCP value of ACK No.113 is also 0. The same holds for all server-side packets before No.111, and the two-way exchange is normal up to that point;

2. However, starting from No.114, the DSCP value of every server-side packet becomes 4, which is displayed as Unknown, i.e. undefined.
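The field comparison above can be sketched in Python. DSCP is the upper six bits of the IPv4 ToS byte and ECN is the lower two. The ToS bytes in the sample list are hypothetical: the trace only reports DSCP values 0 and 4, so the ECN bits are assumed to be zero here, and both function names are made up for this illustration.

```python
def split_tos(tos_byte):
    """Split the IPv4 ToS byte into (DSCP, ECN)."""
    return tos_byte >> 2, tos_byte & 0x03


def first_dscp_change(packets):
    """Return the number of the first packet whose DSCP differs from the
    previous one, or None if the DSCP never changes.

    `packets` is an iterable of (packet_number, tos_byte) tuples
    belonging to a single sender.
    """
    previous = None
    for number, tos in packets:
        dscp, _ecn = split_tos(tos)
        if previous is not None and dscp != previous:
            return number
        previous = dscp
    return None


# Hypothetical ToS bytes for the server-side packets discussed above:
# 0x00 -> DSCP 0 (default), 0x10 -> DSCP 4 (assuming ECN bits are 0).
server_packets = [(111, 0x00), (113, 0x00), (114, 0x10), (115, 0x10)]
print(split_tos(0x10))
print(first_dscp_change(server_packets))
```

Scanning one direction of a flow this way points straight at the first packet where the marking changes, which is the same comparison done by hand in the two items above.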

Final explanation:

Back to the original problem description: because these packets cross international lines end to end, the DSCP value of some packets was modified either by the source server (unlikely, in my personal guess) or by intermediate operator equipment (very likely). Further along the path, packets carrying this DSCP value were not matched for normal forwarding and were dropped. Repeated retransmissions could not get through either, and the connection finally ended with a FIN. Therefore, from the client's perspective, it was impossible to connect to the server.

Summary of the problem

An uncommon problem, with strange symptoms and an unexpected final cause. But anything is possible, right?