Network Video App Packet Loss Troubleshooting

Background

In this packet loss troubleshooting case study, a network video app could not play video over China Mobile Wi-Fi or cellular data, although it worked fine over China Telecom. At first it seemed that the mobile operator was blocking the video service. However, analysis of the packet traces made it clear that packet loss and retransmission behavior, not operator blocking, was the primary issue. This article walks through the steps taken to diagnose and resolve the issue.

OK, I wrongly blamed you this time, China Mobile. Below is a brief record of the troubleshooting process.

Packet Loss Troubleshooting: Problem Overview

China Telecom

The basic information of the normal packet trace file (TV-01.pcapng) is as follows. The trace was captured with Wireshark on Windows. It contains 1633 packets, and the average rate is 776 kbps.

China Mobile

The basic information of the abnormal packet trace file (TV-02.pcapng) is as follows. This trace was also captured with Wireshark on Windows. It contains 803 packets, and the average rate is only 97 kbps. Compared with the China Telecom trace, this is clearly abnormal, and it roughly matches the 0.0kb/s reading that the video app displays during playback.

A quick look at Wireshark's Expert Information shows some out-of-order segments, retransmissions, and fast retransmissions.

Identifying Packet Loss in Mobile Data Traffic

China Telecom

Moving on to the actual packet analysis, we start with the normal China Telecom trace. TCP Stream 4 looks normal: it contains the TCP three-way handshake, the client's GET request, the server's HTTP 206 response, and the video data transfer.

China Mobile

In the abnormal China Mobile trace, several TCP streams show obvious spurious retransmissions. Stream 1 is used as an example below.

The basic analysis is as follows:

  1. The capture was taken on the client: some packets are only 54 bytes long, which means they were captured before the network card padded them to the 60-byte minimum Ethernet frame length for transmission;
  2. The TCP three-way handshake negotiates MSS 1400, SACK support, and so on;
  3. TCP Spurious Retransmission: packets with a length of 1106 bytes are repeatedly flagged as spurious retransmissions, each followed by a DUP ACK;
  4. The IRTT is about 24 ms.
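The 54-byte figure in point 1 can be checked with simple arithmetic. The sketch below assumes IPv4, no VLAN tag, and no IP or TCP options (none of which the article states explicitly), and uses the 60-byte minimum frame length quoted above, which excludes the 4-byte FCS.

```python
# Header sizes below are assumptions for a bare TCP ACK over IPv4/Ethernet.
ETH_HEADER = 14     # destination MAC + source MAC + EtherType
IPV4_HEADER = 20    # IPv4 header without options
TCP_HEADER = 20     # TCP header without options
MIN_ETH_FRAME = 60  # minimum Ethernet frame length, excluding the FCS


def bare_ack_length() -> int:
    """Length of a TCP ACK that carries no options and no payload."""
    return ETH_HEADER + IPV4_HEADER + TCP_HEADER


def padding_needed(frame_len: int) -> int:
    """Bytes of padding required to reach the minimum frame length."""
    return max(0, MIN_ETH_FRAME - frame_len)


print(bare_ack_length())                  # 54
print(padding_needed(bare_ack_length()))  # 6
```

A frame that appears in a capture at 54 bytes has therefore not yet been padded, which is consistent with it being recorded on the sending host.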

What is TCP Spurious Retransmission? It is a label that Wireshark assigns by correlating a packet with its context in the capture. Put simply, Wireshark has already seen a TCP segment and the ACK that covers it, so one round trip is complete, and then the same TCP segment shows up again. Wireshark marks that repeat as a spurious retransmission, and the receiver naturally answers it with a DUP ACK.
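The idea can be sketched as a small check over a list of segments. This is a simplified heuristic written for illustration, not Wireshark's actual implementation; the `Segment` type, the sequence numbers, and the payload length are hypothetical values loosely modeled on packets No.398 and No.400-No.402.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    number: int        # frame number in the capture
    from_server: bool  # True for server -> client, False for client -> server
    seq: int           # relative sequence number
    length: int        # TCP payload length in bytes
    ack: int           # relative acknowledgment number


def find_spurious(segments):
    """Return frame numbers of server data segments that were already ACKed."""
    highest_client_ack = 0
    flagged = []
    for s in segments:
        if s.from_server:
            # Data whose last byte is already covered by an ACK we have seen.
            if s.length > 0 and s.seq + s.length <= highest_client_ack:
                flagged.append(s.number)
        else:
            highest_client_ack = max(highest_client_ack, s.ack)
    return flagged


# Hypothetical values: data, ACK, repeated data, duplicate ACK.
trace = [
    Segment(398, True, seq=1, length=1052, ack=1),
    Segment(400, False, seq=1, length=0, ack=1053),
    Segment(401, True, seq=1, length=1052, ack=1),
    Segment(402, False, seq=1, length=0, ack=1053),
]
print(find_spurious(trace))  # [401]
```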

When Nagle meets delayed ACK

In fact, this is a classic problem that arises when the Nagle algorithm meets delayed ACK. The following does not go into detail on either mechanism, but quotes some online material:

Nagle’s Algorithm

Specific steps:

  • If the data to be sent is at least 1 MSS, send it immediately;
  • If no previously sent packet is still unACKed, send immediately;
  • If a previously sent packet is still unACKed, buffer the data;
  • When an ACK is received, send the buffered data immediately.
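The steps above can be sketched as a single decision function. This is a minimal illustration of the rules as listed here, not a TCP stack's real code; the function name and parameters are hypothetical, and the MSS of 1400 is the value seen in the handshake.

```python
def nagle_should_send(pending_bytes: int, mss: int, unacked_bytes: int) -> bool:
    """Decide whether buffered data may be sent now under the rules above."""
    if pending_bytes >= mss:
        return True   # a full-sized segment is always sent
    if unacked_bytes == 0:
        return True   # nothing in flight, so small data may go out
    return False      # small data waits until the outstanding data is ACKed


MSS = 1400
print(nagle_should_send(1400, MSS, unacked_bytes=1052))  # True
print(nagle_should_send(300, MSS, unacked_bytes=0))      # True
print(nagle_should_send(300, MSS, unacked_bytes=1052))   # False
```

The fourth rule is the same function called again after an ACK arrives: once `unacked_bytes` drops to 0, the buffered data is released.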

Delayed ACK

TCP Delayed ACK improves network performance by combining several ACK responses into a single response, or by sending the ACK together with response data, thereby reducing protocol overhead.

Specific steps:

  • When there is response data to send, the ACK is sent immediately along with that data;
  • If there is no response data, the ACK is delayed to see whether response data turns up that it can be sent with;
  • If the peer's second data packet arrives while the ACK is being delayed, the ACK is sent immediately. However, if three data packets from the peer arrive one after another, whether an ACK is sent immediately when the third one arrives depends on the two conditions above.
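These rules can also be written as a small decision function. It is only a sketch of the behavior described in the list; the function name, the return labels, and the explicit `timer_expired` parameter are assumptions added for illustration (the list only says the ACK is delayed, without naming a timer).

```python
def delayed_ack_action(has_response_data: bool,
                       unacked_segments: int,
                       timer_expired: bool) -> str:
    """Return what the receiver does about a pending ACK."""
    if has_response_data:
        return "piggyback"  # ACK rides along with the response data
    if unacked_segments >= 2:
        return "ack_now"    # a second data segment arrived while waiting
    if timer_expired:
        return "ack_now"    # waited long enough without response data
    return "wait"           # keep delaying the ACK


print(delayed_ack_action(True, 1, False))   # piggyback
print(delayed_ack_action(False, 2, False))  # ack_now
print(delayed_ack_action(False, 1, False))  # wait
```

The last case is the one that matters here: a single small segment with no response data leaves the receiver waiting.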

Nagle Algorithm and Delayed ACK

Back to the actual China Mobile problem

The Client and YdServer exchange data: YdServer runs the Nagle algorithm, and the Client runs the Delayed ACK algorithm. When YdServer sends data packet No.398 to the Client, the Client does not respond immediately because of Delayed ACK. Because YdServer uses the Nagle algorithm, it waits for the ACK from the Client and will not send the second data packet until that ACK arrives. The exchange is therefore delayed by about 200 ms. The Client then sends ACK No.400, but YdServer does not receive that acknowledgment in time, so it performs timeout retransmission No.401. The capture on the client therefore flags No.401 as a spurious retransmission, and the Client answers with DUP ACK No.402. After that, the same four-packet pattern seen in No.398 and No.400-No.402 repeats.
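The timing behind this pattern can be sketched with two small functions. The 24 ms round trip and the 200 ms ACK delay come from the trace and the description above; the 200 ms RTO is a hypothetical value chosen for illustration, since the server's actual timer is not visible in a client-side capture.

```python
def ack_arrival_ms(rtt_ms: float, ack_delay_ms: float) -> float:
    """Time after sending at which the sender sees the delayed ACK."""
    return rtt_ms + ack_delay_ms


def retransmits_spuriously(rto_ms: float, rtt_ms: float,
                           ack_delay_ms: float) -> bool:
    """True if the retransmission timer fires before the ACK arrives."""
    return rto_ms < ack_arrival_ms(rtt_ms, ack_delay_ms)


RTT_MS = 24.0           # IRTT reported for the stream
ACK_DELAY_MS = 200.0    # delayed-ACK wait described above
ASSUMED_RTO_MS = 200.0  # hypothetical server RTO, not taken from the trace

print(ack_arrival_ms(RTT_MS, ACK_DELAY_MS))                          # 224.0
print(retransmits_spuriously(ASSUMED_RTO_MS, RTT_MS, ACK_DELAY_MS))  # True
print(retransmits_spuriously(300.0, RTT_MS, ACK_DELAY_MS))           # False
```

Under these assumptions the ACK reaches the server about 224 ms after the data was sent, so any RTO shorter than that produces a retransmission of data the client already holds.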

After No.602 and No.612, this pattern suddenly disappears. Why? Because the RTO is adjusted dynamically and has grown, ACK No.612 reaches YdServer before its retransmission timer expires, so no timeout retransmission occurs. Even so, this abnormal behavior slows the whole data transfer, so the app cannot be used normally and the page displays 0.0kb/s.
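How an RTO grows once delayed round trips are measured can be sketched with the standard smoothing rules from RFC 6298. This is an illustration only: it omits the clock granularity term, the minimum RTO, and exponential backoff, the `RtoEstimator` name is hypothetical, and the 224 ms sample is an assumed value (24 ms round trip plus a 200 ms ACK delay), not a measurement from the server.

```python
class RtoEstimator:
    """Simplified RFC 6298 estimator: RTO = SRTT + 4 * RTTVAR."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variation

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, sample_ms: float) -> float:
        """Feed one RTT sample and return the new RTO in milliseconds."""
        if self.srtt is None:
            # First measurement initializes both estimators.
            self.srtt = sample_ms
            self.rttvar = sample_ms / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - sample_ms))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample_ms
        return self.srtt + 4 * self.rttvar


est = RtoEstimator()
print(est.update(24.0))   # 72.0
print(est.update(224.0))  # 285.0
```

In this sketch a single delayed sample is enough to push the RTO above 224 ms. A real stack behaves differently in detail, so this is not a reconstruction of the trace.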

Summary of the problem

The YdServer in question is actually a China Mobile CDN node server. The problem was reported to the application side, the traffic was subsequently rescheduled to a different CDN node, and service returned to normal. The lesson: in some application scenarios, do not use the Nagle algorithm and delayed ACK together.
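One common way to avoid the combination on the sending side is to disable the Nagle algorithm with the TCP_NODELAY socket option. The sketch below shows this in Python; it is a general illustration and not what was done in this case, where the fix was rescheduling the CDN node. The helper name `make_low_latency_socket` is hypothetical.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 tells the stack to send small segments immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# A non-zero value confirms that the option is set.
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)  # True
sock.close()
```

Disabling Nagle trades more small packets for lower latency, so whether it is appropriate depends on the application's write pattern.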