Troubleshooting TCP Retransmission Issues on Linux Servers


Background

This article explores the troubleshooting process for TCP retransmission issues encountered on a Linux server used in a business system. An R&D colleague reported frequent TCP retransmissions across various types of communication, including business data, SSH, and SFTP traffic. By analyzing the network trace files, we pinpoint the underlying causes of the retransmissions and outline effective solutions to address these network challenges.

My first reaction: impossible. If that were really the case, the business would have run into problems long ago. But I couldn't be completely sure; getting proven wrong is always possible. What if?

Problem Analysis

Since retransmission was the reported symptom, from a packet-analysis standpoint we naturally had to see it with our own eyes. We communicated further to gather the basics: the person in charge of the business system provided some end-to-end IP communication pairs, and we learned that the system was deployed across multiple data centers and that multiple machines showed the same problem.

At this point I felt relieved. Whatever it was, it definitely had nothing to do with the network itself; otherwise the company's business would have collapsed long ago.

Still, the retransmission phenomenon was there, and since it existed, it deserved a reasonable explanation. Only by looking at the packets directly could we understand the problem clearly.

Trace files were collected from three servers. The first obstacle was sheer size: the largest file was 3.8 GB before compression (434 MB compressed), which is already very difficult for Wireshark to parse. It loads, and loads, and loads…


Packet Capture

From personal experience: if you can avoid capture filters, don't use them. The only reasons to reach for one are a performance problem, or being 120% proficient in the protocol. The performance problem here is that the trace file is too large; Wireshark cannot load and analyze it well, and every operation triggers yet another long load.
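When a full, unfiltered capture is unavoidable, tcpdump's file rotation at least keeps any single trace at a manageable size. A minimal sketch; the interface name is an assumption, not from this case:

```shell
# Sketch: bound capture size with tcpdump file rotation.
# IFACE is a hypothetical interface name; check the real one with `ip link`.
IFACE=eth0
if command -v tcpdump >/dev/null 2>&1; then
  # -C 100: start a new output file every ~100 MB
  # -W 10:  keep at most 10 rotated files, overwriting the oldest
  # timeout/|| true: keep this sketch from blocking or failing without root
  timeout 2 tcpdump -i "$IFACE" -C 100 -W 10 -w trace.pcap 2>/dev/null || true
fi
# Worst-case disk usage with these settings: 100 MB x 10 files
echo "max disk usage: $((100 * 10)) MB"
```

With rotation in place, even a 37-minute capture never produces one multi-gigabyte file that Wireshark has to swallow whole.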

Looking at the largest trace file with capinfos: the file is about 4 GB, the capture lasted 37 minutes, and roughly 4.7 million packets were captured. No wonder it's so big…
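This is the kind of check capinfos (shipped with Wireshark) enables without ever opening the GUI; the rate arithmetic below simply restates the numbers above:

```shell
# Summarize a trace from the CLI instead of loading it into Wireshark.
if command -v capinfos >/dev/null 2>&1; then
  capinfos 20220226.cap || true   # the file from this article; substitute your own
fi
# Back-of-envelope rate implied by the summary: ~4.7M packets over 37 minutes
echo "avg packets/s: $((4700000 / (37 * 60)))"
```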


Packet Analysis

Opening such a large trace file directly for analysis is out of the question. Given the symptom reported by the person in charge of the business system, that seemingly every TCP packet was retransmitted, we can use editcap to extract the first 1,000 packets for a quick look.

$ editcap -r 20220226.cap test.pcapng 1-1000

The processed test trace file is only 786 KB. In fact, the problem is apparent as soon as the file is opened: the packets in the trace are duplicated.

[Figure: troubleshoot TCP retransmission issues]

The cause was actually already indicated in the capinfos output: Linux cooked-mode capture v1 (25 – linux-sll), which means tcpdump captured on the any pseudo-interface, i.e. on more than one network device at once. Further inquiry with the person in charge of the business system confirmed that the command actually run was tcpdump -i any xxx, and that the server's NICs were bonded: the physical NIC captured a packet, and the logical bond interface captured the same packet again, so Wireshark naturally shows a large number of duplicates.
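The capture-side fix is simply to name one interface instead of any. A sketch, assuming the bond device is called bond0 (verify the real name with ip link on the affected host):

```shell
# Capture on the logical bond device only, so each packet is seen once.
# "bond0" is an assumed device name, not taken from the article.
IFACE=bond0
echo "capturing on: $IFACE (instead of -i any)"
if command -v tcpdump >/dev/null 2>&1; then
  # timeout/|| true: keep this sketch from blocking or failing without root
  timeout 2 tcpdump -i "$IFACE" -c 100 -w trace.pcap 2>/dev/null || true
fi
```

Capturing on a real interface also avoids the linux-sll cooked encapsulation entirely, which is itself a useful tell when reading someone else's trace.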

How do we remove the duplicates? Again with editcap. Of the 1,000 packets in the original test.pcapng, 517 turn out to be duplicates; after deduplication, 483 packets remain in test1.pcapng.
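A sketch of the deduplication step, with the file names used above:

```shell
# editcap -d drops any packet identical to one of the 5 packets before it;
# editcap -D <n> widens that comparison window if duplicates are spaced farther apart.
if command -v editcap >/dev/null 2>&1; then
  editcap -d test.pcapng test1.pcapng || true
fi
# The article's numbers: 1000 packets captured, 517 flagged as duplicates
echo "packets left after dedup: $((1000 - 517))"
```

The default 5-packet window is enough here because the bond and physical copies of each packet arrive back to back.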

Opening the deduplicated trace file, the packets look normal, and the actual business traffic was, naturally, normal as well.

Summary of the problem

Capture packets the correct way, make good use of the CLI tools, and apply sound judgment and analysis, and the problem naturally solves itself.
