Comprehensive TCPDump Analysis for Troubleshooting Batch Processing Errors: A Case Study

Before we begin: Let’s start with an overview of the tcpdump analysis.

Recently, an insurance client approached me with an urgent request: “Hey Tao, we’ve been seeing errors every night during batch processing. Out of a hundred thousand HTTP POST requests, around six or seven hundred fail. The issue can be reproduced at any time. Please take a quick look.” As the seasoned expert on call, I walked through the entire business access path, tracked down the cause, prescribed a remedy, and essentially cured the problem.

Since this fault scenario is very representative, I have written it up in the hope that it gives you some ideas for day-to-day operations and troubleshooting. Without further ado, here are the insights.

Network Logic Topology and Business Flow:

The logical topology diagram of the business is shown below.

Batch processing business servers: Server1 10.160.XX.81:8000, Server2 10.160.XX.82:8000. The front-end F5 device load-balances and publishes the service on the VIP address 10.50.XX.67:8165.

Test terminal: Sends POST requests to the F5 VIP 10.50.XX.67:8165 using curl (the batch processing flow).

Access Process:

  1. The user submits a POST request to F5 VIP 10.50.XX.67:8165 using a script through curl.
  2. F5 forwards the client’s POST request to a real server, Server1 or Server2, according to the load balancing algorithm.
  3. A domestic XX letter NGFW device sits between F5 and the servers. The upstream F5 device connects to the firewall’s feth11 interface, and the downstream switch connects to the firewall’s feth10 interface. Keep these ingress and egress interfaces in mind; they come up again in the packet analysis.

Fault Phenomenon:

When the user submits POST requests to the F5 VIP 10.50.XX.67:8165 using a script, the script hangs after running for some time: a POST request is sent out, no response comes back, and a few seconds later the client receives a RESET packet from F5.
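
Because the whole analysis depends on matching a failed request to packets by timestamp, it helps if the reproduction script records when each failure happened. The sketch below is a minimal illustration and not the client's actual script (which used curl); `post_once`, `run_batch`, the URL, and the request body are hypothetical names and placeholders.

```python
import datetime
import urllib.request


def post_once(url, body, timeout=10.0):
    """Send one HTTP POST and return the status code; raises on reset or timeout."""
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run_batch(send, count):
    """Call send() count times and record a timestamp for every failure."""
    failures = []
    for index in range(count):
        try:
            send()
        except OSError as exc:
            # URLError, ConnectionResetError, and socket timeouts are all OSError subclasses.
            stamp = datetime.datetime.now().strftime("%H:%M:%S")
            failures.append((index, stamp, type(exc).__name__))
    return failures
```

A call such as `run_batch(lambda: post_once("http://VIP:8165/", b"payload"), 100000)` would return one `(index, time, error)` tuple per failed request, and the time is what you then look up in the packet captures. The URL path and payload here are placeholders, since the article does not give them.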

The Magic Tool TCPDUMP:

At first glance, F5 sent the RESET packet, so it looks like F5’s fault. However, seeing isn’t always believing; we have to look past the appearance to find the essence of the problem. The device that sends the RESET packet is not always the one at fault.

Without further ado, let’s take out the magic tool, tcpdump. A complex problem like this calls for a powerful tool. While the client ran the batch processing operations, we captured packets at the following three positions.

1. Packet capture on the F5 device using tcpdump

2. XX letter firewall packet capture, capturing all traffic passing through the upstream interface feth11 and the downstream interface feth10.

3. Use tcpdump on Server1 10.160.XX.81 and Server2 10.160.XX.82 to capture all packets passing through the server network card.

While capturing packets:

When the test terminal runs the batch processing script, packet capture is performed at all three points simultaneously. When the test terminal reproduces the fault, stop packet capture at the three points.
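
As a rough illustration of the capture step, the sketch below assembles tcpdump command lines for the F5 and server capture points. It is only a sketch under assumptions: `build_tcpdump_command` is a hypothetical helper, the interface name `eth0`, the output paths, and the address 192.0.2.247 are placeholders (the real addresses are masked in this article), and `0.0:nnnp` is, to my understanding, the F5-specific interface syntax that captures all VLANs with the extra details the Wireshark F5 plugin decodes. The firewall capture was taken with the firewall's own tooling and is not covered here.

```python
def build_tcpdump_command(interface, output_file, host=None, port=None):
    """Return a tcpdump argument list that writes full packets to a pcap file."""
    # -nn: no name or port resolution, -s 0: capture whole packets, -w: write to file
    command = ["tcpdump", "-nn", "-s", "0", "-i", interface, "-w", output_file]
    terms = []
    if host is not None:
        terms += ["host", host]
    if port is not None:
        if terms:
            terms.append("and")
        terms += ["port", str(port)]
    return command + terms


# Hypothetical values: the real addresses are masked in this article.
server_cmd = build_tcpdump_command("eth0", "/tmp/server2.pcap", host="192.0.2.247")
f5_cmd = build_tcpdump_command("0.0:nnnp", "/var/tmp/f5.pcap", port=8165)
print(" ".join(server_cmd))
print(" ".join(f5_cmd))
```

Returning a list rather than a single string means the command can be passed directly to `subprocess.run` without shell quoting problems.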

Data Packet Analysis:

Sure enough, the fault was successfully reproduced. Now we enter the most important stage: packet analysis (also known as working out who takes the blame):

1. Analysis of data packets on F5: Client-side analysis from test client to F5 VIP.

Pay attention to the fault packet time: 20:57:12. The following analysis is based on the fault data packet at this time point. Pay attention to the packet timestamp!!!

By viewing the data packets captured on F5 in Wireshark, we found the fault session and observed the following:

  1. The TCP three-way handshake between client 10.50.XX.88:54373 and F5 VIP 10.50.XX.67:8165 completed successfully.
  2. The client sent a POST request, and F5 acknowledged receiving it (ACK).
  3. F5 did not send an HTTP response to the client.
  4. F5 returned an RST to the client, with the RST reason: F5RST(peer) TCP retransmit timeout.

In fact, after F5 and the client complete the TCP three-way handshake, F5 selects a server according to the load balancing algorithm, establishes a TCP three-way handshake with it, and then forwards the client’s POST request to that real server:

Using the F5 plugin for Wireshark, we can see the following information for this session flow.

  1. In this session, F5 selected server 10.160.XX.82:8000.
  2. F5 enabled source address translation, translating the client’s real IP address 10.50.XX.88 to 10.50.XX.247.
  3. The source port after translation was 43166.

2. Analysis of data packets on F5: Analysis of traffic from F5 (10.50.XX.247) to 10.160.XX.82:8000 (serverside):

By viewing the data packets captured on F5 in Wireshark and applying the display filter tcp.port == 43166, we observed the following:

  1. The F5 side (10.50.XX.247:43166) attempted to establish a connection with server2 10.160.XX.82:8000.
  2. F5 sent a SYN packet to server2, but server2 did not respond with a SYN-ACK.
  3. F5 retransmitted the SYN, and after three retransmissions without a response from the server side, F5 sent an RST to forcibly close the client’s connection.
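
The pattern above, SYN packets that never get a SYN-ACK, can also be searched for mechanically once the capture has been exported to simple records. The sketch below is an illustration only: the `Packet` record and `unanswered_syns` are hypothetical names, and a real capture would need to be parsed into this shape first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    time: str   # capture timestamp, e.g. "20:57:12"
    src: str    # "ip:port" of the sender
    dst: str    # "ip:port" of the receiver
    flags: str  # "S" for SYN, "SA" for SYN-ACK, "R" for RST


def unanswered_syns(packets):
    """Map each (src, dst) flow to its SYN timestamps if no SYN-ACK ever came back."""
    syn_times = {}
    answered = set()
    for packet in packets:
        if packet.flags == "S":
            syn_times.setdefault((packet.src, packet.dst), []).append(packet.time)
        elif packet.flags == "SA":
            # A SYN-ACK travels in the reverse direction of the SYN it answers.
            answered.add((packet.dst, packet.src))
    return {flow: times for flow, times in syn_times.items() if flow not in answered}
```

A flow that appears in the result with several timestamps is a connection attempt whose SYN was retransmitted and never answered, which is the signature seen for port 43166 in this case.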

3. Server-side packet analysis: Server2 10.160.XX.82 packet analysis:

From the above analysis of the data packets on F5, we examined the client-side and server-side data interactions and drew a preliminary conclusion that F5 failed to establish a connection with server2. Potential reasons are:

  1. The firewall forwarded the SYN packet from F5 to server2, but server2 did not receive it.
  2. The firewall forwarded the SYN packet from F5 to server2, server2 received it but did not respond.
  3. The firewall did not receive the SYN packet from F5 that was sent to server2.
  4. The firewall received the SYN packet from F5 but did not forward it to server2.

Let’s check assumption 2 first, which is easy: we look directly at the packets captured on server2. If server2 received a packet from the firewall at 20:57:12, the firewall forwarded it, and the same capture shows whether server2 responded. From the capture, we gather the following information:

After 20:57:03, server2 10.160.XX.82 did not receive any packets from F5 10.50.XX.247:43166.

4. Clearing the Clouds: XX letter firewall packet analysis:

The capture on server2 shows that server2 never received the packet from F5. Let’s now focus on the XX letter firewall; its packet capture lets us settle the remaining assumptions and pinpoint the problem.

Now witness the miracle moment: open the XX letter firewall packet capture and take a look:

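Locating the fault packets at 20:57:13, 20:57:14, and 20:57:15, we see that the XX letter firewall’s upstream interface feth11 received the SYN packets sent by F5 to server2, but they never left through the feth10 interface. This matches assumption 4: the firewall received the SYN packet from F5 but did not forward it to server2.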
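
The elimination logic used across the three capture points can be summarized in a few lines. This is only a sketch of the reasoning, with the hypothetical names `CAPTURE_POINTS` and `locate_drop`; the input is simply whether the SYN was observed at each point along the path.

```python
# Capture points in the order the SYN travels from F5 to server2.
CAPTURE_POINTS = [
    "F5 serverside",
    "firewall feth11 (ingress)",
    "firewall feth10 (egress)",
    "server2 NIC",
]


def locate_drop(seen):
    """Given whether the SYN was seen at each capture point, say where it was lost."""
    previous = None
    for point in CAPTURE_POINTS:
        if not seen.get(point, False):
            if previous is None:
                return f"SYN not observed at {point}"
            return f"SYN lost between {previous} and {point}"
        previous = point
    return "SYN reached the last capture point; the receiver did not answer"


# The situation found in this case study.
verdict = locate_drop({
    "F5 serverside": True,
    "firewall feth11 (ingress)": True,
    "firewall feth10 (egress)": False,
    "server2 NIC": False,
})
print(verdict)
```

Walking the path in order means the first capture point where the packet is missing identifies the hop that dropped it, which is why capturing at all three positions at the same time matters.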


Fault Location:

Through the above analysis, we can roughly reconstruct the root cause of the fault:

Initially, the client started a batch processing run. Partway through the continuous POST operations, the XX letter firewall stopped forwarding some packets correctly, so the server side never received the SYN packets sent by F5, and F5 retransmitted them. After three unsuccessful retransmissions, F5 gave up on the server-side connection and sent an RST to the client to forcibly close the session.

F5 RST Mechanism

Due to F5’s full proxy architecture, the client-to-F5 side (clientside) and the F5-to-server side (serverside) are handled by two separate TCP/IP protocol stacks. Thanks to this design, F5 can run HTTP/2 between the client and F5 and HTTP/1.1 between F5 and the server, or HTTPS on the front end and HTTP on the back end. Because it terminates both sides, F5 performs RFC compliance checks on the TCP and HTTP protocols, and when it detects specific packet structures or protocol security issues, it may send an RST to ensure business security and device stability. For explanations of the RST causes, see the link:

Link:

https://support.f5.com/csp/article/K13223

Conclusion

Capturing packets hop by hop and closing in on the truth step by step brings a real sense of satisfaction. Although the fault turned out not to be in our F5 device, helping the client locate the problem quickly is exactly our value as a service provider. If there’s a pit, we’ll fill it, and fill it completely! I’ve been busy with F5 CVE matters recently and have had no time to write. No worries, folks; I’ve already applied to test the commercial version of Rancher and am now setting up the demo environment. An article on integrating Rancher with F5 and k8s clusters is on the way!