This content is produced by the AikeSheng open-source community. Original content may not be used without authorization; for reprints, please contact the editor and cite the source.
Approximately 1,000 words, estimated reading time: 3 minutes.
Compared to a standalone database such as MySQL, the access path to an OceanBase database is longer, so when a connection anomaly occurs, troubleshooting takes a few more steps of reasoning. Below is a case study of using packet capture to analyze a connection timeout issue in OceanBase.
1 Problem Description
This is a production environment, and the access path is approximately as follows:
Application -> VIP -> 3-node OBProxy -> 3-node OBServer
Here, the VIP floats via keepalived, and the versions are OBServer 4.2.1.1 and OBProxy 4.2.2.0. Occasionally, the application encounters a database connection timeout error: pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'xx.xx.xx.9' (timed out)", occurring roughly 1-2 times a day, as shown in the following log:
2 Analysis Process
1. Analyze OBProxy Logs
Approach 1: The VIP is bound to the OBProxy node at xx.xx.xx.12, so all application accesses to xx.xx.xx.9:3306 actually land on xx.xx.xx.12; therefore, first analyze the obproxy.log on node xx.xx.xx.12.
Approach 2: What is the purpose of analyzing the OBProxy logs?
- Check if the front-end connection from the application to OBProxy has any anomalies.
- If the application to OBProxy connection is normal, then check if the back-end connection from OBProxy to OBServer was successfully established.
Due to log recycling, only logs for the time frame of 03-11 08:14 are retained. Based on the above approach, search logs from the following dimensions:
- Filter for the keyword VC_EVENT_EOS to see if there are any front-end connections that have disconnected abnormally. The results returned empty, indicating none.
grep VC_EVENT_EOS obproxy.log.* | egrep -i 'tenant_name=[a-z].*'
- Continue to check the state of the back-end connections from OBProxy to OBServer. A small amount of statistics is needed here: first filter the requests that came from client xx.xx.xx.12 to xx.xx.xx.9:3306 during 03-11 08:14 and output the client's IP:PORT.
grep 'xx.xx.xx.9:3306' obproxy.log.20240311084410 | egrep '2024-03-11 08:14:[0-2].*' | egrep 'caddr=\{xx.xx.xx.12' | awk -F'caddr={' '{print $2}' | awk -F'}' '{print $1}'
- Then filter for the succ to set proxy_sessid event, which indicates that the connection was successfully established. The output is identical to that of the previous step, meaning all connections are normal (a scripted comparison of the two lists is sketched after this list).
grep 'succ to set proxy_sessid' obproxy.log.20240311084410 | egrep '2024-03-11 08:14:[0-2].*' | awk -F'client_addr=' '{print $2}' | awk -F'"' '{print $2}' | grep 'xx.xx.xx.12'
- Continue by outputting the server_sessid recorded after the connection is successfully established. All results are non-zero, indicating that every connection also established its back-end connection successfully.
grep 'succ to set proxy_sessid' obproxy.log.20240311084410 | egrep '2024-03-11 08:14:[0-2].*' | egrep 'client_addr="xx.xx.xx.12' | awk -F'server_sessid=' '{print $2}' | awk -F',' '{print $1}'
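To avoid comparing the two lists by eye, the outputs of the caddr filter and the proxy_sessid filter can be diffed directly. A minimal sketch reusing the commands above (the temporary file names are only for illustration):
# Client IP:PORT of requests that arrived at the VIP
grep 'xx.xx.xx.9:3306' obproxy.log.20240311084410 | egrep '2024-03-11 08:14:[0-2].*' | egrep 'caddr=\{xx.xx.xx.12' | awk -F'caddr={' '{print $2}' | awk -F'}' '{print $1}' | sort -u > /tmp/caddr.txt
# Client IP:PORT of connections that logged "succ to set proxy_sessid"
grep 'succ to set proxy_sessid' obproxy.log.20240311084410 | egrep '2024-03-11 08:14:[0-2].*' | awk -F'client_addr=' '{print $2}' | awk -F'"' '{print $2}' | grep 'xx.xx.xx.12' | sort -u > /tmp/sessid.txt
# Anything printed here is a connection seen by OBProxy but never successfully established
comm -23 /tmp/caddr.txt /tmp/sessid.txt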
Summary
The OBProxy logs are normal, so it is highly probable that OBProxy never received the request from the application side; the connection timeout is likely network-related. Suggested next step: capture packets on the application server.
2. Packet Capture Analysis
Since this is an intermittent issue and the fault time cannot be predicted, the only option is a long-running capture, with two considerations:
- Each capture file should not be too large, so the result files need to rotate automatically.
- Not too many packets should be captured, to limit the possible impact on the system, so the capture filter should be as specific as possible.
The packet capture command for this case is as follows:
tcpdump -X -s 0 -C 50 -W 500 -Z root -i lo -w /tmp/cap/obproxy.cap host xx.xx.xx.9
Parameter Explanation
- The application and OBProxy are on the same server, so the loopback interface -i lo must be specified during capture.
- -w /tmp/cap/obproxy.cap specifies the path of the result file.
- -C 50 rotates to a new result file once the current one reaches 50 MB.
- -W 500 limits the rotation to 500 result files; once full, the oldest file is overwritten.
- host xx.xx.xx.9 filters for packets to or from the VIP to reduce the capture volume.
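Because the capture has to run for days, it is usually started in the background and checked periodically. A minimal sketch reusing the command above (nohup and the checks are only one way to do this, not part of the original procedure):
mkdir -p /tmp/cap
# Run the rotating capture in the background so it survives the login session ending
nohup tcpdump -X -s 0 -C 50 -W 500 -Z root -i lo -w /tmp/cap/obproxy.cap host xx.xx.xx.9 >/dev/null 2>&1 &
# Confirm from time to time that files are rotating and disk usage stays bounded (at most roughly 50 MB x 500)
ls -lh /tmp/cap | tail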
At 15:52:57 on March 14, an error occurred in the application:
Once the packet is captured, how do we analyze these network packets to find the information we need?
Approach 3: Use Wireshark’s expert information (accessed via: Analyze -> Expert Information) to quickly find potentially abnormal packets.
The Note indicates retransmission, and we only need to find packets for port 3306. Only the connection 4232 > 3306 experienced retransmission:
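When screening many rotated capture files on the server, the same retransmission check can be done from the command line with tshark instead of the Wireshark GUI. A sketch assuming tshark is installed; the capture file name is hypothetical:
# List only retransmitted packets involving port 3306, with absolute timestamps
tshark -r /tmp/cap/obproxy.cap042 -Y 'tcp.analysis.retransmission && tcp.port == 3306' -t ad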
Upon opening the first retransmitted packet, i.e., packet 8517, check all packets in this connection (action: right click -> Follow -> TCP Stream):
Since the client and server are on the same server, the packets displayed show communication both ways.
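The same view can be reproduced on the command line; a sketch, again with a hypothetical file name, where the stream index is first looked up from packet 8517:
# Look up the TCP stream index of the first retransmitted packet (8517)
tshark -r /tmp/cap/obproxy.cap042 -Y 'frame.number == 8517' -T fields -e tcp.stream
# Print every packet in that stream (replace 0 with the index printed above)
tshark -r /tmp/cap/obproxy.cap042 -Y 'tcp.stream == 0' -t ad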
From the Client’s Perspective
- 15:52:47 Initiated a TCP connection request (the SYN of the three-way handshake) to xx.xx.xx.9:3306, i.e., packet 8359.
- 15:52:48 Retransmitted the SYN to xx.xx.xx.9:3306, i.e., packet 8517, marked by Wireshark as TCP Retransmission and TCP Port numbers reused because it reuses the same four-tuple as the first SYN (net.ipv4.tcp_tw_reuse=1).
- 15:52:50 Retransmitted the SYN again, packet 9075.
- 15:52:54 Retransmitted the SYN again, packet 10140.
From the Server’s Perspective
- 15:52:47 Sent a SYN+ACK packet to xx.xx.xx.12:4232, i.e., packet 8360.
- 15:52:48 Retransmitted the SYN+ACK packet, i.e., packet 8517.
- …
The capture also contains ICMP packets indicating Destination unreachable (Port unreachable). Combined with the information above, these were reported when the server sent its SYN+ACK to the client's port 4232; as a result, the client never received the server's SYN+ACK and kept retransmitting its SYN.
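The ICMP evidence can likewise be isolated with a display filter: Destination unreachable is ICMP type 3 and Port unreachable is code 3. A sketch, with the capture file name again hypothetical:
# Show only ICMP "Destination unreachable (Port unreachable)" packets
tshark -r /tmp/cap/obproxy.cap042 -Y 'icmp.type == 3 && icmp.code == 3' -t ad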
The application error occurred at 15:52:57, exactly 10 seconds after the client initiated the TCP connection, which matches Connector/Python's default connection_timeout of 10 seconds. So the application's timeout was caused by packet loss and retransmission; but why were the packets lost?
The investigation stalled here: the direct cause had been found, but the root cause was still one step away. Fortunately, the next occurrence of the error brought a breakthrough.
3. Second Packet Capture Analysis
Thirteen days later, on March 22nd, we captured packets at another fault time. Applying the same Wireshark analysis steps, the problem packets were as follows:
This anomalous traffic was almost identical to that of March 9th; most importantly, the client used the same random port, 4232. There was nothing more to say: with this evidence we questioned the network department and finally got confirmation:
3 Conclusion
Port 4232 was blocked for external access on the network, so whenever the client connected to OBProxy, or OBProxy connected to OBServer, using the random port 4232, the server's reply packets to port 4232 were dropped and the connection could not be established.
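A policy like this can be double-checked without waiting for the application to hit the bad port again, by forcing a connection from the suspect source port. A sketch assuming an nc (netcat) build that supports -p and -z; a normal ephemeral port should connect at once, while source port 4232 should time out:
# Connect to the VIP from a kernel-chosen ephemeral port (expected: success)
nc -vz -w 5 xx.xx.xx.9 3306
# Force local source port 4232 (expected in this case: timeout, because replies to 4232 are dropped)
nc -vz -w 5 -p 4232 xx.xx.xx.9 3306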
4 Solution
Set the kernel parameter net.ipv4.ip_local_port_range to restrict the range of local random ports on the server and prevent the use of blocked ports:
sysctl -w net.ipv4.ip_local_port_range="10000 60999"
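The sysctl -w change only lasts until the next reboot. To make it permanent, one would normally also write it to the sysctl configuration; a sketch, noting that the configuration file location can differ by distribution:
# Persist the ephemeral port range and reload
echo 'net.ipv4.ip_local_port_range = 10000 60999' >> /etc/sysctl.conf
sysctl -p
# Verify the value currently in effect
sysctl net.ipv4.ip_local_port_range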
Tip: During OB cluster deployment, initializing the OBServer machines automatically sets net.ipv4.ip_local_port_range = 3500 65535. Pay extra attention to whether this conflicts with network policies.
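Before going live, it is worth confirming on every node that the effective range does not overlap any port blocked by network policy. A minimal check, using this case's blocked port 4232 as the example:
# Show the effective range on this node
sysctl -n net.ipv4.ip_local_port_range
# Warn if a known-blocked port (4232 here) falls inside the ephemeral range
sysctl -n net.ipv4.ip_local_port_range | awk '{ if (4232 >= $1 && 4232 <= $2) print "WARNING: blocked port 4232 is inside the ephemeral range"; else print "OK" }'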