MySQL Performance Optimization Case – High TCP Cache Configuration Causing a Large Number of MySQL Connection Interruptions.
1. Background
During the execution of batch tasks, the application encountered an issue: some tasks suddenly lost their database connections, preventing them from completing. From the database error logs, we found messages indicating “Aborted connection,” which means the communication between the client and server was unexpectedly interrupted.
2. Analysis
To identify the cause of the problem, we initially analyzed several common situations that could lead to a connection being aborted based on our experience:
- The client did not properly close the connection and did not call the
mysql_close()
function. - The client was idle for a time exceeding the seconds specified by the
wait_timeout
orinteractive_timeout
parameters, causing the server to automatically disconnect. - The size of the data packets sent or received by the client exceeded the value of the
max_allowed_packet
parameter, resulting in a connection interruption. - The client attempted to access the database but lacked the necessary permissions, used the wrong password, or the connection packet did not contain the correct information.
However, upon investigation, we found that none of the above situations applied to the current issue. The tasks had been running normally before, and the program had not changed, allowing us to rule out the first situation. We checked MySQL’s timeout parameters wait_timeout
and interactive_timeout
and found them both set to 28800 seconds (8 hours), which far exceeded the execution time of the tasks, allowing us to rule out the second situation as well. We also examined the max_allowed_packet
parameters for both the client and server, finding them both set to 64M, making it unlikely that this limit was exceeded, so we ruled out the third situation. Furthermore, we confirmed that the client’s database access permissions, password, and connection packet information were all correct, ruling out the fourth situation.
At this point, we preliminarily felt that there should be no issues at the MySQL level, and the problem might lie elsewhere.
To further pinpoint the issue, we attempted to modify some relevant kernel parameters on the server as follows:
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 120
net.core.rmem_default = 2097152
net.core.wmem_default = 2097152
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_syn_backlog = 16384
These parameters were primarily aimed at optimizing network connection performance and stability to prevent connections from being unexpectedly closed or timing out. However, the modified results did not improve the situation; connections continued to be unexpectedly interrupted.
Finally, we conducted packet capture analysis using the Wireshark tool, where we observed an unusual phenomenon: the server was sending a large number of ACK packets to the client.
These ACK packets are acknowledgment packets in the TCP protocol, indicating that the server has received the client’s data packets and is requesting the client to continue sending data. But why would the server send so many ACK packets? We speculated that there might be network anomalies preventing the client from receiving the server’s ACK packets, causing the server to repeatedly send ACK packets until it timed out or received a response from the client. However, after the network team conducted checks, no significant issues were found.
Continuing with the packet analysis, we discovered another unusual phenomenon: the client was sending some window warnings to the server. These window warnings are part of TCP’s flow control mechanism, indicating that the receiving window of either the server or the client is full and cannot accept more data.
[TCP Window Full]
is a window warning sent from the sender to the receiver, indicating that the receiving end has reached its data limit.
[TCP ZeroWindow]
is a window warning sent from the receiver to the sender, informing the sender that the receiving window is full and temporarily halting data transmission.
Based on the above information, we deduced the problem’s cause: since MySQL needed to send a large amount of data, the client’s TCP buffer became full, requiring it to wait until it could process the data in the TCP buffer before continuing to receive more data. However, during this time, MySQL would continuously request the client to send more data. If the client did not respond within a certain period (default is 60 seconds), MySQL would assume that the data transmission had timed out, resulting in a connection interruption.
To validate our hypothesis, we examined MySQL’s slow log and discovered numerous entries with Last_errno: 1161
.
These entries indicated that MySQL encountered a timeout error while sending data, and the frequency of these errors closely matched the number of failed tasks in the application. According to the MySQL official documentation, this error signifies:
Error number: 1161; Symbol: ER_NET_WRITE_INTERRUPTED; SQLSTATE: 08S01
Message: Got timeout writing communication packets
This indicates a network write interruption, and there is a MySQL parameter that controls this behavior. Therefore, we attempted to change the net_write_timeout
parameter to 600 seconds, and the batch tasks ran normally.
Thus, the reason for the MySQL connection being unexpectedly interrupted was that the client was retrieving too much data, exceeding its TCP buffer. The client needed to process the data in the buffer first. During this time, MySQL would continuously request the client to send more data, but the client was unable to respond within 60 seconds, leading to a timeout in data transmission and the interruption of the connection.
2. Conclusion
Through the analysis and attempts described above, we arrived at the following conclusions:
- The numerous ACK messages in the packet capture were due to the client’s buffer being full and unable to promptly provide feedback to the server, causing the server to repeatedly send ACK messages until exceeding 60 seconds (the default value for
net_write_timeout
), resulting in MySQL terminating the connection. - The slow log contained many entries of
Last_errno: 1161
, indicating that the SQL had already completed execution in MySQL, but while sending data to the client, the data volume exceeded the client’s TCP buffer. The application on the client side was unable to process the cached data within 60 seconds, leading to a timeout in data transmission from MySQL to the client. - Adjusting the
net_write_timeout
parameter at the MySQL level could only alleviate this phenomenon. The root cause was that the amount of data retrieved by individual SQL queries was too large, exceeding the client’s buffer size. The application was unable to process the cached data promptly, resulting in subsequent data transmission timeouts.
3. Mysql performance optimization suggestions
- Implement batch processing of data at the business layer to avoid retrieving large amounts of data from the server in a single SQL query, which could lead to insufficient TCP buffer on the client side.
- Increasing the
net_write_timeout
parameter in MySQL or enhancing the TCP buffer on the client side may alleviate this situation but will not resolve the underlying problem, as excessive data volume will still impact performance and stability. - Optimize SQL statements to reduce unnecessary data returns, such as using
LIMIT
,WHERE
, and other conditions, or employing aggregation and grouping functions to decrease data volume and improve query efficiency.