Mysql Performance Optimization

,

MySQL Performance Optimization Case – High TCP Cache Configuration Causing a Large Number of MySQL Connection Interruptions.

1. Background

During the execution of batch tasks, the application encountered an issue: some tasks suddenly lost their database connections, preventing them from completing. From the database error logs, we found messages indicating “Aborted connection,” which means the communication between the client and server was unexpectedly interrupted.

2. Analysis

To identify the cause of the problem, we initially analyzed several common situations that could lead to a connection being aborted based on our experience:

  1. The client did not properly close the connection and did not call the mysql_close() function.
  2. The client was idle for a time exceeding the seconds specified by the wait_timeout or interactive_timeout parameters, causing the server to automatically disconnect.
  3. The size of the data packets sent or received by the client exceeded the value of the max_allowed_packet parameter, resulting in a connection interruption.
  4. The client attempted to access the database but lacked the necessary permissions, used the wrong password, or the connection packet did not contain the correct information.

However, upon investigation, we found that none of the above situations applied to the current issue. The tasks had been running normally before, and the program had not changed, allowing us to rule out the first situation. We checked MySQL’s timeout parameters wait_timeout and interactive_timeout and found them both set to 28800 seconds (8 hours), which far exceeded the execution time of the tasks, allowing us to rule out the second situation as well. We also examined the max_allowed_packet parameters for both the client and server, finding them both set to 64M, making it unlikely that this limit was exceeded, so we ruled out the third situation. Furthermore, we confirmed that the client’s database access permissions, password, and connection packet information were all correct, ruling out the fourth situation.

At this point, we preliminarily felt that there should be no issues at the MySQL level, and the problem might lie elsewhere.

To further pinpoint the issue, we attempted to modify some relevant kernel parameters on the server as follows:

These parameters were primarily aimed at optimizing network connection performance and stability to prevent connections from being unexpectedly closed or timing out. However, the modified results did not improve the situation; connections continued to be unexpectedly interrupted.

Finally, we conducted packet capture analysis using the Wireshark tool, where we observed an unusual phenomenon: the server was sending a large number of ACK packets to the client.

mysql performance optimization1

These ACK packets are acknowledgment packets in the TCP protocol, indicating that the server has received the client’s data packets and is requesting the client to continue sending data. But why would the server send so many ACK packets? We speculated that there might be network anomalies preventing the client from receiving the server’s ACK packets, causing the server to repeatedly send ACK packets until it timed out or received a response from the client. However, after the network team conducted checks, no significant issues were found.

Continuing with the packet analysis, we discovered another unusual phenomenon: the client was sending some window warnings to the server. These window warnings are part of TCP’s flow control mechanism, indicating that the receiving window of either the server or the client is full and cannot accept more data.

mysql performance optimization2

[TCP Window Full] is a window warning sent from the sender to the receiver, indicating that the receiving end has reached its data limit.

[TCP ZeroWindow] is a window warning sent from the receiver to the sender, informing the sender that the receiving window is full and temporarily halting data transmission.

Based on the above information, we deduced the problem’s cause: since MySQL needed to send a large amount of data, the client’s TCP buffer became full, requiring it to wait until it could process the data in the TCP buffer before continuing to receive more data. However, during this time, MySQL would continuously request the client to send more data. If the client did not respond within a certain period (default is 60 seconds), MySQL would assume that the data transmission had timed out, resulting in a connection interruption.

To validate our hypothesis, we examined MySQL’s slow log and discovered numerous entries with Last_errno: 1161.

These entries indicated that MySQL encountered a timeout error while sending data, and the frequency of these errors closely matched the number of failed tasks in the application. According to the MySQL official documentation, this error signifies:

This indicates a network write interruption, and there is a MySQL parameter that controls this behavior. Therefore, we attempted to change the net_write_timeout parameter to 600 seconds, and the batch tasks ran normally.

Thus, the reason for the MySQL connection being unexpectedly interrupted was that the client was retrieving too much data, exceeding its TCP buffer. The client needed to process the data in the buffer first. During this time, MySQL would continuously request the client to send more data, but the client was unable to respond within 60 seconds, leading to a timeout in data transmission and the interruption of the connection.

2. Conclusion

Through the analysis and attempts described above, we arrived at the following conclusions:

  1. The numerous ACK messages in the packet capture were due to the client’s buffer being full and unable to promptly provide feedback to the server, causing the server to repeatedly send ACK messages until exceeding 60 seconds (the default value for net_write_timeout), resulting in MySQL terminating the connection.
  2. The slow log contained many entries of Last_errno: 1161, indicating that the SQL had already completed execution in MySQL, but while sending data to the client, the data volume exceeded the client’s TCP buffer. The application on the client side was unable to process the cached data within 60 seconds, leading to a timeout in data transmission from MySQL to the client.
  3. Adjusting the net_write_timeout parameter at the MySQL level could only alleviate this phenomenon. The root cause was that the amount of data retrieved by individual SQL queries was too large, exceeding the client’s buffer size. The application was unable to process the cached data promptly, resulting in subsequent data transmission timeouts.

3. Mysql performance optimization suggestions

  1. Implement batch processing of data at the business layer to avoid retrieving large amounts of data from the server in a single SQL query, which could lead to insufficient TCP buffer on the client side.
  2. Increasing the net_write_timeout parameter in MySQL or enhancing the TCP buffer on the client side may alleviate this situation but will not resolve the underlying problem, as excessive data volume will still impact performance and stability.
  3. Optimize SQL statements to reduce unnecessary data returns, such as using LIMIT, WHERE, and other conditions, or employing aggregation and grouping functions to decrease data volume and improve query efficiency.