How to Solve the Server TIME_WAIT States Problem

What Are Server TIME_WAIT States?

In routine server maintenance, the following command is frequently used to check how many connections are in each TCP state, including TIME_WAIT. Recognizing and addressing these common connection issues is essential for good server performance.

netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'    

It will display information such as the following:

TIME_WAIT 814
CLOSE_WAIT 1
FIN_WAIT1 1
ESTABLISHED 634
SYN_RECV 2
LAST_ACK 1
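
For readers who prefer a script to the awk one-liner, here is a minimal Python sketch of the same counting logic. The function name count_tcp_states and the sample addresses are made up for illustration; the real input would be the text printed by netstat -n.

```python
from collections import Counter

def count_tcp_states(netstat_output: str) -> Counter:
    """Count connections per TCP state, mirroring the awk one-liner."""
    states = Counter()
    for line in netstat_output.splitlines():
        # Like /^tcp/ in awk: only lines that start with "tcp" (tcp, tcp6).
        if line.startswith("tcp"):
            # The state is the last whitespace-separated column ($NF in awk).
            states[line.split()[-1]] += 1
    return states

# Hypothetical sample in the usual netstat -n layout.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:80             203.0.113.7:51234       TIME_WAIT
tcp        0      0 10.0.0.5:80             203.0.113.8:40022       ESTABLISHED
tcp        0      0 10.0.0.5:80             203.0.113.9:40100       TIME_WAIT
udp        0      0 0.0.0.0:68              0.0.0.0:*
"""

for state, count in count_tcp_states(sample).items():
    print(state, count)
```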

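The three states you will see most often are: ESTABLISHED, meaning the connection is open and communicating; TIME_WAIT, meaning this side closed the connection actively and is waiting; and CLOSE_WAIT, meaning the other side closed first (a passive close) and this side has not yet closed.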

There is no need to explain every state here; the following diagram shows them. Note that the “server” here means the party that receives and processes the business request:

[Figure: Server TIME_WAIT States (TCP connection state diagram)]

You don’t need to remember all of these states; just understand the three most common ones mentioned above. Generally, you won’t check the network status unless you have to. When a server misbehaves, 80 to 90 percent of the time it is one of the following two situations:

1. The server maintains a lot of TIME_WAIT states

2. The server maintains a lot of CLOSE_WAIT states

The number of file handles Linux allocates to a user is limited. A connection stuck in CLOSE_WAIT keeps holding its file handle; a connection in TIME_WAIT has already given its handle back but still occupies a local port and kernel memory. Either way, resources are held without doing any work: they are “occupying the toilet without doing anything”. Once the handle limit is reached, new requests cannot be processed, and you get a flood of Too Many Open Files exceptions and a crashed web server.
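
The handle limit mentioned here can be inspected from inside a process. The following is a minimal sketch, assuming a Unix-like system for the resource module and Linux for the /proc/self/fd listing; the function name fd_usage is made up for illustration.

```python
import os
import resource

def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Linux-specific: each entry in /proc/self/fd is one open descriptor.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None  # /proc is not available on this platform
    return open_fds, soft, hard

open_fds, soft, hard = fd_usage()
print(f"open descriptors: {open_fds}, soft limit: {soft}, hard limit: {hard}")
```

When the number of open descriptors approaches the soft limit, new accept() or connect() calls start failing with the Too Many Open Files error.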

Let’s discuss how to handle these two situations. Many materials on the Internet lump them together and assume that tuning kernel parameters solves both. That is not right. Tuning kernel parameters may well take care of TIME_WAIT, but CLOSE_WAIT has to be fixed in the program itself. Let’s look at the two situations separately:

How to Manage Server TIME_WAIT States

The server maintains a large number of TIME_WAIT states

This situation is quite common. Some crawler servers or web servers (if the network administrator did not tune the kernel parameters during installation) often run into it. How does this problem arise?

From the diagram above, we can see that the TIME_WAIT state is held by the party that actively closes the connection. A crawler server is itself the “client”: after completing a crawling task it actively closes the connection and enters the TIME_WAIT state, stays there for 2MSL (MSL is the maximum segment lifetime), and only then fully closes and releases the resources. Why keep the resources around for a while after actively closing the connection? This is stipulated by the designers of TCP/IP, mainly for the following two reasons:

  1. Prevent delayed packets from the previous connection from reappearing and affecting a new connection (after 2MSL, all duplicate packets from the previous connection will have disappeared).
  2. Close the TCP connection reliably. The final ACK that the active closer sends in response to the peer’s FIN may be lost, in which case the passive side resends its FIN. If the active side were already in the CLOSED state, it would answer with RST instead of ACK. Therefore the active side must stay in the TIME_WAIT state rather than CLOSED. In addition, TIME_WAIT sockets are reclaimed on a timer, so they do not occupy many resources unless a large number of requests arrive in a short period or the server is under attack.
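
To see which side ends up in TIME_WAIT, here is a minimal Python sketch, assuming Linux (it reads /proc/net/tcp) and a usable loopback interface. The helper names state_of_local_port and active_close_demo are made up for illustration.

```python
import socket
import time

# State codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def state_of_local_port(port):
    """Return the TCP state of the socket bound to a local IPv4 port, or None."""
    try:
        with open("/proc/net/tcp") as f:
            rows = f.read().splitlines()[1:]  # skip the header row
    except OSError:
        return None  # not Linux, or /proc is unavailable
    for row in rows:
        fields = row.split()
        if int(fields[1].split(":")[1], 16) == port:
            return TCP_STATES.get(fields[3], fields[3])
    return None

def active_close_demo():
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    client_port = client.getsockname()[1]

    client.close()   # active close: the client sends the first FIN
    conn.recv(1)     # the server sees end-of-stream (returns b"")
    conn.close()     # passive close: the server sends its own FIN
    listener.close()

    # Poll briefly so the kernel can finish the four-way close.
    state = None
    for _ in range(20):
        state = state_of_local_port(client_port)
        if state == "TIME_WAIT" or state is None:
            break
        time.sleep(0.05)
    return state

print("state of the active closer:", active_close_demo())
```

The accepted server-side socket disappears as soon as the close completes; only the side that closed first lingers.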

Regarding MSL, I quote the following passage:

  1. MSL is the maximum time a TCP segment (a block of TCP network packets) can survive on the network on its way from source to destination. RFC 793, which defines TCP, dates from 1981, when networks were far slower than today’s Internet. Can you imagine waiting 4 minutes for the first byte to appear after entering a URL in the browser? In the current network environment this is almost impossible. Therefore, we can significantly reduce the duration of the TIME_WAIT state so that ports are freed for other connections more quickly.

Let me quote another passage from an online resource:

  1. It is worth mentioning that for HTTP, which runs over TCP, it is the server that closes the TCP connection, so it is the server that enters the TIME_WAIT state. As you can imagine, a web server with a large number of visits will accumulate a large number of TIME_WAIT sockets. If the server receives 1000 requests per second, there will be a backlog of 240*1000=240,000 TIME_WAIT records, and maintaining them burdens the server. Of course, modern operating systems use fast lookup algorithms to manage these TIME_WAIT sockets, so checking whether a new TCP connection request hits one does not take long, but having so many states to maintain is still not good.
  2. HTTP version 1.1 makes Keep-Alive the default behavior, which means one TCP connection is reused to carry multiple requests/responses. This problem was one of the main reasons.
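
The backlog figure in the quote is simply the close rate multiplied by how long each socket stays in TIME_WAIT. A small sketch of that arithmetic follows; the function name is made up, and the 60-second case reflects the fixed TIME_WAIT length that Linux uses rather than anything stated in the quote.

```python
def time_wait_backlog(closes_per_second: int, time_wait_seconds: int) -> int:
    """Steady-state number of TIME_WAIT sockets: rate multiplied by duration."""
    return closes_per_second * time_wait_seconds

# The quoted example: 1000 requests per second, 2MSL = 240 seconds.
print(time_wait_backlog(1000, 240))
# Same load with a 60-second TIME_WAIT (assumption: Linux's fixed value).
print(time_wait_backlog(1000, 60))
```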

That is to say, the HTTP interaction differs from the one shown in the diagram above. It is the server, not the client, that closes the connection, so a large number of TIME_WAIT sockets will also build up on a web server.
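
The effect of Keep-Alive can be observed with the Python standard library alone. This is a minimal sketch, not a benchmark: it starts a throwaway HTTP/1.1 server on a loopback port, sends three requests over one client connection, and records the client port the server saw each time. The handler name and the seen_client_ports list are made up for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_client_ports = []

class KeepAliveHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enables persistent connections in http.server

    def do_GET(self):
        seen_client_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), KeepAliveHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
for _ in range(3):
    conn.request("GET", "/")
    conn.getresponse().read()  # read the body fully so the connection can be reused
conn.close()
server.shutdown()
server.server_close()

print(len(seen_client_ports), "requests over",
      len(set(seen_client_ports)), "TCP connection(s)")
```

Three requests share one connection, so only one socket has to pass through TIME_WAIT instead of three.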

Now let’s talk about how to solve this problem.

The solution is simple: let the server quickly recycle and reuse those TIME_WAIT resources.

Here are my /etc/sysctl.conf file modifications:

# For a new connection, how many SYN connection requests the kernel sends before deciding to give up. It should not be greater than 255. The default value is 5, which corresponds to about 180 seconds.
net.ipv4.tcp_syn_retries=2
#net.ipv4.tcp_synack_retries=2
# When keepalive is enabled, how long a connection must be idle before TCP starts sending keepalive probes. The default is 7200 seconds (2 hours); change it to 1200 seconds.
net.ipv4.tcp_keepalive_time=1200
net.ipv4.tcp_orphan_retries=3
# If the socket is closed by the local end, this parameter determines how long it remains in the FIN-WAIT-2 state.
net.ipv4.tcp_fin_timeout=30
# Length of the SYN queue. The default is 1024; increasing it to 4096 accommodates more connections waiting to be established.
net.ipv4.tcp_max_syn_backlog = 4096
# Turn on SYN cookies. When the SYN queue overflows, cookies are used to handle new requests, which protects against small SYN attacks. A value of 0 turns them off.
net.ipv4.tcp_syncookies = 1

# Turn on reuse: allow TIME-WAIT sockets to be reused for new outgoing TCP connections (requires TCP timestamps). A value of 0 turns it off.
net.ipv4.tcp_tw_reuse = 1
# Turn on fast recycling of TIME-WAIT sockets. A value of 0 turns it off. Note: this parameter was removed in Linux 4.12, so omit this line on newer kernels.
net.ipv4.tcp_tw_recycle = 1

# Reduce the number of keepalive probes sent before the connection is considered dead
net.ipv4.tcp_keepalive_probes=5
# Enlarge the network device receive queue
net.core.netdev_max_backlog=3000

After making the changes, run /sbin/sysctl -p to make the parameters take effect.
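
To confirm which values are actually in effect, the parameters can be read back from /proc/sys. The following is a minimal sketch, assuming Linux; the helper names sysctl_path and read_sysctl are made up for illustration. A parameter that the running kernel does not provide is reported as None.

```python
from pathlib import Path

PARAMS = [
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_tw_recycle",
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_keepalive_time",
]

def sysctl_path(name: str) -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path("/proc/sys") / name.replace(".", "/")

def read_sysctl(name: str):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(name).read_text().strip()
    except OSError:
        return None

for name in PARAMS:
    print(f"{name} = {read_sysctl(name)}")
```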

The main parameters to note here are:

net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_recycle
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_keepalive_*

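The purpose of enabling net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle is to let the kernel reclaim sockets in the TIME_WAIT state sooner. Be careful with net.ipv4.tcp_tw_recycle: it can drop connections from clients behind NAT, and it no longer exists in kernels from Linux 4.12 onward.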

net.ipv4.tcp_fin_timeout can reduce the time it takes for the server to switch from FIN-WAIT-2 to TIME_WAIT under abnormal circumstances.

net.ipv4.tcp_keepalive_* is a series of parameters used to set the server's configuration for detecting connection survival.
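
The net.ipv4.tcp_keepalive_* values are system-wide defaults; a program can also set them per socket. Here is a minimal sketch, assuming Linux option names. The function name enable_keepalive is made up; idle=1200 and probes=5 mirror the values in the file shown earlier, while interval=75 is an assumed value that the article does not set.

```python
import socket

def enable_keepalive(sock, idle=1200, interval=75, probes=5):
    """Turn on TCP keepalive for one socket and override the kernel defaults."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options are Linux names; skip any that this platform lacks.
    for opt_name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before giving up
    ):
        opt = getattr(socket, opt_name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print("keepalive enabled:", bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
sock.close()
```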

The server maintains a large number of CLOSE_WAIT states

Take a break and take a breath. At first, I just planned to talk about the difference between TIME_WAIT and CLOSE_WAIT, but I didn’t expect that the more I dug, the deeper I got. This is also the advantage of writing a blog summary. There can always be unexpected gains.

The TIME_WAIT situation can be solved by tuning server parameters, because TIME_WAIT is under the server’s own control: it arises either because the other party’s connection behaved abnormally or because the server did not recycle resources quickly enough. In short, it is not caused by a program error.

But CLOSE_WAIT is different. As can be seen from the figure above, a connection stays in the CLOSE_WAIT state in only one situation: after the other party closes the connection, the server program never closes its own end, so its FIN is never sent. In other words, after the other party closes the connection, the program does not detect it, or the program simply forgets that the connection needs to be closed at this point, so the resource stays occupied by the program. I think this situation cannot be solved with kernel parameters. The server has no right to reclaim resources held by the program unless the program is terminated.
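
In code, the fix is to notice the peer’s close and close the local end too. The following is a minimal Python sketch over a loopback connection; the function name serve_until_peer_closes is made up for illustration. An empty result from recv() is how a program learns that the other party has closed.

```python
import socket

def serve_until_peer_closes(conn):
    """Read until the peer closes, then close our end so we leave CLOSE_WAIT."""
    received = b""
    try:
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # b"" means the peer sent FIN; we are now in CLOSE_WAIT
                break
            received += chunk
    finally:
        conn.close()       # sends our FIN; skipping this is what leaks CLOSE_WAIT
    return received

listener = socket.create_server(("127.0.0.1", 0))
client = socket.create_connection(listener.getsockname())
conn, _ = listener.accept()
client.sendall(b"hello")
client.close()             # the peer closes first
print(serve_until_peer_closes(conn))
listener.close()
```

The try/finally ensures the socket is closed even if reading raises an exception partway through.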

If you are using HttpClient and see a lot of CLOSE_WAIT states, then this blog post may be useful to you: http://blog.csdn.net/shootyou/article/details/6615051

The author of that post gives a scenario that illustrates the difference between CLOSE_WAIT and TIME_WAIT, and I will describe it again here:

Server A is a crawler server. It uses a simple HttpClient to request Apache on resource server B to obtain file resources. Under normal circumstances, if the request is successful, then after crawling the resources, server A will actively send a request to close the connection. At this time, it is actively closing the connection. We can see that the connection status of server A is TIME_WAIT. What if an exception occurs? Assuming that the requested resource does not exist on server B, then server B will send a request to close the connection at this time. Server A is passively closed. If the programmer forgets to let HttpClient release the connection after server A passively closes the connection, it will cause the CLOSE_WAIT status.
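
The same discipline applies on the client side. The scenario uses Java’s HttpClient; as an analogous sketch in Python (not the code from that post), the client below always releases its connection in a finally block, even when the server answers 404 and closes first. NotFoundHandler and fetch_status are hypothetical names, and the server is a throwaway stand-in for server B.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class NotFoundHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # http.server speaks HTTP/1.0 by default, so it closes after responding.
        self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet

def fetch_status(host, port, path):
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        conn.close()  # always release the socket, even on errors or non-200 replies

server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(fetch_status("127.0.0.1", server.server_address[1], "/missing"))
server.shutdown()
server.server_close()
```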

Conclusion

Effectively managing Server TIME_WAIT States is crucial for maintaining optimal server performance. By understanding the causes and implications of these states, administrators can implement strategic solutions to recycle and reuse resources efficiently. Addressing issues related to the Server TIME_WAIT States not only prevents resource exhaustion but also enhances the server’s ability to handle incoming requests. Regular monitoring and configuration adjustments are essential practices that contribute to a stable and high-performing server environment.