Troubleshooting and Resolving TCP Communication Issues in a Ceph Cluster

Phenomenon:

Recently, the Ceph cluster hit an issue where, after restarting all OSD processes on a node, one or two PGs would reliably get stuck in the peering or activating state. After roughly 900 seconds the connection was re-established and the cluster recovered to a normal state on its own. The OSD and messenger (ms) logs showed that the messages were not blocked at the software layer: the sendmsg system call had already been invoked to push the messages out, yet the peer never received them.
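The stuck PGs can be seen directly from the cluster status (a minimal sketch; exact output depends on the Ceph release, and the PG id below is a placeholder):

# Show which PGs are unhealthy and why
ceph health detail
# List PGs stuck in an inactive state such as peering or activating
ceph pg dump_stuck inactive
# Query one stuck PG in detail (1.2f is a placeholder PG id)
ceph pg 1.2f query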

Diagnosis Process:

The TCP connection information (tcp_info) of the sender and the receiver was filtered with the command ss -ipnome | grep ceph-osd:

Sender:

2020-06-13 14:56:03 ESTAB,0,139680,2020::16:3109:6820,2020::16:3100:48608,users:(("ceph-osd",pid=55696,fd=70)),rtt:1.445/0.881,lastsnd:66550,lastrcv:539,lastack:539
2020-06-13 15:01:03 ESTAB,0,142560,2020::16:3109:6820,2020::16:3100:48608,users:(("ceph-osd",pid=55696,fd=70)),rtt:1.445/0.881,lastsnd:121377,lastrcv:1357,lastack:1357
2020-06-13 15:06:04 ESTAB,0,144000,2020::16:3109:6820,2020::16:3100:48608,users:(("ceph-osd",pid=55696,fd=70)),rtt:1.445/0.881,lastsnd:50178,lastrcv:2170,lastack:2170

Receiver:

2020-06-13 14:56:03 ESTAB,0,0,2020::16:3100:48608,2020::16:3109:6820,users:(("ceph-osd",pid=40468,fd=597)),rtt:1.478/2.851,lastsnd:977,lastrcv:281805,lastack:977
2020-06-13 15:01:04 ESTAB,0,0,2020::16:3100:48608,2020::16:3109:6820,users:(("ceph-osd",pid=40468,fd=597)),rtt:0.492/0.88,lastsnd:1835,lastrcv:582713,lastack:1835
2020-06-13 15:06:05 ESTAB,0,0,2020::16:3100:48608,2020::16:3109:6820,users:(("ceph-osd",pid=40468,fd=597)),rtt:2.232/4.311,lastsnd:2659,lastrcv:883587,lastack:2659

It can be observed that the sender's Send-Q does not decrease and even keeps growing. For an established connection, Send-Q is the number of bytes held in the socket send buffer within the current sliding window; the send buffer is only drained once the peer's ACK arrives, so the sender evidently never received an ACK from the peer.
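A simple way to track this over time is to sample the connection periodically (a sketch; depending on the ss version the tcp_info fields may appear on a continuation line, hence the grep -A1):

# Sample Send-Q/Recv-Q, rtt and the last-send/last-ack timers of the OSD sockets every 5 s
watch -n 5 'ss -ipnome | grep -A1 ceph-osd'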

Typically, a TCP sender's Send-Q stops draining because the receiver's receive buffer is full and it advertises a zero window (win=0) back to the sender; the sender then stops transmitting and Send-Q stays constant. However, the receiver's tcp_info above shows its receive buffer is empty (Recv-Q is 0), so that is not the problem here. Another possibility is that the receiver's receive thread is stalled, but in that case unread data would accumulate and the receive buffer would not be zero. The receiver's OSD ms logs also show that its message-processing threads are not stalled, so this assumption is invalid as well.
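Zero-window backpressure can also be ruled out from the kernel's SNMP counters rather than per-socket state (a sketch using nstat from iproute2; counter availability varies by kernel version):

# On the receiver: if TCPToZeroWindowAdv keeps increasing, it really is advertising win=0
nstat -az TcpExtTCPToZeroWindowAdv TcpExtTCPFromZeroWindowAdv TcpExtTCPWantZeroWindowAdv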

Packet Capture Reproduction:

The problem was reproduced many times with packet capture. Because the problematic cluster is large (40 nodes with 14 drives per node), continuously capturing traffic on every network card was not feasible; capture had to be started quickly after an issue occurred, which means only traffic from after the failure could be captured. The following is the packet information captured on the receiver (sender: 2020::16:3109, receiver: 2020::16:3100):

27 2020-06-13 15:03:08.503568 2020::16:3100 2020::16:3109 TCP 119 48608 > 6820 [PSH, ACK] Seq=367183868 Ack=798178296 Win=110 Len=9 SLE=798237336 SRE=798257496 SLE=798234456 SRE=798235896 SLE=798231576 SRE=798233016 SLE=798225816 SRE=798230136
28 2020-06-13 15:03:08.503622 2020::16:3109 2020::16:3100 TCP 1514 6820 > 48608 [ACK] Seq=798257496 Ack=367183877 Win=683 Len=1440
29 2020-06-13 15:03:08.503629 2020::16:3100 2020::16:3109 TCP 110 [TCP Dup ACK 27#1] 48608 > 6820 [ACK] Seq=367183877 Ack=798178296 Win=110 Len=0 SLE=798237336 SRE=798258936 SLE=798234456 SRE=798235896 SLE=798231576 SRE=798233016 SLE=798225816 SRE=798230136


70 2020-06-13 15:05:14.524574 2020::16:3100 2020::16:3109 TCP 119 48608 > 6820 [PSH, ACK] Seq=367184057 Ack=798178296 Win=110 Len=9 SLE=798237336 SRE=798258936 SLE=798234456 SRE=798235896 SLE=798231576 SRE=798233016 SLE=798225816 SRE=798230136
71 2020-06-13 15:05:14.524651 2020::16:3109 2020::16:3100 TCP 1514 6820 > 48608 [ACK] Seq=798258936 Ack=367184066 Win=683 Len=1440
72 2020-06-13 15:05:14.524660 2020::16:3100 2020::16:3109 TCP 110 [TCP Dup ACK 70#1] 48608 > 6820 [ACK] Seq=367184066 Ack=798178296 Win=110 Len=0 SLE=798237336 SRE=798260376 SLE=798234456 SRE=798235896 SLE=798231576 SRE=798233016 SLE=798225816 SRE=798230136
73 2020-06-13 15:05:20.525568 2020::16:3100 2020::16:3109 TCP 119 48608 > 6820 [PSH, ACK] Seq=367184066 Ack=798178296 Win=110 Len=9 SLE=798237336 SRE=798260376 SLE=798234456 SRE=798235896 SLE=798231576 SRE=798233016 SLE=798225816 SRE=798230136


110 2020-06-13 15:07:08.543645 2020::16:3109 2020::16:3100 TCP 74 6820 > 48608 [ACK] Seq=798260376 Ack=367184237 Win=683 Len=0
111 2020-06-13 15:07:11.723229 2020::16:3100 2020::16:3109 TCP 110 48608 > 6820 [FIN, ACK] Seq=367184237 Ack=798178296 Win=110 Len=0 SLE=798237336 SRE=798260376 SLE=798234456 SRE=798235896 SLE=798231576 SRE=798233016 SLE=798225816 SRE=798230136
112 2020-06-13 15:07:11.723270 2020::16:3109 2020::16:3100 TCP 74 6820 > 48608 [RST] Seq=798178296 Win=0 Len=0

It can be seen that the sender keeps transmitting packets to the receiver and processes the receiver's packets normally (its ACK numbers advance). However, the receiver keeps returning ack=798178296, while the sender's sequence number has already advanced to 798260376, roughly 82 KB beyond the last byte the receiver has acknowledged; the SACK blocks show that later segments did arrive but the gap starting at 798178296 was never filled. In other words, TCP never retransmitted the missing segment, and communication stalled. Without evidence captured at the moment of failure, it is unclear whether the segment following 798178296 was lost on the wire or was never sent at all, so further packet capture and analysis were needed.
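For the follow-up capture, scoping tcpdump to the single suspect connection keeps the volume manageable and allows capturing from before the failure (a sketch; the interface name and output path are placeholders):

# Capture only the affected OSD connection (IPv6), full packets, for later analysis in Wireshark
tcpdump -i bond0 -s 0 -nn -w /tmp/osd-6820.pcap \
    'ip6 and tcp and host 2020::16:3109 and host 2020::16:3100 and port 6820'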

Adjusting Send Buffer Size:

Initially the server's TCP read and write buffers were suspected, but the maximum send and receive buffer sizes were already very large, so under normal circumstances they should not have any effect. Not knowing the difference between /proc/sys/net/ipv4/tcp_wmem and /proc/sys/net/core/wmem_max, it was mistakenly assumed that the ipv4 directory applied to IPv4 and the core directory to IPv6, and the core directory's send buffer limit was only about 200 KB. So /proc/sys/net/core/wmem_max was set to a very small value for the reproduction attempt (with /proc/sys/net/core/wmem_default synchronized to it), but the problem did not reproduce. Later testing showed that both IPv4 and IPv6 use the configuration under /proc/sys/net/ipv4: tcp_wmem is the autotuning range for TCP send buffers, while core/wmem_max only caps buffer sizes requested explicitly via setsockopt(SO_SNDBUF).
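The two families of knobs can be inspected and changed with sysctl (a sketch; the values shown are examples only, not the ones used at the time):

# TCP send-buffer autotuning range: min, default, max in bytes (shared by IPv4 and IPv6)
sysctl net.ipv4.tcp_wmem
# Limits for buffers requested explicitly via setsockopt(SO_SNDBUF/SO_RCVBUF)
sysctl net.core.wmem_max net.core.wmem_default
# Example of the (unsuccessful) reproduction attempt: shrink the core limits
sysctl -w net.core.wmem_max=65536 net.core.wmem_default=65536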

On a small cluster (three nodes, each with fifteen 5.5 TB drives), /proc/sys/net/ipv4/tcp_wmem was then capped at no more than about 30 KB. After restarting an OSD on one node, peering got stuck. In hindsight this was itself a clue that I missed at the time: no matter how small the TCP send buffer is made, it should only slow message sending down, never stop it. With the send buffer reduced, packet analysis was attempted again, but the traffic volume was so large that tcpdump dropped packets. In two post-failure captures, Wireshark showed that the packets of interest had not been captured on either the sender or the receiver. A TCP problem was still suspected, but there was no further lead.
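A sketch of the reproduction setting and a capture invocation that is less prone to drops (buffer sizes and interface name are assumptions):

# Cap the TCP send buffer at ~30 KB (min, default, max in bytes)
sysctl -w net.ipv4.tcp_wmem='4096 16384 30720'
# Capture with a larger in-kernel buffer (-B, in KiB) and no name resolution to reduce drops
tcpdump -i bond0 -B 262144 -nn -s 0 -w /tmp/osd-repro.pcap 'ip6 and tcp and port 6820'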

Conclusion:

After an exhaustive search on Google, a very similar report was found: the behavior is caused by a TCP SACK bug in the kernel (described at the link below). After moving to a kernel with the fix, the problem has not occurred since. https://access.redhat.com/security/vulnerabilities/tcpsack
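Until a fixed kernel is rolled out, the advisory above documents disabling selective acknowledgements as an interim mitigation (a sketch; turning SACK off can hurt throughput on lossy links, so it is only a stopgap):

# Check whether SACK is enabled (1 = enabled)
sysctl net.ipv4.tcp_sack
# Interim mitigation from the advisory: disable SACK processing system-wide
sysctl -w net.ipv4.tcp_sack=0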
