Problem Description
Recently, a colleague shared a case involving a load balancer communication issue: in a test environment, the primary and backup load balancers failed to establish high availability (HA) when both units were brought up. Investigation showed that the two load balancers could not ping each other. This was puzzling, because both devices were directly connected to the same core switch stack and sat in the same VLAN. Other hosts, whether in the same VLAN or in different VLANs, could ping both load balancers without issue. The problem was confined to traffic between the primary and backup devices.
The network topology is as follows:
Device | IP | MAC
---|---|---
LB-01 | 10.0.0.1 | 4b:00:05:92:01:02
LB-02 | 10.0.0.2 | 4b:00:05:92:00:09
Problem Analysis
Because the topology, the configuration, and the symptoms are all very simple, the verification method is equally clear: capture packets along the path and confirm where they are lost.
LB-01
A brief analysis of the packet capture taken on LB-01:
- LB-01 has a normal ARP entry for LB-02;
- LB-01 sends ICMP Echo Requests but receives no ICMP Echo Replies, so the ping fails;
- LB-01 receives ARP broadcast requests from LB-02 asking for LB-01's MAC address and answers each one with an ARP reply. The requests keep repeating, however, which suggests that LB-02 never receives the replies.
LB-02
A brief analysis of the packet capture taken on LB-02:
- LB-02 has no ARP entry for LB-01;
- LB-02 continuously sends ARP broadcast requests for LB-01's MAC address but never receives an ARP reply from LB-01;
- LB-02 also never receives the ICMP Echo Requests sent by LB-01.
Taken together, the captures from LB-01 and LB-02 indicate that packets are being lost on the intermediate switch: packets sent by LB-01 apparently are not forwarded to LB-02. The next step is to capture packets on the switch itself.
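The comparison of the two endpoint captures can be sketched in code. The following Python is a minimal illustration, not a tool used in the original investigation: `ArpEvent` and `unanswered_requests` are hypothetical names, and the event lists are hand-written stand-ins for the real captures with an illustrative number of exchanges.

```python
from collections import Counter
from typing import NamedTuple


class ArpEvent(NamedTuple):
    op: str      # "request" or "reply"
    sender: str  # IP of the host that sent the ARP packet
    target: str  # IP being asked about (request) or answered (reply)


def unanswered_requests(events: list[ArpEvent]) -> Counter:
    """Count ARP requests per (sender, target) that no later reply answers."""
    pending: Counter = Counter()
    for ev in events:
        if ev.op == "request":
            pending[(ev.sender, ev.target)] += 1
        elif ev.op == "reply" and pending[(ev.target, ev.sender)] > 0:
            # A reply from X to Y settles one outstanding request from Y for X.
            pending[(ev.target, ev.sender)] -= 1
    return +pending  # unary plus drops entries whose count fell to zero


# Illustrative data: what each load balancer sees on its own interface.
LB01, LB02 = "10.0.0.1", "10.0.0.2"
seen_on_lb01 = [ArpEvent("request", LB02, LB01), ArpEvent("reply", LB01, LB02)] * 4
seen_on_lb02 = [ArpEvent("request", LB02, LB01)] * 4

if __name__ == "__main__":
    print("LB-01 capture:", dict(unanswered_requests(seen_on_lb01)))
    print("LB-02 capture:", dict(unanswered_requests(seen_on_lb02)))
```

On LB-01 every request is paired with a reply, while on LB-02 all requests stay unanswered. That asymmetry is what points at the device in between.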
Switch
The switches are H3C S6800 units, two of them stacked with IRF. Ports Te1/0/42 and Te2/0/42 were configured as mirroring source ports, and a ping was then run on LB-02. A brief analysis of the resulting capture:
- Packet 1 is the ARP broadcast request from LB-02, asking for LB-01's MAC address, captured inbound on the port connected to LB-02. This proves that the switch receives it;
- Packet 2 is the same ARP broadcast request captured outbound on the port connected to LB-01. This proves that the switch forwards it;
- Packet 3 is LB-01's unicast ARP reply to LB-02, captured inbound on the port connected to LB-01. This proves that the switch receives it;
- However, the switch never forwards packet 3 to LB-02;
- This sequence repeats four times, as the ping keeps triggering ARP requests.
After a case was opened with H3C, the same behavior was confirmed with the following traffic accounting configuration. Four ARP reply packets were matched inbound on port Te1/0/42, but none were matched outbound on port Te2/0/42.
acl mac 4000
 rule 0 permit type 0806 ffff source-mac 4b00-0592-0102 ffff-ffff-ffff dest-mac 4b00-0592-0009 ffff-ffff-ffff
 rule 1 permit type 0806 ffff source-mac 4b00-0592-0009 ffff-ffff-ffff dest-mac 4b00-0592-0102 ffff-ffff-ffff
#
traffic classifier arp operator and
 if-match acl mac 4000
#
traffic behavior arp
 accounting packet
#
qos policy arp
 classifier arp behavior arp
#
interface ten-GigabitEthernet1/0/42
 qos apply policy arp inbound
 qos apply policy arp outbound
#
interface ten-GigabitEthernet2/0/42
 qos apply policy arp inbound
 qos apply policy arp outbound
#
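The MAC addresses in ACL 4000 are the ones from the topology table, rewritten in the hyphenated `xxxx-xxxx-xxxx` form used in the rules above. As a convenience, a small Python helper can do the conversion and render the rule lines. This is only a sketch: `to_h3c_mac` and `arp_rule` are hypothetical names, and the output simply reproduces the rule syntax shown in this article.

```python
HEX_DIGITS = set("0123456789abcdef")


def to_h3c_mac(mac: str) -> str:
    """Convert '4b:00:05:92:01:02' (or a similar form) to '4b00-0592-0102'."""
    digits = mac.lower().replace(":", "").replace("-", "").replace(".", "")
    if len(digits) != 12 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return "-".join(digits[i:i + 4] for i in range(0, 12, 4))


def arp_rule(rule_id: int, src_mac: str, dst_mac: str) -> str:
    """Render one rule line matching ARP frames (EtherType 0806) from src to dst."""
    full_mask = "ffff-ffff-ffff"  # match every bit of the address
    return (
        f"rule {rule_id} permit type 0806 ffff "
        f"source-mac {to_h3c_mac(src_mac)} {full_mask} "
        f"dest-mac {to_h3c_mac(dst_mac)} {full_mask}"
    )


if __name__ == "__main__":
    lb01, lb02 = "4b:00:05:92:01:02", "4b:00:05:92:00:09"
    print(arp_rule(0, lb01, lb02))  # LB-01 -> LB-02
    print(arp_rule(1, lb02, lb01))  # LB-02 -> LB-01
```

Because both rules match unicast destination MACs, the ARP broadcast requests are not counted; only the unicast ARP replies between the two load balancers are.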
<Switch>dis qos policy interface ten-g1/0/42
Interface: Ten-GigabitEthernet1/0/42
  Direction: Inbound
  Policy: arp
   Classifier: arp
     Operator: AND
     Rule(s) :
      If-match acl mac 4000
     Behavior: arp
      Accounting enable:
        4 (Packets)
Interface: Ten-GigabitEthernet1/0/42
  Direction: Outbound
  Policy: arp
   Classifier: arp
     Operator: AND
     Rule(s) :
      If-match acl mac 4000
     Behavior: arp
      Accounting enable:
        0 (Packets)
<Switch>dis qos policy interface ten-g2/0/42
Interface: Ten-GigabitEthernet2/0/42
  Direction: Inbound
  Policy: arp
   Classifier: arp
     Operator: AND
     Rule(s) :
      If-match acl mac 4000
     Behavior: arp
      Accounting enable:
        0 (Packets)
Interface: Ten-GigabitEthernet2/0/42
  Direction: Outbound
  Policy: arp
   Classifier: arp
     Operator: AND
     Rule(s) :
      If-match acl mac 4000
     Behavior: arp
      Accounting enable:
        0 (Packets)
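With more interfaces or repeated runs, reading these counters by eye gets tedious. The following Python sketch extracts the accounting counters from output shaped like the listing above. `parse_accounting` is a hypothetical helper, and the parsing assumes exactly the line layout shown here, which may differ on other software versions.

```python
import re

# Abbreviated copy of the output above, used as sample input.
SAMPLE_OUTPUT = """\
Interface: Ten-GigabitEthernet1/0/42
Direction: Inbound
Policy: arp
Accounting enable:
4 (Packets)
Interface: Ten-GigabitEthernet1/0/42
Direction: Outbound
Policy: arp
Accounting enable:
0 (Packets)
"""


def parse_accounting(output: str) -> dict[tuple[str, str], int]:
    """Map (interface, direction) to the packet counter that follows it."""
    counters: dict[tuple[str, str], int] = {}
    interface = direction = None
    for raw in output.splitlines():
        line = raw.strip()  # tolerate any indentation
        if line.startswith("Interface:"):
            interface = line.split(":", 1)[1].strip()
        elif line.startswith("Direction:"):
            direction = line.split(":", 1)[1].strip()
        else:
            match = re.fullmatch(r"(\d+) \(Packets\)", line)
            if match and interface and direction:
                counters[(interface, direction)] = int(match.group(1))
    return counters


if __name__ == "__main__":
    for (interface, direction), packets in parse_accounting(SAMPLE_OUTPUT).items():
        print(f"{interface:<28} {direction:<9} {packets}")
```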
Summary of the Load Balancer Communication Issue
After H3C TAC and R&D collected traffic statistics and ran fault diagnostics, the problem was tentatively attributed to a bug in the switch's software version. After the switch was moved to a commonly used stable version and rebooted, the primary and backup load balancers resumed normal communication.
This kind of problem is rare, but it is relatively simple to troubleshoot and locate. The key is to pin down the packet loss point methodically: the source and destination must be shown to send and respond, and each intermediate device must be shown to receive and forward.
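The same reasoning can be written down as a tiny routine: list the observation points along the path in order, record whether the packet was seen at each one, and report the first place where it disappears. The Python below is a sketch of that idea only. `first_loss_point` is a hypothetical name, and the observations are filled in by hand from the ARP reply path described in this case.

```python
from typing import Optional


def first_loss_point(path: list[tuple[str, bool]]) -> Optional[tuple[str, str]]:
    """Return (last point that saw the packet, first point that did not).

    `path` lists observation points in forwarding order, each paired with
    whether the packet was observed there. Returns None if nothing was lost.
    """
    for (prev_name, prev_seen), (name, seen) in zip(path, path[1:]):
        if prev_seen and not seen:
            return prev_name, name
    return None


# ARP reply from LB-01 to LB-02, as observed in this case.
arp_reply_path = [
    ("LB-01 sends ARP reply", True),
    ("switch Te1/0/42 inbound", True),
    ("switch Te2/0/42 outbound", False),
    ("LB-02 receives ARP reply", False),
]

if __name__ == "__main__":
    print(first_loss_point(arp_reply_path))
```

For this data the routine returns the pair `("switch Te1/0/42 inbound", "switch Te2/0/42 outbound")`, which places the loss inside the switch stack.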