1. Symptoms
Today’s “patient” is a major district bank in a central city, reporting intermittent, periodic disruptions that recur roughly every 10 minutes. During each episode, business at multiple branch offices in the bank’s jurisdiction is affected: connections drop briefly and then come back slow. The problem had persisted for two days, and the network administrators suspected a router fault. They had tried switching between the main router and the backup local-settlement router, but the problem remained.
2. Diagnostic Process
We visited the “patient’s” data center and gathered information from the network management team, which matched the reported symptoms. Given the recurring nature of the problem and past experience, we suspected trouble on the router links. During an interruption, ordinary Ping tests failed, but such interruptions had occurred before and had always cleared quickly on their own.
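The first step in a case like this is simply a timestamped connectivity probe that can be lined up against the roughly 10-minute fault cycle. The sketch below approximates with the operating system’s ping command what the F683 did for us; the router address and the Linux-style ping flags are illustrative assumptions, not values from the actual site.

```python
import subprocess
import time
from datetime import datetime

TARGET = "10.1.1.1"   # hypothetical address of the suburban county router
INTERVAL = 5          # seconds between probes

# Send one echo request at a time and log the result with a timestamp, so
# that any loss of connectivity can later be lined up against the
# ~10-minute fault cycle. The ping flags shown are the Linux variants.
while True:
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", TARGET],
        capture_output=True,
        text=True,
    )
    status = "OK" if result.returncode == 0 else "NO REPLY"
    print(f"{datetime.now().isoformat(timespec='seconds')}  {TARGET}  {status}")
    time.sleep(INTERVAL)
```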
Based on the fault reports logged by phone, router interruptions had been reported not only within the city but also on remote networks across the district. Because the disruption recurred roughly every 10 minutes, it was easy to catch in the act. We decided to run continuous Ping tests against a selected router to watch its connectivity and how it related to the timing of the faults. For this we connected an F683 network testing tool to the data center network and ran continuous ICMP Ping tests against a suburban county router that had previously reported problems. The response time was 9 ms, indicating acceptable quality. Three minutes later users reported problems, yet the tester still showed normal operation, suggesting that the particular router link we were watching was fine. To investigate further, we changed our approach and ran ICMP monitoring on the router used by the users who were reporting the problems. This captured a large number of destination-unreachable records along with source quench, echo request, and echo reply frames. After about 20 seconds, a significant number of redirect frames appeared, the rate of destination-unreachable frames fell, and source quench, echo request, and echo reply frames appeared in abundance.
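The frame types above correspond to standard ICMP message types (3 destination unreachable, 4 source quench, 5 redirect, 8/0 echo request and reply). As a rough stand-in for the F683’s ICMP monitoring, the following sketch counts those message types on the segment; the use of scapy is an assumption, and capturing requires root privileges.

```python
# Rough stand-in for ICMP monitoring on the affected segment; scapy is an
# assumed tool here, not what was actually used (the F683 does this itself).
from collections import Counter
from scapy.all import sniff, ICMP

# Standard ICMP type numbers for the frames discussed above.
TYPE_NAMES = {
    0: "echo reply",
    3: "destination unreachable",
    4: "source quench",
    5: "redirect",
    8: "echo request",
}

counts = Counter()

def tally(packet):
    if ICMP in packet:
        icmp_type = packet[ICMP].type
        counts[TYPE_NAMES.get(icmp_type, f"type {icmp_type}")] += 1

# Capture ICMP traffic for 20 seconds, then print the distribution of types.
sniff(filter="icmp", prn=tally, timeout=20)
for name, count in counts.most_common():
    print(f"{name}: {count}")
```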
These records pointed to major changes in the router’s dynamic routing table while the fault was in progress: after the initial interruption, the original routes were replaced by redirected ones. To compare the dynamic routing table against the configured static routes, we activated the F683’s segment route-tracing feature and traced the path from the tester to the remote router that had previously reported the fault. The route was broken at the next hop of the city exit, the first router on the district link, and the dynamic route had been replaced by the backup route, whose status showed as congested. The original path was the main route over an E1 channel (2.048 Mbps), while the backup route ran over a basic-rate DDN connection at only 64 kbps, roughly one thirty-second of the main link’s capacity, so any substantial amount of rerouted traffic would saturate it. Inspection of the main router’s MIB showed traffic at 0.02% with a 2% error rate, i.e. a very light load with a small amount of errored traffic. The backup router’s MIB showed 100% traffic: the backup link was completely saturated.
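The load and error figures quoted from the MIB are just ratios over the standard MIB-II interface counters (ifInOctets/ifOutOctets, ifInErrors, ifSpeed). The sketch below shows the arithmetic with illustrative counter deltas chosen to reproduce the values we saw; the deltas themselves are not the actual polled samples.

```python
# Arithmetic behind the MIB readings quoted above, based on MIB-II interface
# counters. The counter deltas are illustrative, chosen to reproduce the
# observed figures, not the actual polled values.

def utilisation_percent(octets_delta, interval_s, if_speed_bps):
    """Average load over one polling interval as a percentage of line rate."""
    return 100.0 * (octets_delta * 8) / (interval_s * if_speed_bps)

def error_rate_percent(errors_delta, packets_delta):
    """Errored packets as a percentage of packets seen in the interval."""
    return 100.0 * errors_delta / packets_delta if packets_delta else 0.0

E1_SPEED = 2_048_000   # main link, E1
DDN_SPEED = 64_000     # backup link, basic-rate DDN

# Main router: ~3 KB per minute on an E1 -> 0.02% utilisation.
print(utilisation_percent(octets_delta=3_072, interval_s=60, if_speed_bps=E1_SPEED))
# Backup router: 480 KB per minute on 64 kbps -> 100% utilisation (saturated).
print(utilisation_percent(octets_delta=480_000, interval_s=60, if_speed_bps=DDN_SPEED))
# 2 errored packets out of 100 received -> the 2% error rate on the main router.
print(error_rate_percent(errors_delta=2, packets_delta=100))
```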
Because the fault occurred periodically, we decided, with the “patient’s” consent, to watch its behavior through one full cycle rather than immediately chasing the cause of the main router’s interruption and congestion. Using a second network testing tool and a network fault assistant attached to the network, we monitored the workloads and errors of the main router, the backup router, and the main server, while keeping continuous ICMP monitoring on the main router. After roughly 8 minutes, the main router’s traffic rose rapidly and the backup router showed redirect indications. About 15 seconds later, the backup router reported route optimization, the dynamic routing table returned to the same configuration as the static routing table, and the network was completely back to normal.
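One simple way to watch a full cycle like this without dedicated test gear is to log the forwarding path toward an affected remote router at a fixed interval, so the fail-over to the backup link and the later reversion show up as changes in the recorded hop list. The sketch below uses the system traceroute command; the target address, polling interval, and flags are illustrative assumptions.

```python
import subprocess
import time
from datetime import datetime

REMOTE_ROUTER = "10.9.9.1"   # hypothetical remote router that reported faults
POLL_INTERVAL = 30           # seconds; the fault cycle is roughly 10 minutes

# Record the forwarding path at regular intervals so the moment the dynamic
# route falls over to the backup link (and later reverts) is visible as a
# change in the logged hop list. Flags: numeric output, 6 hops max, 2 s wait.
while True:
    trace = subprocess.run(
        ["traceroute", "-n", "-m", "6", "-w", "2", REMOTE_ROUTER],
        capture_output=True,
        text=True,
    )
    hops = []
    for line in trace.stdout.splitlines()[1:]:   # skip the header line
        fields = line.split()
        if fields:
            hops.append(fields[1])               # hop address, or "*" on timeout
    print(f"{datetime.now().isoformat(timespec='seconds')}  path: {' -> '.join(hops)}")
    time.sleep(POLL_INTERVAL)
```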
Analyzing the fault relationships pointed to the main router as the primary associated device. Since the users already had a cold-standby unit for the main router mounted in the rack, we first replaced the main router with it. The swap took five minutes, after which the replacement was connected to the network; three minutes later the network returned to normal. This lasted only about two minutes before the fault reappeared. Clearly, detailed monitoring of the main router itself was needed to find the real cause.
The network topology showed that the main router connected to three remote routers in other districts and to a local router, so we could monitor the working conditions of all of these routers at once. The results were as follows: when the fault occurred, the routing tables of the district’s main router and the local router changed, while the local settlement business was unaffected. The affected traffic included the inter-city, intra-city, and inter-city-via-local directions, among others. We tested the remote ATM router channels with Fluke’s ATM tester and, after looping back the remote ATM switch, monitored all three channel directions, which appeared entirely normal. Further testing of the cables attached to the main router showed that all of them were in good condition. This indicated that the main router’s link environment was essentially normal.
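Spotting which routes actually move during an episode comes down to comparing routing-table snapshots taken before and during the fault. The sketch below diffs two such snapshots; the prefixes and next hops are made up for illustration, and in practice the snapshots would come from the routers’ route-table output or their route MIB.

```python
# Compare two routing-table snapshots and report prefixes whose next hop
# changed. The snapshot contents here are hypothetical examples.

def parse_routes(snapshot):
    """Map each destination prefix to its next hop."""
    routes = {}
    for line in snapshot.strip().splitlines():
        prefix, next_hop = line.split()
        routes[prefix] = next_hop
    return routes

def diff_routes(before, after):
    """Prefixes present in both snapshots whose next hop differs."""
    return {
        prefix: (before[prefix], after[prefix])
        for prefix in before
        if prefix in after and before[prefix] != after[prefix]
    }

normal = parse_routes("""
172.16.0.0/16 192.168.1.1
172.17.0.0/16 192.168.1.1
""")

during_fault = parse_routes("""
172.16.0.0/16 192.168.2.1
172.17.0.0/16 192.168.2.1
""")

for prefix, (old, new) in diff_routes(normal, during_fault).items():
    print(f"{prefix}: next hop {old} -> {new}")
```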
At this point we wanted to see how the “garbage traffic” was distributed across the main router’s links, but the network hospital’s traffic analysis tool had been lent to another “patient,” so we could not observe the main router’s detailed traffic at that moment. In fact, since the router hardware itself had already been ruled out by the swap, all that remained to check was the main router’s grounding quality and its power supply environment: either of these, if not up to standard, could cause the main router to drop out.
We first looked at the UPS feeding the main router. While the fault was occurring, the UPS indicated overload even though its output circuit showed only a light load. Measurements with the F43 power quality analyzer showed input harmonics six times worse than usual during the fault, and output-circuit harmonics 400 times worse. After the fault cleared, the overload indication disappeared, but the output circuit still showed harmonics 80 times worse than normal. This indicated that the UPS was no longer filtering effectively.
Connecting the main router’s power supply to another UPS eliminated the fault entirely, so the cause was determined to be poor power quality. We noticed that the building housing the data center was under renovation, and the network management team mentioned plans to expand the network equipment once the work was finished; the interference was most likely related to the renovation. Given the periodic nature of the fault, careful observation revealed that its occurrence coincided with the up-and-down cycles of a nearby construction crane! To pin down the source of the harmonic interference, we connected the F43 power quality analyzer to the supply network. The results showed that whenever the crane went up, the fault occurred; when it came down, the harmonics were about one-third of those on the way up, causing only a minor network slowdown.
3. Conclusion
The UPS supplying the main router had failed, which reduced its ability to filter external harmonic interference from the mains. When heavy electrical equipment nearby drew power, the resulting harmonics could upset multiple devices, and with the UPS’s filtering out of action, the equipment behind it was exposed to that interference. In this case the main router suffered heavy interference, which led to link congestion and router drop-outs; the resulting routing changes redirected the various business traffic flows onto the backup router, which could not carry the extra load, and the network became congested. This explains why each episode began with an interruption, then recovered over the backup path only to turn into congestion. Local settlement data was unaffected because most of it did not pass through the main router.
When the crane came down, it still introduced some interference, but not enough to exceed the main router’s tolerance, so the router coped. Similar faults had in fact occurred before the renovation, but those sources of interference disappeared quickly rather than persisting, so maintenance personnel barely noticed them.
4. Diagnostic Recommendations
As with cabling and optical systems, power harmonics and UPS units should be included in regular checks: a twice-yearly check of power harmonics and UPS condition is generally recommended, with weekly checks for critical network components. Harmonic interference is a common environmental factor; as long as the UPS filtering is intact, it generally will not affect normal network operation. Once it does get into the network, however, it can seriously degrade performance and cause “paralyzing” or even fatal faults. Most users are quite unfamiliar with interference-type faults, so they deserve extra attention.
5. Afterword
After replacing the UPS, the network performed excellently. What was particularly heartening was that the concept of “regular maintenance” was embraced by the “patient.” With the assistance of the network hospital, they formulated a detailed network health maintenance plan and established detailed regulations for regular and as-needed maintenance. In fact, this is the most valuable part of the network hospital’s work: being proactive to prevent issues before they occur.