1. Symptoms
Today’s “patient” is rather unusual: the supervisor of a power company’s information department. What makes this case special is that the supervisor has called several times asking for a diagnosis of wide-area connection problems, yet each time has reported within 15 minutes that the “fault has been resolved.” When asked how, the answer is usually “by resetting the entire system.” The user has installed an expensive network management system to oversee the whole network but has no other maintenance tools. The network hospital had previously suggested a comprehensive check of this information network, covering the cabling systems, network equipment, working protocols, load balancing, load capacity, error-frame tolerance, and so on, but for various reasons this was never carried out. Today’s symptoms are the same as before: poor connectivity between one power plant’s information network and the power information center’s network, unstable data transmission speed, and intermittent loss of connectivity. The difference this time is that resetting the system no longer works.
2. Diagnostic Process
The network consists of nine sub-networks belonging to different power plants; one sub-network is connected via X.25, and the other eight have gradually migrated to DDN links since last year. One of the dedicated DDN lines (Line 7) occasionally experiences connection interruptions, and restoring service normally requires resetting the router to re-establish the connection. On this occasion, however, the usual remedy proved ineffective, and the link remained down until our arrival. We connected a network tester to the information center’s network and observed the routers connecting each power plant’s sub-network. The working table of the No. 7 router showed records of occasional transmission-delay errors. Its channel carried only 7 frames in 30 seconds, whereas the other lines carried between 170 and 2,700 frames over the same period. A channel test toward the No. 7 sub-network reached a maximum of only 2 kbps, far below the 64 kbps nominal rate of the line, indicating that the DDN link’s data transmission capability was severely degraded. Because the router has limited error identification and statistics capabilities, the network management system could not provide more detailed statistics. We therefore tapped an F69x traffic analyzer into the WAN channel and discovered a large number of frames of undefined type with unstable record labels: unclassifiable bitstreams whose rate fluctuated, apparently interference signals that had infiltrated the network. This junk data significantly disrupts normal data exchange and transmission.
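For readers who, like this user, have only a network management console, a rough equivalent of the 30-second traffic comparison above can be obtained by sampling the routers’ standard SNMP interface counters. The following is a minimal sketch, not the method used in this case: it assumes the routers answer SNMPv2c queries with a read-only community string and that the net-snmp command-line tools are installed; all addresses and interface indexes are hypothetical.

#!/usr/bin/env python3
"""Sketch: sample per-interface traffic and error counters on each WAN
router via SNMP, so a line like Line 7 can be compared against the others
without a dedicated traffic analyzer.  Hosts, community string and
interface indexes below are made up for illustration."""
import subprocess
import time

# Standard MIB-II interface counters (IF-MIB)
IF_IN_OCTETS = "1.3.6.1.2.1.2.2.1.10"   # ifInOctets
IF_IN_ERRORS = "1.3.6.1.2.1.2.2.1.14"   # ifInErrors

def snmp_get(host: str, community: str, oid: str) -> int:
    """Fetch one counter value using the net-snmp CLI (-Oqv prints the value only)."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def sample_line(host: str, if_index: int, community: str = "public",
                interval: float = 30.0) -> None:
    """Sample a WAN interface twice, `interval` seconds apart, and report
    the throughput and input-error delta, mirroring the 30-second
    comparison made with the network tester."""
    oid_octets = f"{IF_IN_OCTETS}.{if_index}"
    oid_errors = f"{IF_IN_ERRORS}.{if_index}"

    o1, e1 = snmp_get(host, community, oid_octets), snmp_get(host, community, oid_errors)
    time.sleep(interval)
    o2, e2 = snmp_get(host, community, oid_octets), snmp_get(host, community, oid_errors)

    bps = (o2 - o1) * 8 / interval
    print(f"{host} ifIndex {if_index}: {bps:.0f} bit/s in, "
          f"{e2 - e1} input errors over {interval:.0f} s")

if __name__ == "__main__":
    # Hypothetical addresses of the WAN routers at the information center.
    for router in ["10.0.0.1", "10.0.0.2", "10.0.0.7"]:
        sample_line(router, if_index=1)

A line whose interface moves almost no traffic while its error counter climbs would stand out in such a comparison much as Line 7 did on the tester.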
To assess the impact, we ran ICMP Ping tests toward the remote sub-networks. The loss rate was 10%, which is not extreme in itself; however, ICMP Monitor tests showed 50% destination-unreachable, 20% redirect, and 85% congestion messages, indicating serious problems in the router’s channels. There was no evidence of interfering bitstreams in the central network, and the harmonic content of the UPS supplying the No. 7 router was also normal, essentially ruling out external interference entering the network at the center. We swapped the No. 7 router with others and reconfigured them, but the problem persisted. Since the amount of junk observed at the center was small and could not by itself account for the drop in channel performance, we suspected that its source was either the dedicated DDN line or the router at the remote sub-network (although the latter seemed less likely). The information center had no testing tools for DDN lines, and without solid evidence pointing at the DDN line (a leased telecom circuit), we had to start the investigation from the remote sub-network. The remote sub-network had no network maintenance tools of its own, and the network management system at the center could not collect detailed error statistics, so we decided to travel to the No. 7 power plant to continue the investigation.

Four hours later we arrived and began testing. Data exchange inside the No. 7 sub-network was normal, with no junk bitstreams. The router’s working table showed a few frame-delay packets, but the data-exchange problems on the WAN connection persisted. Testing the channel traffic with the F69x traffic analyzer, we found a large amount of “junk bits,” amounting to 55 kbps, 35% of it originating from the remote router. This confirmed that the fault lay in the remote router or in the DDN link adjacent to it. After the router was replaced with a spare brought from the information center, the problem was resolved.
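The ICMP Ping measurement at the start of this step can be approximated by anyone with ordinary host tools. The sketch below is only an illustration under stated assumptions, not the procedure used in the case: it assumes a Linux-style ping whose summary line contains “% packet loss,” and the target addresses are hypothetical. The “ICMP Monitor” figures (destination unreachable is ICMP type 3, source quench or “congestion” is type 4, redirect is type 5) would additionally require capturing the returning ICMP messages, for example with tcpdump.

#!/usr/bin/env python3
"""Sketch: measure packet loss toward each remote sub-network using the
system ping command.  Targets below are made-up illustrative addresses."""
import re
import subprocess

def ping_loss(target: str, count: int = 100) -> float:
    """Return the packet-loss percentage reported by `ping -c count target`."""
    out = subprocess.run(
        ["ping", "-c", str(count), target],
        capture_output=True, text=True,
    ).stdout
    match = re.search(r"([\d.]+)% packet loss", out)
    # If the summary line cannot be parsed, treat the line as fully down.
    return float(match.group(1)) if match else 100.0

if __name__ == "__main__":
    # Hypothetical hosts inside two of the remote sub-networks.
    for target in ["192.168.7.10", "192.168.8.10"]:
        print(f"{target}: {ping_loss(target, count=50):.0f}% loss")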
3. Conclusion
WAN channel failures can result from various causes. Generally, if the channel test fails, it signifies issues within the WAN link, which may comprise different transmission media and protocols such as ATM, DDN, ISDN, Frame Relay, SDH, etc. For different types of WAN links, dedicated testing tools should be used for precise diagnosis.
However, since most users do not own WAN testing tools (though some integrators do provide them), the user or system integrator must first use a process of elimination to determine whether the problem lies in the network devices (including the routers) or in the WAN link itself. Only after internal network problems have been ruled out can the WAN link operator be asked to inspect the service channel. This particular failure was caused by a malfunction in the remote router: in addition to normal data, the router was sending a considerable amount of junk bits, which occupied channel bandwidth and seriously disrupted regular data transmission. In the past, although the router ran unstably, the system would recover by itself within a short time, which is why the user could report “fault resolved within 15 minutes” (we call such faults “soft faults”). This time the soft fault turned into a persistent “hard fault,” which actually made it easier to diagnose. Because most of the junk data was filtered out on the DDN link and the remote router’s error-recognition capability was limited, the central network saw only a small amount of junk, and inspecting the remote router yielded no detailed error statistics. The ICMP Ping and ICMP Monitor tests, however, produced elevated error rates, and significant junk bitstreams were detected originating from the remote end (the combination of the F69x traffic analyzer and the F68x network tester offers strong detection capability with comprehensive error recognition and statistics functions, which is why the possibility of a DDN link failure was considered minimal). We therefore concluded that the fault lay in the remote router.
In fact, if the remote sub-network had appropriate testing tools, this fault could have been resolved in a very short time.
4. Diagnostic Recommendations
One must have the right tools for the job. In large networks, it’s necessary to keep some spare network equipment, and depending on the network’s size, usage level, and the technical expertise of the maintenance staff, the appropriate maintenance tools should be provided. It’s essential to establish a comprehensive testing and maintenance plan and regulations to ensure network reliability and prompt handling of various network faults.
Most network devices provide basic network management functions capable of recognizing roughly 30% to 40% of network errors and fault information, which sometimes creates the misconception that network management alone is sufficient for identifying all network issues. Advanced performance testing, however, requires specialized tools that not only recognize the various working protocols but can also identify the many kinds of online “junk.” Generally, WAN testing tools should be available alongside LAN testing tools for performance assessment, fault handling, and routine testing.
5. Afterword
Two days later, the “patient” called to report that the problem had been traced to the router’s circuit board: the router’s DC supply voltage was unstable, and further testing showed that the output of a voltage-regulator IC was unstable and that the IC was running hot. After the IC was replaced, the router returned to normal operation.