1. Symptoms
Today's "patient" is an unusual one: a supervisor from a power information department. What makes the case special is that this supervisor has repeatedly called to request diagnosis of wide-area connection problems, yet each time we are told within 15 minutes that the "fault has been resolved." When asked how it was resolved, the answer is usually that the entire system was reset. This user has installed an expensive "network management system" to oversee the entire network and has no other maintenance tools. The network hospital had previously suggested a comprehensive check of this information network, covering the wiring systems, network equipment, working protocols, load balancing, load capacity, error frame tolerance, and so on, but for various reasons this was never carried out. Today's symptoms are the same as before: poor connectivity between one power plant's information network and the power information center's network, unstable data transmission speed, and intermittent loss of connectivity. The difference this time is that resetting the system no longer works.
2. Diagnostic Process
The network consists of nine sub-networks at different power plants. One sub-network is connected via X.25, and the other eight have been migrating to DDN links since last year. One of the dedicated DDN lines (Line 7) occasionally drops its connection, and the router has to be reset to re-establish it. On this occasion the usual remedy did not work, and the system was still faulty when we arrived.
We connected a network tester to the information center's network and examined the routers that connect each power plant's sub-network. The working table of router 7 recorded occasional transmission delay errors. Its channel traffic over 30 seconds was only 7 frames, whereas the other lines carried between 170 and 2700 frames in the same period. A channel test on sub-network 7 reached a maximum of 2 kbps, far below the line's 64 kbps maximum rate, which indicated that the DDN link had very weak data transmission capability.
Because the router has limited error identification and statistics capabilities, the network management system could not provide more detailed statistics. We therefore used an F69x traffic analyzer to tap into the WAN channel, and found a significant number of frames of undefined type with unstable record labels. These unclassifiable bitstreams, whose traffic rate fluctuated, appeared to be interference signals infiltrating the network. Junk data of this kind seriously disrupts normal data exchange and transmission.
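The comparison made in this step (7 frames on Line 7 against 170 to 2700 frames on the other lines, and 2 kbps measured against a 64 kbps rating) can be written as a small check. What follows is a minimal Python sketch, not part of the original diagnosis. The function names, the 10% threshold, and every per-line count other than Line 7's are illustrative assumptions; only the figures 7, 170, 2700, 2 kbps, and 64 kbps come from the case.

```python
from statistics import median


def flag_quiet_lines(frames_per_window, ratio=0.1):
    """Return the ids of lines whose frame count is far below their peers.

    A line is flagged when its count is less than `ratio` times the median
    count of all the other lines. The 0.1 default is an assumed threshold.
    """
    flagged = []
    for line_id, count in frames_per_window.items():
        others = [c for k, c in frames_per_window.items() if k != line_id]
        if others and count < ratio * median(others):
            flagged.append(line_id)
    return sorted(flagged)


def utilization(measured_kbps, rated_kbps):
    """Fraction of the rated line speed actually achieved in a channel test."""
    if rated_kbps <= 0:
        raise ValueError("rated_kbps must be positive")
    return measured_kbps / rated_kbps


# Frames seen in a 30-second window on each DDN line.
# Only Line 7's value (7) and the 170..2700 range are from the case;
# the individual counts for the other lines are hypothetical.
frames_30s = {1: 170, 2: 450, 3: 900, 4: 1300, 5: 2100, 6: 2700, 7: 7, 8: 640}

print(flag_quiet_lines(frames_30s))   # [7]
print(utilization(2, 64))             # 0.03125, about 3% of the rated speed
```

Comparing each line against the median of its peers, rather than against a fixed frame count, is one reasonable design choice here because the normal load on the other lines varies by more than a factor of ten.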
To assess the impact, we ran ICMP Ping tests toward the remote sub-networks. The loss rate was 10%, which is not extremely high. However, the ICMP Monitor tests showed 50% target unreachable, 20% redirection, and 85% congestion, indicating severe problems within the router's channels. There was no evidence of interfering bitstreams in the central network, and we also checked the harmonic content of the UPS supplying power to router 7, which was normal. This essentially ruled out external interference entering the network at the center. We swapped router 7 for another unit and reconfigured it, but the problem persisted.
Because the amount of junk bits observed at the center was low, and should not by itself have caused a significant drop in channel performance, we suspected that the source was either the dedicated DDN line or, less likely, a router at the remote sub-network. The local information center had no testing tools for DDN lines, and since the DDN line is leased from the telecom operator, we could not ask for it to be inspected without substantial evidence that it was at fault. We therefore had to start the investigation from the remote sub-network. The remote sub-network had no network maintenance tools, and the network management system at the center could not retrieve detailed error statistics, so we decided to travel to the 7th power plant.
Four hours later, we arrived and began testing. Data exchange inside sub-network 7 was normal, with no junk bitstreams. The router's working table showed a few frame delay records. However, the data exchange problems on the WAN connection persisted. Using an F69x traffic analyzer to test the channel traffic, we found a significant amount of "junk bits," amounting to 55 kbps, with 35% of them originating from the remote router. This confirmed that the issue was caused by a fault in the remote router or in the DDN link near the router.
After replacing the router with a spare one from the information center, the issue was resolved.
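As a rough worked check of why the link had become unusable, the measured junk rate can be set against the line rate. The sketch below is illustrative only. It assumes that the 55 kbps of junk measured on the WAN channel competes directly with normal data for the 64 kbps maximum rate quoted in the channel test, and the function name `residual_capacity` is hypothetical.

```python
def residual_capacity(line_rate_kbps, junk_kbps):
    """Return (usable_kbps, junk_share) for a line carrying junk traffic.

    usable_kbps is the upper bound left for normal data; junk_share is the
    fraction of the line rate consumed by junk. Assumes junk and normal
    data share the same channel.
    """
    if line_rate_kbps <= 0 or junk_kbps < 0:
        raise ValueError("line rate must be positive and junk rate non-negative")
    usable = max(line_rate_kbps - junk_kbps, 0)
    return usable, junk_kbps / line_rate_kbps


usable, share = residual_capacity(64, 55)
print(usable)  # 9 (kbps left, at best, for normal traffic)
print(share)   # 0.859375, i.e. about 86% of the line occupied by junk
```

Under that assumption, at most 9 kbps remained for normal traffic, which is broadly in line with the 2 kbps maximum seen in the earlier channel test.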
3. Conclusion
WAN channel failures can result from various causes. Generally, if the channel test fails, it signifies issues within the WAN link, which may comprise different transmission media and protocols such as ATM, DDN, ISDN, Frame Relay, SDH, etc. For different types of WAN links, dedicated testing tools should be used for precise diagnosis.
However, most users do not possess WAN testing tools (though some integrators do provide them), so users or system integrators must first use the process of elimination to determine whether the problem lies in the network devices (including routers) or in the WAN link. Only after ruling out problems in their own networks can they ask the WAN link operator to inspect the service channel.
This particular failure was due to a malfunction in the remote router. In addition to normal data, the router was sending a considerable amount of junk bits, which occupied channel bandwidth and seriously affected regular data transmission. In the past, although the router was unstable, the system would recover by itself within a short time, which is why the user was able to claim "fault resolution within 15 minutes" (we refer to such faults as "soft faults"). This time, the fault changed from a soft fault into a persistent "hard fault," which actually made it easier to diagnose.
Most of the junk data was filtered out on the DDN link, and the remote router's ability to recognize erroneous data was limited. As a result, the central network observed only minimal junk bits, and inspecting the remote router did not yield detailed error statistics. However, the ICMP Ping and ICMP Monitor tests produced high error rates, and the analyzer detected significant junk bitstreams originating from the remote end (the combination of the "F69x traffic analyzer" and "F68x network tester" provides robust detection with comprehensive error recognition and statistics functions, which is why the possibility of a DDN link failure was considered minimal). We therefore concluded that the fault lay in the remote router.
In fact, if the remote sub-network had appropriate testing tools, this fault could have been resolved in a very short time.
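The process of elimination described above can be summarized as an ordered set of checks. The following Python sketch is only an illustration of that ordering as it played out in this case. The `Findings` fields and the `next_suspect` function are hypothetical names, and a real investigation would rest on measurements rather than four yes/no flags.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Yes/no results gathered during the investigation (hypothetical fields)."""
    local_lan_clean: bool          # no junk bitstreams seen in the central network
    local_router_ruled_out: bool   # fault persists after swapping the local router
    remote_lan_clean: bool         # remote sub-network exchanges data normally
    remote_router_emits_junk: bool # analyzer traces junk to the remote router


def next_suspect(f: Findings) -> str:
    """Walk the checks in order and name the first component not yet cleared."""
    if not f.local_lan_clean:
        return "local network"
    if not f.local_router_ruled_out:
        return "local router"
    if not f.remote_lan_clean:
        return "remote network"
    if f.remote_router_emits_junk:
        return "remote router"
    # Everything the user can test has been cleared: escalate to the operator.
    return "WAN link (ask the operator to inspect the channel)"


# The findings in this case, in the order they were established.
case = Findings(
    local_lan_clean=True,
    local_router_ruled_out=True,
    remote_lan_clean=True,
    remote_router_emits_junk=True,
)
print(next_suspect(case))  # remote router
```

The WAN link comes last in this ordering because, as noted above, the operator can only be asked to inspect a leased line once the user's own equipment has been ruled out.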
4. Diagnostic Recommendations
One must have the right tools for the job. In large networks, it is necessary to keep some spare network equipment, and appropriate maintenance tools should be provided according to the network's size, its usage level, and the technical expertise of the maintenance staff. It is essential to establish a comprehensive testing and maintenance plan, with supporting regulations, to ensure network reliability and prompt handling of network faults.
Most network devices provide basic network management functions that can recognize and identify around 30% to 40% of network errors and fault information. This sometimes creates the misconception that having network management functions is sufficient for identifying all network issues. Advanced performance testing, however, requires specialized tools that can not only recognize the various working protocols but also identify the various types of online "junk." In general, WAN testing tools should be available alongside LAN testing tools for performance assessment, fault handling, and regular testing.
5. Afterword
Two days later, the "patient" called to report that they had found a problem on the router's circuit board. The router's direct current voltage was unstable, and further testing showed that the voltage of the voltage regulator IC was unstable and that the IC was running at a high temperature. After the IC was replaced, the router returned to normal operation.