1. Symptoms
A “patient” reported that one subnet of its network had suddenly slowed down while the central main network remained largely normal. The “patient” is a telecommunications multimedia network service company in a prefecture-level city, providing local hotline website services and Internet access to ordinary users in the city and its counties. Subscribers had reported slow network speeds and delays of more than 60 seconds when accessing their email. Soon afterwards, the city’s business office (where the subnet is located) reported a sudden slowdown that was affecting its operations. The “patient” had a network management system installed in the data center, and it showed that traffic on the subnet’s router was very high (measured at 97% utilization), while traffic between the central network’s routers and the other subnets was below 40%. Apart from this high utilization, the network management system showed nothing else that should have affected network speed. Since the “patient” had no other network testing tools and could not interrupt user services during the day, the network hospital was called in for assistance.
2. Diagnostic Process
The fault symptoms in this case were relatively simple. Because the abnormal traffic had been observed on the subnet’s router, the direction of the fault was easy to determine during the examination: find out where that routed traffic was coming from, and the source of the problem would follow quickly. Since the user had no tools for analyzing network traffic, we suspected the fault lay in the subnet and drove directly to its location, the telecom business office.
From the network topology, the subnet in the office connects to the central network over an E1 link that normally serves as the business channel for the office network. The office network is used mainly to carry business data and has only 45 workstations, so the 97% utilization reported by the network management system was abnormally high. One scenario that could load the E1 channel this heavily is the transfer of multimedia moving images, such as VOD, between hosts in the office network and the central network’s websites or servers. This has happened in many places, but it requires a moving-image source, which the central network does not currently provide (although an unauthorized installation could not be ruled out).
The office network is relatively small, and the central network’s management system only manages devices down to the router level; the switches in use are inexpensive desktop units that cannot be managed remotely. We connected the Fluke F683 network tester to the switch to provide portable network management functions and observed that traffic on the router was approximately 97%, consistent with what the network management system reported. Checking the corresponding router port on the central network side also showed about 97% traffic. The link itself was therefore working, but utilization this high inevitably causes congestion and packet loss, so from a traffic standpoint the channel was clearly abnormal. The question now was where this router traffic was coming from and where the packets were going after they reached the router; answering it would let us quickly pinpoint the source of the heavy channel traffic and congestion. We connected the Fluke F695 network traffic analyzer to the router’s channel for monitoring and analysis, and the results showed that 95% of the traffic was flowing to the business data server, most of it HTTP- and email-related (the traffic analyzer specifically breaks traffic down by application-layer protocol).
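For readers without access to a dedicated tester, the same kind of utilization reading can be reproduced with plain SNMP, assuming the router exposes the standard IF-MIB counters and the net-snmp command-line tools are installed. The sketch below is illustrative only; the router address, community string, and interface index are hypothetical placeholders, not values from this case.

```python
#!/usr/bin/env python3
"""Rough link-utilization check via SNMP (IF-MIB counters).

A minimal sketch, assuming the net-snmp CLI tools are installed and the
router answers SNMP v2c queries; host, community, and ifIndex are
placeholders. Counter wrap-around is ignored for brevity.
"""
import subprocess
import time

ROUTER = "192.0.2.1"     # hypothetical router address
COMMUNITY = "public"     # hypothetical read community
IF_INDEX = "2"           # hypothetical index of the E1 interface

IF_IN_OCTETS = "1.3.6.1.2.1.2.2.1.10." + IF_INDEX
IF_OUT_OCTETS = "1.3.6.1.2.1.2.2.1.16." + IF_INDEX
IF_SPEED = "1.3.6.1.2.1.2.2.1.5." + IF_INDEX   # interface speed in bit/s

def snmp_get(oid: str) -> int:
    """Fetch a single counter/gauge value with snmpget and return it as int."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", ROUTER, oid],
        text=True)
    return int(out.split()[-1])

INTERVAL = 30  # seconds between the two counter samples

speed = snmp_get(IF_SPEED)                     # roughly 2 048 000 for an E1
in1, out1 = snmp_get(IF_IN_OCTETS), snmp_get(IF_OUT_OCTETS)
time.sleep(INTERVAL)
in2, out2 = snmp_get(IF_IN_OCTETS), snmp_get(IF_OUT_OCTETS)

# Utilization of a full-duplex link is usually taken as the busier direction.
bits_in = (in2 - in1) * 8
bits_out = (out2 - out1) * 8
util = max(bits_in, bits_out) / (INTERVAL * speed) * 100
print(f"Link utilization over {INTERVAL}s: {util:.1f}%")
```

Run periodically, such a script would have shown the same ~97% figure the network management system and the F683 reported.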
Of this traffic, 88% was Internet access traffic and 7% was local traffic. Looking at the source distribution on the traffic analyzer, we found no concentration of traffic in any particular application, and the IP address distribution was fairly even, with the largest single source accounting for only 0.5% of the traffic. This suggested that application usage was balanced and that the fault lay in the application or forwarding path rather than in a concentrated “attack” by some user, such as a hacker. In other words, something was wrong in the path this traffic was taking through the channel: it should never have reached the business servers in the office network over the E1 channel at all, but should have gone straight from the central network’s main Internet router out to the Internet.
So how was this traffic being steered towards the office servers? During forwarding, IP packets depend on address resolution (ARP) at the router and on domain name resolution at the local DNS server; if either resolution process goes wrong, the exchange and delivery of IP packets goes wrong with it. Following the traffic analyzer’s indications, we ran route-trace tests to 10 selected IP addresses, and all of the traces passed through a DNS server. We also ran ICMP monitoring and route-trace tests for local and non-local users in the office network: redirect messages accounted for 82% of the ICMP traffic and destination-unreachable messages for another 13%. In other words, only the remaining few percent of packets were reaching their destinations directly over the correct route, while about 95% of the IP packets had to be redirected or retransmitted before they had any chance of arriving. We therefore needed to examine the main router’s routing table and the DNS translation table, and since most of the Internet access traffic was being funneled towards the business servers in the office network, the DNS server was the first suspect. Using the F683 network tester to query the DNS server, we found that a large portion of its translation table pointed to the business servers in the office network, which made the DNS server itself highly suspect. We immediately asked the central network’s management personnel to restart the DNS server and quickly reconfigure it. Shortly afterwards they reported that network services had returned to normal, and when we queried the DNS server again with the F683’s Internet toolkit, all of the entries pointing to the office business servers had disappeared. The network appeared to have recovered completely.
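The translation-table check performed here with the F683 can also be approximated in software. The sketch below, assuming the dnspython package is available, queries a given DNS server for a handful of names and flags any answer that lands inside the office subnet; the nameserver address, sample hostnames, and subnet range are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Flag DNS answers that resolve into the office subnet.

A minimal sketch of the translation-table check described above, using
dnspython; the nameserver, hostnames, and subnet are placeholders.
"""
import ipaddress
import dns.resolver   # pip install dnspython

DNS_SERVER = "192.0.2.53"                              # hypothetical central DNS server
OFFICE_SUBNET = ipaddress.ip_network("10.20.0.0/24")   # hypothetical office address range
SAMPLE_NAMES = ["www.example.com", "mail.example.com", "portal.example.net"]

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = [DNS_SERVER]

for name in SAMPLE_NAMES:
    try:
        answer = resolver.resolve(name, "A")
    except Exception as exc:            # NXDOMAIN, timeout, server failure, ...
        print(f"{name}: query failed ({exc})")
        continue
    for record in answer:
        addr = ipaddress.ip_address(record.address)
        flag = "  <-- points into the office subnet!" if addr in OFFICE_SUBNET else ""
        print(f"{name} -> {addr}{flag}")
```

In a healthy configuration none of the answers for public Internet names should fall inside the office subnet.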
However, the relief was short-lived. About 3 minutes later, the fault reappeared, with 97% of channel traffic still directed towards the subnet. Since there was only one DNS server and no backup or secondary server, we had to go to the central network’s server room immediately to inspect the DNS server and its surrounding equipment. We tested the server’s network card and the cables connected to the router, and everything was normal. To avoid service disruption, we asked the network management personnel to temporarily install and configure another DNS server on a backup machine. After a brief business interruption, the new DNS server began to work. We observed that the traffic on the subnet router immediately dropped to 1.5%. After 30 minutes of stable operation, all users had returned to normal working conditions.
3. Conclusion
A DNS server’s job is simply to translate users’ domain names into IP addresses, and it is a service that should rarely cause trouble. In this case, however, for reasons unknown at the time, the address translation consistently pointed at the business servers in the office subnet. Business servers have no routing capability, so they either dropped or ignored the incoming IP packets or answered with destination-unreachable or redirect messages, which is exactly the behavior we saw during ICMP monitoring. The central network has few users and ample spare bandwidth on the 155 Mbit/s ATM link to the provincial network, so users’ Internet speed was limited mainly by the subnet’s bandwidth. Large numbers of IP packets were being funneled over the congested and useless detour through the E1 link to the subnet router, whose bandwidth is only 2 Mbit/s, driving its utilization to 97%, slowing the subnet’s own operations dramatically, leaving the router severely congested, and subjecting redirected users to long delays. To pin down the cause of the faulty address translation, we recommended that the user proceed in steps: first reinstall the operating platform, application software, and network card drivers on the original faulty DNS server, put it back into service during the period of lowest usage, and check whether its translation table is then normal; if the problem persists, replace the network card, motherboard, and other hardware in turn to gradually narrow down the fault.
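The proportions of redirect and destination-unreachable messages quoted above came from the tester’s ICMP monitoring; a rough equivalent can be obtained with a packet sniffer on a mirror or tap port. The sketch below assumes the scapy package, root privileges, and a hypothetical monitoring interface name; it is not the tool used in this case, only an illustration of the technique.

```python
#!/usr/bin/env python3
"""Tally ICMP redirect and destination-unreachable messages.

A sketch of ICMP monitoring using scapy on a mirror/tap port; the
interface name and capture window are placeholders. Requires root.
"""
from collections import Counter
from scapy.all import sniff, ICMP   # pip install scapy

ICMP_TYPE_NAMES = {3: "dest-unreachable", 5: "redirect",
                   0: "echo-reply", 8: "echo-request"}
counts = Counter()

def tally(pkt):
    """Count each captured ICMP message by its type."""
    if ICMP in pkt:
        t = pkt[ICMP].type
        counts[ICMP_TYPE_NAMES.get(t, f"type-{t}")] += 1

# Capture ICMP for 60 seconds on a hypothetical monitoring interface.
sniff(iface="eth1", filter="icmp", prn=tally, store=False, timeout=60)

total = sum(counts.values()) or 1
for name, n in counts.most_common():
    print(f"{name:18s} {n:6d}  ({100 * n / total:.1f}%)")
```

A breakdown dominated by redirects and destination-unreachable messages, as seen here, is a strong hint that packets are being delivered to hosts that cannot forward them.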
4. Diagnostic Recommendations
To keep DNS instability from interrupting or corrupting business services, many network administrators install backup DNS servers, i.e. more than one DNS server. This practice carries its own hidden risk: when the primary DNS server fails, the backup takes over automatically at the cost of some network bandwidth and a drop in overall system performance, and this degradation usually goes unnoticed. To keep the network consistently in good working order, administrators therefore need to check the DNS server’s translation table regularly. This is one of the recommended items of “weekly maintenance” (of course, maintaining network performance involves far more than checking that routing is optimal; performance evaluation, benchmark testing, channel testing, application monitoring, topology management, and other periodic maintenance tasks all belong here, and interested readers can refer to “Introduction to Network Testing” for details). In this case, the DNS translation error sent user IP packets to the servers in the office subnet. Had they been sent instead to a machine on the central network’s local segment rather than a server across the E1 link, the consequences would have been milder and users would not have noticed much of a slowdown; the “patient” might have felt no obvious “discomfort” at all, and the network would have run with a latent defect for a long time. Just as with people, regular check-ups are what catch diseases and their hidden risks in time. Detecting routing problems promptly is likewise part of periodic network testing, and it is all the more necessary for large networks; regular maintenance and testing must be adhered to.
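One simple way to make such a weekly check concrete is to query the primary and backup DNS servers for a short list of important names and warn when their answers diverge or disappear, which also exposes a silent failover. The sketch below assumes dnspython; the server addresses and hostnames are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Weekly check: compare primary and backup DNS answers for key names.

A sketch of the periodic translation-table check suggested above, using
dnspython; server addresses and hostnames are placeholders.
"""
import dns.resolver   # pip install dnspython

SERVERS = {"primary": "192.0.2.53", "backup": "192.0.2.54"}
KEY_NAMES = ["www.example.com", "mail.example.com"]

def lookup(server, name):
    """Return the set of A-record addresses a given server answers with."""
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [server]
    try:
        return {rec.address for rec in r.resolve(name, "A")}
    except Exception:
        return set()   # treat failures as "no answer"

for name in KEY_NAMES:
    answers = {label: lookup(ip, name) for label, ip in SERVERS.items()}
    if answers["primary"] != answers["backup"]:
        print(f"WARNING: {name} differs between servers: {answers}")
    elif not answers["primary"]:
        print(f"WARNING: {name} did not resolve on either server")
    else:
        print(f"OK: {name} -> {sorted(answers['primary'])}")
```

Scheduled from cron and mailed to the administrator, a check like this turns the otherwise invisible degradation described above into a visible report.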
Many network devices, such as routers, switches, and hubs, support SNMP management. To monitor channel performance comprehensively, however, devices also need extensive RMON and RMON2 support, and a network built from such devices offers excellent management and fault-diagnosis capabilities. The practical problem is that such equipment typically costs 6 to 10 times as much as ordinary equipment, which most users find hard to accept. A more affordable approach is to install monitoring interfaces on the channels of important servers and routers, so that traffic analyzers and network testers can be attached easily when needed to observe service application traffic, proportions, sources, and work records, and to capture packets for analysis when necessary. This alone can cut the time needed to diagnose a case like this one to about 20 minutes. If the budget permits, keeping traffic analyzers permanently attached at full line rate to the channels of several critical network devices can cut fault-localization time to under a minute.
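Where a full traffic analyzer is not available, a monitoring interface still allows a rough per-application traffic breakdown to be captured in software. The sketch below uses scapy to group captured bytes by well-known TCP/UDP port; the interface name and capture window are hypothetical, and classification by port is only an approximation of true application-layer analysis such as the F695 performs.

```python
#!/usr/bin/env python3
"""Rough per-application traffic breakdown on a monitoring/mirror port.

A sketch only: byte counts grouped by well-known port, captured with
scapy. Interface name and capture window are placeholders; port-based
classification approximates real application-layer analysis. Requires root.
"""
from collections import Counter
from scapy.all import sniff, IP, TCP, UDP   # pip install scapy

PORT_NAMES = {80: "http", 443: "https", 25: "smtp", 110: "pop3", 53: "dns"}
bytes_by_app = Counter()

def classify(pkt):
    """Attribute each captured packet's bytes to an application by port."""
    if IP not in pkt:
        return
    layer = pkt[TCP] if TCP in pkt else (pkt[UDP] if UDP in pkt else None)
    if layer is None:
        app = "other-ip"
    else:
        port = min(layer.sport, layer.dport)   # favor the well-known side
        app = PORT_NAMES.get(port, f"port-{port}")
    bytes_by_app[app] += len(pkt)

sniff(iface="eth1", prn=classify, store=False, timeout=60)   # hypothetical tap port

total = sum(bytes_by_app.values()) or 1
for app, nbytes in bytes_by_app.most_common(10):
    print(f"{app:12s} {nbytes:10d} bytes  ({100 * nbytes / total:.1f}%)")
```

A report dominated by HTTP and mail traffic heading for a single server, as in this case, points immediately at a resolution or routing problem rather than at any one user.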
5. Afterword
On the third day we followed up with the “patient” by phone: the network was running perfectly. The user had traced the fault to the motherboard of the original DNS server. The motherboard was unstable, and we suspect it caused program errors while the server was processing data at the application layer or exchanging data with the network card. After the motherboard was replaced, the server worked normally again, and the “patient” has put the repaired machine into service as an online backup DNS server to improve the network’s reliability.