1. Symptoms
On the last day of the May Day holiday, a major railway hub station reported serious trouble with its ticketing system. The hub station’s local ticketing system slowed sharply at first, producing long queues at the station’s ticket counters and a stream of passenger complaints. Ticket issuance at the city’s presale locations was also slow. Soon afterward, all connected stations reported slow ticketing, with neighboring stations reporting it most frequently. Maintenance personnel suspected the central ticketing server and temporarily suspended operations, but even after quickly switching to the backup server the speed remained slow. The system integration contractor investigated and found that CPU utilization on the central ticketing server had reached 97%, essentially full capacity, while the other servers and network devices showed no problems. Temporarily disconnecting the presale points and the other stations did not help either. After nearly 7 hours of troubleshooting, the stations were under mounting pressure and preparing to fall back to manual ticketing procedures.
2. Diagnostic Process
Network Hospital personnel went to the ticketing center’s computer room as soon as the report came in. The network administrators mentioned that similar problems had occurred earlier during the holiday, but those episodes were shorter (around 2 hours) and did not noticeably affect ticketing speed. After a brief discussion with the network administrators and the system integration contractor’s engineers, five possible causes were identified: 1) a fault in the ticketing settlement software; 2) a virus, or erroneous changes by internal personnel, especially network administrators, such as deleting files that should not be deleted or running conflicting or destructive software; 3) a platform failure, such as hardware damage to the NT platform or irreparable changes requiring system reinstallation; 4) a network equipment fault; 5) other network problems. Since the ticketing server and the switches had already been replaced, the first and fourth possibilities were provisionally excluded. To save diagnostic time, the second and third possibilities were set aside initially, because a detailed system inspection and protocol test, or an NT platform reinstallation with settings and data recovery, would take a long time. The fifth possibility, other network problems, was therefore tested first. CPU utilization on all of the other servers was below 25%.
After reviewing the network topology diagram, the F683 network tester was connected to a workstation switch to observe overall network behavior. Network device status was checked first: the switches, routers, and other equipment were all functioning correctly. The port connecting the core switch to the ticketing server, slot 2 port 7, was set to 100 Mbps, and its traffic was measured at 84% utilization, which is quite high. The MAC conversation matrix for the whole segment showed heavy access traffic to the ticketing server. The IP conversation matrix was consistent with the MAC matrix and showed access traffic to the ticketing server more than 500 times higher than that of any other conversation pair. Tracing the source of this traffic revealed that a single internal accounting PC was generating the heavy conversation with the ticketing server: its MAC-level traffic was high, while its IP-level traffic was slightly lower than the MAC-level figure. To improve processing speed, the ticketing server had been connected directly to the core switch, whereas the accounting PC reached the server through a desktop switch, a workstation switch, and then the core switch. When questioned, the operator of the accounting PC said the machine had already been misbehaving before the holiday, running slowly. The problem had been reported to network maintenance, but with the holiday approaching and workloads high, maintenance personnel had planned to deal with the accounting PC afterward. When the accounting PC was powered off, the malfunction disappeared and the entire system returned to normal, to everyone’s relief. To pinpoint the fault in the PC itself, it was moved to the local office network, reconfigured, and found to operate normally there. To be safe, the network administrators replaced the accounting PC with a new machine and observed the network’s behavior. The network returned fully to normal and the issue was resolved.
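For readers who want to reproduce this kind of conversation-matrix analysis without a dedicated tester such as the F683, the following Python sketch ranks the heaviest MAC and IP conversations in a packet capture taken from a mirror (SPAN) port. It assumes the scapy library is installed; the capture file name is a placeholder, not a file from this case.

```python
# Sketch: build MAC and IP "conversation matrices" from a capture and
# rank the heaviest talkers, similar in spirit to the F683's view.
from collections import Counter
from scapy.all import rdpcap, Ether, IP

packets = rdpcap("segment.pcap")   # assumed capture from a mirror port

mac_pairs = Counter()   # bytes per (src MAC, dst MAC) pair
ip_pairs = Counter()    # bytes per (src IP, dst IP) pair

for pkt in packets:
    if Ether in pkt:
        mac_pairs[(pkt[Ether].src, pkt[Ether].dst)] += len(pkt)
    if IP in pkt:
        ip_pairs[(pkt[IP].src, pkt[IP].dst)] += len(pkt)

print("Top MAC conversations:")
for (src, dst), nbytes in mac_pairs.most_common(5):
    print(f"  {src} -> {dst}: {nbytes} bytes")

print("Top IP conversations:")
for (src, dst), nbytes in ip_pairs.most_common(5):
    print(f"  {src} -> {dst}: {nbytes} bytes")
```

A single conversation pair dominating both lists by orders of magnitude, as in this case, points directly at the offending host.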
3. Conclusion
In hindsight, let’s recap the entire course of the malfunction so readers can follow its progression and the reasons behind it. The fault originated with a seemingly insignificant network card failure in a workstation. The trouble began just before the holiday, when the card started transmitting error frames. Because the workstation was connected to a store-and-forward switch, those error frames were filtered out by the switch; they affected only this workstation and posed no threat to the rest of the network. As the card suffered further physical damage, it became unable to clear the destination IP address it was transmitting to and effectively locked onto the ticketing server. It began sending packets without restraint, continually asking the ticketing server to handle duplicate queries for the same ticket issuance. Since the card observed no rate limits (it kept transmitting regardless of network load), the switches forwarded a large volume of junk packets to the ticketing server, consuming substantial bandwidth and forcing the server to spend significant resources processing them, which blocked normal access. In addition, these packets were awkward for the server to process, so handling them consumed extra resources.
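The duplicate-query behavior described above can also be spotted directly from live traffic by counting how often each source repeats an identical payload toward the server. The sketch below illustrates the idea with scapy; the server address, sniffing duration, and duplicate threshold are assumptions for illustration, not values from the case.

```python
# Sketch: detect a host that keeps re-sending the same request payload
# to one server, the behavior attributed to the failed NIC here.
from collections import Counter
from scapy.all import sniff, IP, Raw

SERVER_IP = "10.0.0.10"        # assumed ticketing-server address
payload_counts = Counter()     # (src IP, payload bytes) -> occurrences

def tally(pkt):
    if IP in pkt and Raw in pkt and pkt[IP].dst == SERVER_IP:
        payload_counts[(pkt[IP].src, bytes(pkt[Raw].load))] += 1

# Capture traffic toward the server for one minute (requires privileges).
sniff(filter=f"dst host {SERVER_IP}", prn=tally, timeout=60)

for (src, payload), count in payload_counts.most_common(5):
    if count > 100:            # arbitrary duplicate threshold
        print(f"{src} repeated an identical {len(payload)}-byte request {count} times")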
As mentioned in an earlier story, network card failures have two basic manifestations. One is the “quiet” type: the card ceases all normal network communication and stops transmitting, causing no real damage to the network. The other is the “frantic” type: after the failure, the card floods the network with packets without restraint. These packets may be in normal or abnormal (error) formats, and either kind can severely degrade or even disrupt the network. Error-format packets normally cannot pass through a store-and-forward switch, so in this case the monitoring tool, attached through the switch, could only see the faulty card’s normal-format packets; the error packets became observable only after connecting through a hub.
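Even when a store-and-forward switch hides error frames from downstream monitors, the switch’s own port counters still record them. The sketch below shows one way to read those IF-MIB counters using net-snmp’s command-line tools from Python; the switch address, community string, and interface index are placeholders, not values from this case.

```python
# Sketch: poll a switch port's IF-MIB error counter, which records bad
# frames even when the switch discards them before forwarding.
import subprocess

SWITCH = "192.168.1.2"      # assumed switch management address
COMMUNITY = "public"        # assumed read community
IF_INDEX = 7                # assumed ifIndex of the suspect port

def snmp_get(oid: str) -> str:
    """Run net-snmp's snmpget and return only the value field."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH, oid],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

in_errors = snmp_get(f"1.3.6.1.2.1.2.2.1.14.{IF_INDEX}")   # ifInErrors
print(f"ifInErrors on port index {IF_INDEX}: {in_errors}")
```

A steadily climbing error counter on a single access port would have flagged the failing card well before it turned “frantic”.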
4. Diagnostic Recommendations
This network had only a small number of member systems and was not equipped with a network management system or testing tools when it was built. As a result, there was no early alarm for excessive traffic, which hindered rapid identification and resolution of the fault. Many existing networks rely on a fix-after-failure maintenance approach, addressing problems only as they occur. For networks with high reliability requirements, however, this approach is risky, because we cannot count on network devices and users’ equipment behaving as the “quiet” type when they fail. Regular network monitoring is a powerful way to avert major network incidents. In this case, had a few minutes been spent each day monitoring the network, the fault could have been detected and corrected at an early stage, preventing the serious incident that followed.
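As a rough illustration of the kind of daily check recommended here, the sketch below samples a switch port’s IF-MIB octet counter twice over SNMP and warns when utilization crosses a threshold. The addresses, interface index, sampling interval, and the 70% limit are all assumptions; a real deployment would also handle counter wraparound and log the results.

```python
# Sketch: estimate a switch port's utilization from two ifInOctets
# samples and warn when it exceeds a threshold.
import subprocess
import time

SWITCH, COMMUNITY, IF_INDEX = "192.168.1.2", "public", 7   # placeholders
THRESHOLD = 0.70            # alert above 70% utilization (assumed limit)
INTERVAL = 30               # seconds between samples

def snmp_int(oid: str) -> int:
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH, oid],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

speed_bps = snmp_int(f"1.3.6.1.2.1.2.2.1.5.{IF_INDEX}")     # ifSpeed
first = snmp_int(f"1.3.6.1.2.1.2.2.1.10.{IF_INDEX}")        # ifInOctets
time.sleep(INTERVAL)
second = snmp_int(f"1.3.6.1.2.1.2.2.1.10.{IF_INDEX}")

utilization = (second - first) * 8 / (INTERVAL * speed_bps)
if utilization > THRESHOLD:
    print(f"WARNING: port {IF_INDEX} at {utilization:.0%} utilization")
```

Run from a scheduler a few times a day, even a simple check like this would have raised the 84% utilization on the ticketing server’s port long before the counters queued up.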
5. Afterword
The “virus” we had worried about has never appeared.