A faulty crystal head (RJ-45 plug) leads to a large-scale network problem

Network analysis

1. Symptoms

Mr. Huang, the IT manager of a major company, is a friend of mine. With the New Year approaching, he has had little good news lately. He called today to ask for help in identifying the “culprit.”

Here’s the situation: the company has been growing rapidly. Two weeks ago, it carried out a major network expansion, adding 200 workstations for new employees. The network grew from 2,000 to 2,200 stations, all within a single subnet. The company runs 100BaseT Ethernet, with two routers connecting the production and development sites (the links between them were recently upgraded to two 155 Mbps ATM backbones). I had previously suggested dividing the network into smaller subnets for better management and fault isolation. However, the network had never had serious problems, Mr. Huang was experienced and confident, and no maintenance budget had been allocated, so they kept this “risky structure” of one large subnet.

During this expansion, they upgraded the two wide-area backbone links to 155 Mbps ATM, but the subnet structure stayed essentially unchanged, with plans to address it in a future project. This week, the network has suffered repeated blockages, at least twice a day, each lasting between 10 and 30 minutes. We carefully examined the 200 newly installed workstations and found no problems. Because the fault is intermittent and the boss is applying heavy pressure, Mr. Huang is exhausted.

2. Diagnostic Process

At 10:00 in the morning, I queried the routers’ MIB. The recorded parameters looked mostly normal, with average network utilization at 13%. Collisions made up approximately 1.5% of the traffic, which indicated that most of the network was functioning properly. We shared a program with the 200 new workstations and had them download and run it simultaneously in groups of 40. All 200 workstations worked properly.
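For readers who want to derive figures like the 13% utilization and the 1.5% collision share from raw counters, here is a minimal sketch. It assumes two samples of an octet counter such as ifInOctets taken a known number of seconds apart. The function names, the sample numbers, and the collision-share definition (collisions divided by frames plus collisions) are my own assumptions, not necessarily what the routers or the instruments use.

```python
def utilization_percent(octets_start, octets_end, interval_s, link_bps, counter_bits=32):
    """Average link utilization (%) between two octet-counter samples."""
    modulus = 2 ** counter_bits
    # SNMP octet counters wrap around; modular subtraction copes with a single wrap.
    delta_octets = (octets_end - octets_start) % modulus
    return delta_octets * 8 / (interval_s * link_bps) * 100


def collision_percent(collisions, frames):
    """Collisions as a share of all transmission attempts (assumed definition)."""
    attempts = frames + collisions
    if attempts == 0:
        return 0.0
    return collisions / attempts * 100


# Hypothetical samples: 97,500,000 octets in 60 s on a 100 Mbps segment.
print(round(utilization_percent(0, 97_500_000, 60, 100_000_000), 1))
print(round(collision_percent(15, 985), 1))
```

A low average like this says nothing about short bursts, which is why an intermittent fault can hide behind healthy-looking averages.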

We connected an F683 network tester and an F693 network traffic analyzer to the network for monitoring. At 14:21, a network blockage occurred and lasted 15 minutes. During this time, the F693 traffic analyzer reported normal traffic: the average rose from 9% to 13% and dropped to 8% one minute later. The F683 network tester, however, reported utilization of around 84%, with collision frames accounting for 82% to 87% and a small share of FCS error frames (approximately 2% to 4%). We recorded the Protocol Matrix conversation chart before and after this period and found that, during the 15-minute blockage, a total of 137 workstations had sent or received data. Four of them were transmitting and receiving continuously, and one of those four carried about 15 times the packet traffic of the other workstations. Fortunately, Mr. Huang had documented the MAC addresses of the workstations, so we immediately identified the users of these four workstations from the MAC addresses shown on the instrument (the highest-traffic user was Ms. Chen from the Finance Department).
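The step of picking out the heavy talkers and matching them to users can be sketched in a few lines. This is only an illustration of the idea, not how the instrument does it: the record layout (source MAC, destination MAC, frame count), the threshold of ten times the median, and all sample addresses and labels are made up.

```python
import statistics
from collections import Counter


def top_talkers(conversations, inventory, factor=10.0):
    """Return (mac, owner, frames) for stations at or above factor x the median load."""
    totals = Counter()
    for src, dst, frames in conversations:
        # Count a conversation against both ends, as a protocol matrix does.
        totals[src] += frames
        totals[dst] += frames
    if not totals:
        return []
    median = statistics.median(totals.values())
    suspects = []
    for mac, frames in totals.most_common():
        if median > 0 and frames >= factor * median:
            # The inventory is the kind of MAC-to-user record Mr. Huang kept.
            suspects.append((mac, inventory.get(mac, "unknown station"), frames))
    return suspects


# Made-up sample data, using MAC addresses from the documentation range.
SAMPLE_CONVERSATIONS = [
    ("00:00:5e:00:53:01", "00:00:5e:00:53:02", 15000),
    ("00:00:5e:00:53:03", "00:00:5e:00:53:04", 1000),
    ("00:00:5e:00:53:05", "00:00:5e:00:53:06", 1000),
    ("00:00:5e:00:53:07", "00:00:5e:00:53:08", 1000),
]
SAMPLE_INVENTORY = {"00:00:5e:00:53:01": "Finance, desk 12 (hypothetical)"}

for mac, owner, frames in top_talkers(SAMPLE_CONVERSATIONS, SAMPLE_INVENTORY):
    print(mac, owner, frames)
```

Both ends of a heavy conversation are flagged, and a station missing from the inventory shows up as unknown, which is exactly where good documentation saves time.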

We asked whether they had recently changed any hardware or network cables, or added, removed, or adjusted any software. Their answer was “no.” We asked Ms. Chen about her recent computer usage and her communication with Xiao Zhang at the production site (the Protocol Matrix chart showed that Xiao Zhang’s workstation was involved). She replied, “My computer has been connected to the network all along, but I wasn’t using it just now.” We connected the network tester to the network card port of Ms. Chen’s desktop computer and generated simulated traffic. Collisions rose sharply as the traffic increased. Testing the network card and the cable on that link showed that the plug was a Category 3 plug with excessive near-end crosstalk (NEXT). After it was replaced with a Category 5 plug, the network returned to normal.

After several private inquiries, Ms. Chen finally revealed the truth.

3. Recommendations

Large networks should be segmented for better management and fault isolation. The company’s network structure, built mainly on hubs with only a few switches, does not help with fault isolation. It is also essential to train employees in proper computer usage and to discourage arbitrary changes to software and network settings. Fortunately, Mr. Huang, with his extensive experience, had kept detailed documentation, including the MAC addresses of the workstations, which helped us identify the issue swiftly. In most cases, tracking down a problem like this takes one to three days.
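To make the segmentation advice concrete, here is a small sketch using Python's standard ipaddress module. The company's real addressing plan is not given, so the 10.0.0.0/20 block and the choice of /24 segments are purely hypothetical. The point is only that one flat range large enough for 2,200 stations can be carved into many small subnets.

```python
import ipaddress


def split_subnet(cidr, new_prefix):
    """Split one network into equal subnets with the given prefix length."""
    network = ipaddress.ip_network(cidr)
    return list(network.subnets(new_prefix=new_prefix))


def usable_hosts(network):
    """Usable IPv4 host addresses, excluding the network and broadcast addresses."""
    return network.num_addresses - 2


# Hypothetical flat network with room for about 4,000 stations.
segments = split_subnet("10.0.0.0/20", 24)
print(len(segments), "segments of", usable_hosts(segments[0]), "hosts each")
for segment in segments[:3]:
    print(segment)
```

With segments of this size, a single bad link would disturb at most a couple of hundred stations rather than all 2,200.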

4. Afterword

Through this experience, Mr. Huang has realized that he cannot rely on experience alone, and he has learned something about the art of being a proficient IT manager. He now understands that network maintenance is both an art and a science, and that without appropriate tools and scientific methods, the highest level of “artistic mastery” is hard to reach. As for Ms. Chen, we are inclined to keep her “secret” and protect her, and to stay discreet about Xiao Zhang for a while as well.
