1. Symptoms
Today’s “patient” is a well-known securities company. At 9:45 in the morning, users called for urgent assistance: a group of angry investors, claiming to have suffered heavy losses, had gathered at the entrance of the brokerage’s data center. They demanded to know why the real-time quotation display screens showed large blank areas, why data refresh and order execution had become extremely slow with frequent interruptions that made trading impossible, and they threatened to smash the brokerage’s computers if trading was not restored immediately. Small investors from the trading hall were also starting to gather outside the data center; if the situation was not handled promptly, the agitated crowd might actually damage the network equipment in the data center.

I hung up the phone, rushed to the brokerage branch, and continued gathering information by mobile phone on the way. The network in question is a 10 Mbps Ethernet with 230 users. It receives market data from a satellite broadcast and sends trading information back. The monitoring port for data from the satellite receiver appeared completely normal, so the network administrators initially suspected a problem in the network system itself.

Data transmission errors had in fact begun two months earlier: occasional blank quotation updates, occasional slowdowns in data updates, and other intermittent issues. Although the administrators checked the network with their management system and protocol analyzers, the symptoms did not appear consistently and had virtually no impact on network speed or the traders’ work, so the issue was set aside and its root cause was never investigated. Two days ago the system passed the “second round of Y2K unified certification testing for securities systems”; in the remaining time the hardware was inspected and maintained, and a follow-up network check showed everything operating normally. Unexpectedly, serious problems broke out when the market opened this morning.
2. Diagnosis Process
Using the F683 network tester, I monitored the network for 30 seconds. Traffic utilization was 81% (although the network management system reported only 0.2%), with 97.6% error frames. The error types included Ghosts (93%), FCS errors (also known as CRC errors), and Jabber, indicating a large number of illegal packets on the network. Symptoms like these are usually related to electromagnetic interference and ground-loop problems.

To narrow down the location of the interference source, I switched off most of the hubs serving workstations while the servers kept running. This reduced the error rate to 87%, which was still very high. When the hubs were powered back on, the F43 power harmonics tester showed harmonics severely exceeding the standard, the highest reading being 970 mV. The network was powered by a large UPS; its input harmonics measured only about 30% of its output harmonics, far below the level at the output, which indicated an excessive amount of internally generated harmonic content.

After switching over to a small backup UPS (network devices were rotated onto it in batches to limit the load), the network returned to normal operation, but the tester still showed errors, with the Ghost error rate falling to 1.3%. Switching the hub group off again brought the Ghost error rate down to 0.8%, confirming that Ghost interference was being introduced through a ground loop, most likely via the main data channel. Shaking the data output cable of the satellite receiver made the Ghost interference appear and disappear, and unplugging the cable eliminated it entirely. The network administrators then recalled that they had touched this cable during equipment maintenance two days earlier, which had left a poor connection.

So that the traders could resume trading and calm down, I replaced the cable and put the original UPS back in service for the rest of the session. After the market closed, the large UPS was replaced, and the problem was completely resolved.
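To make the FCS (CRC) error category from the measurements above concrete: every Ethernet frame ends with a 4-byte frame check sequence, a CRC-32 computed over the rest of the frame, and a frame whose recomputed CRC does not match that field is counted as an FCS/CRC error by testers such as the F683. The following is a minimal illustrative sketch, assuming Python and raw frame bytes that already include the trailing FCS; the helper names are my own, not part of any tester's software. Python's zlib CRC-32 uses the same polynomial as the Ethernet FCS, so the comparison is direct.

```python
import zlib

def fcs_ok(frame: bytes) -> bool:
    """Check an Ethernet frame's FCS.

    Assumes `frame` runs from the destination MAC through the trailing
    4-byte FCS. The FCS is a CRC-32 over everything before it, stored
    least-significant byte first; zlib.crc32 uses the same polynomial.
    """
    if len(frame) < 64:            # minimum legal frame size, FCS included
        return False               # runts are counted as errors anyway
    body, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(body) == int.from_bytes(fcs, "little")

def fcs_error_rate(frames: list[bytes]) -> float:
    """Error-frame ratio in percent, the way a tester reports it."""
    bad = sum(1 for f in frames if not fcs_ok(f))
    return 100.0 * bad / len(frames) if frames else 0.0
```

Ghost events, by contrast, never form a valid frame at all (there is no proper preamble or start-of-frame delimiter), which is why they cannot be counted this way and why they point to electrical noise such as ground loops rather than to a faulty station.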
3. Conclusion
The malfunction has two root causes. First, the decline in the UPS's power purification capability allowed external harmonics to enter the power system easily, laying the groundwork for a serious fault; the accumulation of internal harmonics alone, however, is not enough to trigger a fatal failure. Second, a fault in the grounding circuit provided a path for a large amount of internal harmonic power to enter the network. "Internal harmonics" here means the harmonic power measured at the output of power purification equipment such as the UPS, generated by the connected electrical equipment itself (most network devices use switch-mode power supplies, which inherently produce significant harmonics). In this case, the accumulated internal harmonic power entered the trading network through the data output cable of the satellite receiver.

The interference showed up in two ways. On one hand, it appeared as Ghost disturbances that eroded network bandwidth; once the Ethernet's total traffic exceeded 80%, most of the network was effectively paralyzed. On the other hand, the influx of internal harmonics corrupted normal data transmission, producing FCS errors and occasional long frames, so that the data received from the satellite receiver was erroneous, which is what caused the blank screens and delayed data updates.

The fault was cumulative. Two months ago, as the UPS degraded, minor disturbances already exceeded acceptable levels, but this did not draw enough attention from the network administrators. After the equipment maintenance two days ago, interference coupled in through the cable's grounding circuit increased; but because the satellite receiver was not connected to the network at that time, the administrators only checked the working condition of the network itself, which looked normal. Only today, shortly before the market opened, when the satellite broadcast data channel was connected, did the problem erupt: internal harmonic interference flooded the network and nearly brought it down. Shutting off power to the hub group lowered the total harmonic power, so the interference signal weakened and the error rate fell. After the UPS was replaced, the error rate dropped dramatically (in theory it should have fallen to zero), but the grounding-circuit fault still let 50 Hz mains power and its higher-order harmonics leak into the network, producing a small number of error frames.

It is worth noting that people tend to assume a network is healthy once the UPS has been replaced, and may overlook the small percentage of errors that remains (1.3% here). That would allow the more serious grounding-circuit problem to persist for a long time. A network management system alone is essentially powerless to diagnose this kind of fault.

[Recommendations] Regularly test the power harmonic content and the network error rate, and do not take detected error frames lightly; a simple way to automate such a check is sketched at the end of this section. In addition, it is advisable not to power more than 30 workstations from a single power source along one line; otherwise, consider redesigning the power supply zones just as you would segment a network, so that internal harmonic power cannot accumulate beyond what the equipment can tolerate.
If your network demands high reliability or is business-critical, it is advisable to design the power plan so that essential network equipment such as servers and routers is powered by individual UPS units.
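As a concrete illustration of the first recommendation (regularly checking the network error rate), the sketch below polls interface counters and warns when the error-frame ratio or the utilization crosses a threshold. It is only a minimal example under stated assumptions: the thresholds are taken from this case, and the `read_counters` helper is a hypothetical placeholder, since how you actually obtain the counters (SNMP, a switch CLI, or a tester's export) depends entirely on your equipment.

```python
import time
from dataclasses import dataclass

# Thresholds drawn from this case: sustained utilization above 80% on a
# shared 10 Mbps Ethernet, or any non-negligible error-frame ratio,
# deserves investigation rather than being written off as a glitch.
UTILIZATION_WARN = 80.0   # percent
ERROR_RATE_WARN = 1.0     # percent of frames

@dataclass
class Counters:
    rx_frames: int
    rx_errors: int        # FCS/CRC errors, jabbers, etc.
    rx_bytes: int

def read_counters() -> Counters:
    """Hypothetical placeholder: fetch counters via SNMP, a switch CLI,
    or a tester's export. Replace with whatever your equipment provides."""
    raise NotImplementedError

def poll(interval_s: float = 30.0, link_bps: float = 10e6) -> None:
    """Poll periodically and flag suspicious utilization or error rates."""
    prev = read_counters()
    while True:
        time.sleep(interval_s)
        cur = read_counters()
        frames = cur.rx_frames - prev.rx_frames
        errors = cur.rx_errors - prev.rx_errors
        octets = cur.rx_bytes - prev.rx_bytes
        error_rate = 100.0 * errors / frames if frames else 0.0
        utilization = 100.0 * (octets * 8) / (link_bps * interval_s)
        if utilization > UTILIZATION_WARN or error_rate > ERROR_RATE_WARN:
            print(f"WARNING: utilization {utilization:.1f}%, "
                  f"error rate {error_rate:.1f}% - check power and grounding")
        prev = cur
```

Note that Ghost events are not valid frames and will not show up in ordinary interface counters; catching them, as in this case, still requires a hardware tester such as the F683, which is also why the management system here reported only 0.2% traffic while the wire was actually saturated.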