Why Troubleshoot Router Connectivity?
Slow networks, high ping latency, and packet loss are common issues in network operation and maintenance. Troubleshooting router connectivity plays a critical role in addressing these problems. However, pinpointing the exact cause of such faults is often difficult, and they frequently cannot be resolved through data collection and analysis by network management systems or agents alone.
This case study walks through the analysis of an intermittent fault on a university portal website, focusing on data packet retransmission analysis as part of troubleshooting router connectivity. It is shared here in the hope that it will be helpful to operation and maintenance personnel.
Router Connectivity Fault
Recently, the homepage of a certain university website became intermittently inaccessible.
According to the monitoring system alarms, the faults occurred at irregular intervals, and each fault lasted 4 to 11 minutes.
Fault messages were received through the Zabbix and Sangfor monitoring platforms, as shown below.
Analysis Before Troubleshooting Router Connectivity
Preliminary analysis confirmed the fault. When the fault occurred: ① a large number of data packets were retransmitted in the network; ② the service on this server was normal; ③ other servers were normal.
Conclusion: It is suspected that an application control system, such as a WAF, is affecting the website. Troubleshooting router connectivity should be part of the diagnostic process.
The analysis also found that the server actively initiated connection requests to external hosts, which poses a security risk. It is recommended that this behavior be confirmed and addressed while troubleshooting router connectivity.
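As a rough illustration of how server-initiated connections might be flagged in captured traffic, the sketch below filters a list of packet records for pure SYN packets whose source is the server. The `Packet` record layout, the function name `outbound_syns`, and every address other than 222.111.66.110 are assumptions made for illustration; they are not part of the NetInside system.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet record."""
    src: str    # source IP address
    dst: str    # destination IP address
    flags: str  # TCP flags as a compact string: "S" = SYN, "SA" = SYN/ACK, "A" = ACK


def outbound_syns(packets, server_ip):
    """Return the sorted destinations the server tried to open connections to."""
    return sorted({p.dst for p in packets if p.src == server_ip and p.flags == "S"})


# Invented sample data: one inbound handshake and one outbound attempt.
sample = [
    Packet("10.0.0.5", "222.111.66.110", "S"),
    Packet("222.111.66.110", "10.0.0.5", "SA"),
    Packet("222.111.66.110", "198.51.100.7", "S"),
]
print(outbound_syns(sample, "222.111.66.110"))  # ['198.51.100.7']
```

A SYN/ACK sent by the server is a reply to an inbound request, so only packets carrying SYN without ACK are treated as server-initiated here.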
Detailed Analysis Process
The NetInside network traffic analysis system is deployed on the aggregation switch near the university's servers. The probe equipment collects and stores all traffic data on the specified link and runs unattended. The required original packets can be downloaded, analyzed, and decoded in real time or afterwards to quickly locate the cause of a problem.
The detailed deployment locations are as follows:
Using the NetInside System to Analyze Faults
The system analyzes the host of the faulty website, 222.111.66.110 (a virtual address; the real address has been hidden. If you don't know how to hide the IP address in a packet capture file, go to the NetInside website and look in the technical column under network analysis, or search for "how to hide" on the website). The following figure is an analysis of access during the hour from 17:00 to 18:00 on November 3, 2019.
According to system analysis, the server experienced three failures during this period, with each failure lasting 2 minutes.
Each time a failure occurred, the server had 20-50 receive failures.
At the same time, the number of connections sent by the server was significantly reduced.
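To make the idea of reading fault periods off a per-minute failure count more concrete, here is a minimal sketch that groups consecutive minutes whose failure count reaches a threshold. The function name `fault_windows`, the threshold value, and the sample counts are all invented for illustration and do not come from the NetInside system or from the real capture.

```python
def fault_windows(counts, threshold):
    """Group consecutive minutes whose failure count is at or above threshold.

    counts: list of failure counts, one entry per minute (index = minute offset).
    Returns a list of (first_minute, last_minute) tuples, both inclusive.
    """
    windows = []
    start = None
    for minute, count in enumerate(counts):
        if count >= threshold and start is None:
            start = minute                      # a fault window opens
        elif count < threshold and start is not None:
            windows.append((start, minute - 1))  # the window closed last minute
            start = None
    if start is not None:                        # window still open at the end
        windows.append((start, len(counts) - 1))
    return windows


# Hypothetical per-minute receive-failure counts for a ten-minute slice.
counts = [0, 1, 35, 42, 0, 0, 27, 30, 1, 0]
print(fault_windows(counts, 20))  # [(2, 3), (6, 7)]
```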
The number of failures counts cases in which the client sends a SYN packet to the server during the TCP three-way handshake but the server does not reply with a SYN/ACK packet, so the handshake fails.
The number of connections counts successfully completed TCP three-way handshakes.
The system analysis showed that there was a problem when the client and the server established a TCP connection, so it is expected that the website could not be accessed.
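The two counters defined above can be sketched in a few lines. The example below walks a list of packets in capture order and tracks each client-to-server flow through SYN, SYN/ACK, and ACK. The `Packet` layout, the function name `count_handshakes`, and the client addresses are assumptions for illustration; how the NetInside system actually computes these figures is not described in this case.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet record."""
    src: str
    sport: int
    dst: str
    dport: int
    flags: str  # "S" = SYN, "SA" = SYN/ACK, "A" = ACK


def count_handshakes(packets):
    """Return (connections, failures) for packets given in capture order."""
    # Key: (client_ip, client_port, server_ip, server_port)
    # Value: "syn" -> "synack" -> "done"
    state = {}
    for p in packets:
        if p.flags == "S":
            # A retransmitted SYN must not reset the state of a known flow.
            state.setdefault((p.src, p.sport, p.dst, p.dport), "syn")
        elif p.flags == "SA":
            key = (p.dst, p.dport, p.src, p.sport)  # reply travels server -> client
            if state.get(key) == "syn":
                state[key] = "synack"
        elif p.flags == "A":
            key = (p.src, p.sport, p.dst, p.dport)
            if state.get(key) == "synack":
                state[key] = "done"
    connections = sum(1 for s in state.values() if s == "done")
    failures = sum(1 for s in state.values() if s == "syn")  # SYN never answered
    return connections, failures


packets = [
    Packet("10.0.0.5", 50001, "222.111.66.110", 80, "S"),
    Packet("222.111.66.110", 80, "10.0.0.5", 50001, "SA"),
    Packet("10.0.0.5", 50001, "222.111.66.110", 80, "A"),
    Packet("10.0.0.6", 50002, "222.111.66.110", 80, "S"),
    Packet("10.0.0.6", 50002, "222.111.66.110", 80, "S"),  # retransmitted SYN
]
print(count_handshakes(packets))  # (1, 1)
```

Flows that received a SYN/ACK but never sent the final ACK are counted as neither a connection nor a failure in this sketch, which matches the two definitions given above.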
In-depth Fault Analysis
There are many possible causes of this phenomenon, such as a network failure that prevents client data packets from reaching the server; a server hardware failure that prevents it from receiving client data packets; a problem with equipment that manages and controls specific applications in the network; or a Web application failure that prevents it from receiving and responding to client data packets.
The NetInside system is similar to a network camera, recording the traffic information of all servers under both normal and faulty conditions.
Next, we retrieve the "video" of the failure from the system for in-depth analysis.
After analysis, it was found that when the failure occurred, there were a large number of unresponsive SYNs and a large number of data packet retransmissions in the network, as shown in the following figure.
At the same time, during the failure period, we also saw that server 222.111.66.110 had response data, but the response data was lost and retransmitted. The red box in the figure below shows that the server service is normal.
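One simple way to think about spotting retransmissions in a capture is to look for data segments that repeat the same flow, sequence number, and payload length. The sketch below applies that heuristic to a list of tuples. The tuple layout, the function name `count_retransmissions`, and the sample values are assumptions for illustration; real analyzers also handle partial overlaps, keep-alives, and SYN/FIN segments, which this sketch ignores.

```python
def count_retransmissions(segments):
    """Count data segments that repeat an already-seen (flow, seq, length).

    segments: iterable of (src, sport, dst, dport, seq, payload_len) tuples
    in capture order. Segments without payload (pure ACKs) are skipped.
    """
    seen = set()
    retransmitted = 0
    for seg in segments:
        payload_len = seg[5]
        if payload_len == 0:
            continue              # pure ACKs legitimately repeat sequence numbers
        if seg in seen:
            retransmitted += 1    # same bytes sent again on the same flow
        else:
            seen.add(seg)
    return retransmitted


# Invented sample: the server's first response segment is sent twice.
segments = [
    ("222.111.66.110", 80, "10.0.0.5", 50001, 1000, 1460),
    ("10.0.0.5", 50001, "222.111.66.110", 80, 500, 0),      # pure ACK
    ("222.111.66.110", 80, "10.0.0.5", 50001, 1000, 1460),  # retransmission
    ("222.111.66.110", 80, "10.0.0.5", 50001, 2460, 1460),
]
print(count_retransmissions(segments))  # 1
```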
Detailed Analysis Conclusion
Through NetInside system analysis, the following conclusions were drawn:
① When the fault occurs, the network is connected, that is, the server is reachable;
② When the fault occurs, the server runs normally (whether its performance has declined requires further confirmation);
③ When the fault occurs, only the website access provided by 222.111.66.110 (virtual address; the real address has been hidden) is affected, and access to other websites in the network is not affected (other servers provide external services normally).
This may be caused by other control devices in the network and requires further analysis.
Suggestion
To locate the root cause of the fault more accurately, the following suggestions are made:
① Map out the network and application structure around the website server, produce a detailed topology diagram, and mark the location of devices such as WAFs or flow control equipment;
② In addition to checking the Web server log, check the logs of other devices for the fault periods to gather more information;
③ This fault has occurred frequently of late. After the logical and physical structures have been mapped, collect data packets at multiple points, focusing on the server, the switch close to the server (where a NetInside device is already deployed), and other key devices that may affect Web access.
By comparing the data packet information at the time of the failure, the root cause of the failure can be accurately located.
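The multi-point comparison can be sketched as a set difference: packets seen at an upstream capture point but missing at the next point downstream were dropped somewhere in between. The key layout, the function name `lost_between`, and the sample captures below are assumptions for illustration. The sketch also assumes that no device between the two points rewrites addresses, ports, or sequence numbers (for example NAT or a proxy), since that would change the keys.

```python
def lost_between(upstream, downstream):
    """Return packets seen upstream that never appear downstream.

    Each capture is an iterable of hashable packet keys, for example
    (src, sport, dst, dport, seq, flags). Order of the upstream capture is kept.
    """
    seen_downstream = set(downstream)
    return [pkt for pkt in upstream if pkt not in seen_downstream]


# Invented captures: three client SYNs leave the core switch,
# but only one reaches the switch next to the server.
at_core_switch = [
    ("10.0.0.5", 50001, "222.111.66.110", 80, 100, "S"),
    ("10.0.0.6", 50002, "222.111.66.110", 80, 200, "S"),
    ("10.0.0.7", 50003, "222.111.66.110", 80, 300, "S"),
]
at_server_switch = [
    ("10.0.0.5", 50001, "222.111.66.110", 80, 100, "S"),
]
missing = lost_between(at_core_switch, at_server_switch)
print(len(missing))  # 2
```

If packets disappear between two adjacent capture points, the device sitting between them (such as a WAF or flow control device) becomes the main suspect.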