One of the biggest challenges in my time troubleshooting Linux networks has been bridging the gap between networking and systems engineering. System administrators often lack visibility into the network and frequently blame it for outages or peculiar problems. Conversely, network administrators, who have no control over the servers, suffer from constant "guilty under suspicion" fatigue and frequently attribute issues to the network endpoints.
Next, I will cover the basics of Linux network troubleshooting through the Linux command line.
TCP/IP Model of Linux Command Line
Linux Network Troubleshooting: Physical Layer
Let's start with the most basic question: how do you tell whether a physical interface is up? Use the ip link show command:
# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
link/ether 00:0c:29:b6:e3:71 brd ff:ff:ff:ff:ff:ff
Note the DOWN state for the ens192 interface in the output above. This means that the physical layer is not up. First make sure the interface has not been administratively disabled; if it has, bring it up with the command below. If it still reports DOWN, check the cabling and the remote end of the connection (such as a switch port).
# ip link set ens192 up
The output of ip link show can be difficult to parse at a quick glance. The -br (brief) switch prints it in a more readable tabular format:
# ip -br link show
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
ens192 UP 00:0c:29:b6:e3:71 <BROADCAST,MULTICAST,UP,LOWER_UP>
After running ip link set ens192 up, the ens192 interface reports UP again, as the output above shows.
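On a host with many interfaces, the brief listing can be filtered rather than read by eye. The following is a minimal sketch, assuming the column layout shown above (name, state, MAC address, flags); the function name down_ifaces is made up for this example and is not part of iproute2.

```shell
# down_ifaces: read `ip -br link show` output on stdin and print the name
# of every interface whose operational state (second column) is DOWN.
# Assumes the brief layout: NAME STATE MAC <FLAGS>.
down_ifaces() {
    awk '$2 == "DOWN" { print $1 }'
}
```

It would be used as: ip -br link show | down_ifaces. An empty result means no interface reported DOWN.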
These commands are great for troubleshooting obvious physical problems, but what about more subtle ones? The interface may be negotiating at the wrong speed, or collisions and other physical layer issues may be causing packets to be lost or corrupted, resulting in expensive retransmissions. How do we start troubleshooting these problems?
We can use the -s flag of the ip command to print additional statistics about the interface. The following output shows a mostly clean interface with only a small number of dropped received packets and no other signs of physical layer problems:
# ip -s link show ens192
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 00:0c:29:b6:e3:71 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
34107919 5808 0 6 0 0
TX: bytes packets errors dropped carrier collsns
434573 4487 0 0 0 0
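When comparing counters across several hosts, it can help to reduce this output to the error and drop counts. The sketch below assumes the layout shown above, where the numbers sit on the line directly after each RX: or TX: header and the third and fourth columns are errors and dropped; link_errors is a hypothetical helper name.

```shell
# link_errors: read `ip -s link show <iface>` output on stdin and print the
# errors and dropped counters for the receive and transmit directions.
link_errors() {
    awk '
        # A header line starts with RX: or TX:; the counters are on the
        # next line, in the order: bytes packets errors dropped ...
        $1 == "RX:" || $1 == "TX:" {
            dir = substr($1, 1, 2)
            if ((getline) > 0)
                print dir, "errors=" $3, "dropped=" $4
        }
    '
}
```

It would be used as: ip -s link show ens192 | link_errors. Counters that keep growing between two runs are more telling than a single non-zero value.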
For more advanced physical layer troubleshooting, the ethtool utility is an excellent choice. A particularly good use case for this command is to check if an interface has negotiated the correct speed. An interface that has negotiated the wrong speed (e.g., a 10Gbps interface that only reports a 1Gbps speed) could be an indication of a hardware/cabling problem, or a misconfigured negotiation on one side of the link (e.g., a misconfigured switch port).
# ethtool ens192
Settings for ens192:
Supported ports: [ TP ]
Supported link modes: 1000baseT/Full
10000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
MDI-X: Unknown
Supports Wake-on: uag
Wake-on: d
Link detected: yes
The above output shows a link that is running at the expected speed of 10000 Mb/s (10 Gbps) in full-duplex mode.
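The speed check can also be scripted. This is a rough sketch, assuming ethtool prints a line of the form "Speed: 10000Mb/s" as in the output above; check_speed is a hypothetical helper, and the expected speed in Mb/s is passed as its only argument.

```shell
# check_speed: read `ethtool <iface>` output on stdin and compare the
# reported speed with the expected speed in Mb/s (first argument).
# Prints "ok N", "mismatch N" or "unknown"; exit status is 0, 1 or 2.
check_speed() {
    awk -v want="$1" '
        # Keep only the digits of the value, e.g. 10000Mb/s becomes 10000.
        /Speed:/ { gsub(/[^0-9]/, "", $2); got = $2 }
        END {
            if (got == "") { print "unknown"; exit 2 }
            if (got + 0 == want + 0) { print "ok " got; exit 0 }
            print "mismatch " got
            exit 1
        }
    '
}
```

It would be used as: ethtool ens192 | check_speed 10000. A link that is down typically reports an unknown speed, which this sketch prints as "unknown".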
Linux Network Troubleshooting: Data Link Layer
The data link layer is the second layer in the OSI model and is responsible for reliable data transmission between directly connected nodes. It defines the format and method of transmitting data frames over physical connections, as well as hardware addressing and access control between nodes. The data link layer usually consists of two sublayers:
- Logical Link Control (LLC) Sublayer: The LLC sublayer is responsible for establishing and maintaining logical links, handling functions such as flow control, error detection, and correction to ensure reliable data transmission.
- Medium Access Control (MAC) Sublayer: The MAC sublayer is responsible for implementing medium access control, managing data transmission when multiple nodes share the same physical medium, and processing the physical addresses of the nodes.
The Data Link layer is responsible for local network connectivity, primarily the communication of frames between hosts in the same Layer 2 domain (often called a LAN). The most relevant Layer 2 protocol to most system administrators is the Address Resolution Protocol (ARP), which maps Layer 3 IP addresses to Layer 2 Ethernet MAC addresses. When a host tries to contact another host on its local network (such as the default gateway), it may have the other host's IP address, but it does not know the other host's MAC address. ARP solves this problem and figures out the MAC address for us.
A common problem you'll run into is a failure to populate an ARP entry, especially for the host's default gateway. If your local host cannot resolve the layer 2 MAC address of its gateway, it will not be able to send any traffic to remote networks. This problem could be caused by having the wrong IP address configured for the gateway, or by something else, such as a misconfigured switch port.
We can use the ip neighbor command to check the entries in our ARP table:
# ip neighbor show
10.6.80.1 dev ens192 lladdr 7c:1e:06:25:d2:d9 REACHABLE
The MAC address of the gateway is already filled in. If there is a problem with ARP, then we will see the resolution fail:
# ip neighbor show
10.6.80.1 dev ens192 FAILED
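To check one neighbor, such as the default gateway, from a script, the state can be picked out of this output. The sketch below assumes the state is the last field on each line, as in the examples above; neigh_state is a hypothetical helper and MISSING is a label invented here for "no entry at all".

```shell
# neigh_state: read `ip neighbor show` output on stdin and print the state
# of the neighbor whose IP address is given as the first argument.
# Prints MISSING (a label made up for this sketch) if there is no entry.
neigh_state() {
    awk -v ip="$1" '
        $1 == ip { print $NF; found = 1 }
        END { if (!found) print "MISSING" }
    '
}
```

It would be used as: ip neighbor show | neigh_state 10.6.80.1. A result of FAILED or INCOMPLETE points at an ARP resolution problem for that address.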
Another common use case for the ip neighbor command involves manipulating the ARP table. Imagine that your networking team has just replaced the upstream router (i.e., the default gateway for your servers). The MAC address may have changed as well, since MAC addresses are hardware addresses assigned at the factory.
Linux caches ARP entries for some time, so you may not be able to send traffic to the default gateway until the ARP entry times out. For very important systems, this outcome is undesirable. Fortunately, you can manually delete the ARP entry, which will force a new ARP discovery process:
# ip neighbor show
10.6.80.1 dev ens192 lladdr 7c:1e:06:25:d2:d9 REACHABLE
10.6.80.100 dev ens192 lladdr ac:1f:6b:d2:3e:bb REACHABLE
# ip neighbor delete 10.6.80.100 dev ens192
# ip neighbor show
10.6.80.1 dev ens192 lladdr 7c:1e:06:25:d2:d9 REACHABLE
Linux Network Troubleshooting: Network Layer
Layer 3 involves the use of IP addresses, which should be familiar to any system administrator. IP addresses provide a way for a host to reach other hosts outside of the local network (although we usually use them within the local network as well). One of the first steps in diagnosing a problem is to check the local IP address of the machine, which can be done using the ip address command, again with the -br flag to simplify the output:
# ip -br address show
lo UNKNOWN 127.0.0.1/8 ::1/128
ens192 UP 10.6.80.202/24 fe80::20c:29ff:feb6:e371/64
tun0 UNKNOWN 10.88.0.1 peer 10.88.0.2/32 fe80::6435:83ad:f6c6:2b59/64
The ens192 interface has an IPv4 address of 10.6.80.202. If an interface has no IP address, that is the first problem to fix. The lack of an IP address could be caused by a local configuration error, such as an incorrect network interface configuration file, or by a problem with DHCP.
The frontline tool most system administrators use to diagnose layer 3 problems is the ping utility. Ping sends ICMP Echo Request packets to a remote host and expects an ICMP Echo Reply in return. If you are having connectivity issues with a remote host, ping is a common place to start troubleshooting. By default, ping sends ICMP echoes to the remote host indefinitely; press CTRL+C to stop it, or pass the -c <count> flag to send a fixed number of packets. For example:
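The "does this interface have an address" check is easy to automate. The following sketch assumes the brief layout shown above, with addresses starting in the third column; ipv4_of is a hypothetical helper name.

```shell
# ipv4_of: read `ip -br address show` output on stdin and print the first
# IPv4 address (with its prefix length) of the interface named in the first
# argument. Prints nothing if the interface has no IPv4 address.
ipv4_of() {
    awk -v ifc="$1" '
        $1 == ifc {
            # Addresses start at column 3; take the first dotted quad.
            for (i = 3; i <= NF; i++)
                if ($i ~ /^[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+/) { print $i; exit }
        }
    '
}
```

It would be used as: ip -br address show | ipv4_of ens192. An empty result means there is no IPv4 address to work with and the interface configuration or DHCP should be checked first.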
# ping www.baidu.com
PING www.a.shifen.com (180.101.50.188) 56(84) bytes of data.
64 bytes from 180.101.50.188 (180.101.50.188): icmp_seq=1 ttl=50 time=20.5 ms
64 bytes from 180.101.50.188 (180.101.50.188): icmp_seq=2 ttl=50 time=20.1 ms
64 bytes from 180.101.50.188 (180.101.50.188): icmp_seq=3 ttl=50 time=20.5 ms
64 bytes from 180.101.50.188 (180.101.50.188): icmp_seq=4 ttl=50 time=20.9 ms
^C
--- www.a.shifen.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 20.198/20.570/20.957/0.268 ms
The ping output includes the round-trip time for each response. While ping makes it easy to tell whether a host is alive and responding, a failed ping is not conclusive. Many network operators block ICMP packets for security reasons, although many people disagree with this practice. Another common mistake is relying on the time field as an accurate indicator of network latency. Intermediate network devices can rate-limit ICMP packets, so ping times cannot be relied upon to give a true representation of application latency.
The next tool in your layer 3 toolkit is the traceroute command. Traceroute uses the Time to Live (TTL) field in IP packets to determine the path that traffic takes to reach its destination. It sends probe packets starting with a TTL of 1. When a packet's TTL expires in transit, the router that dropped it sends back an ICMP Time Exceeded message. Traceroute then increments the TTL to discover the next hop. The resulting output is a list of the intermediate routers that the packets traveled through on their way to the destination:
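For monitoring scripts, the packet loss figure in the summary is usually the most useful number. This is a small sketch that assumes the summary line format shown above ("N% packet loss"); ping_loss is a hypothetical helper name.

```shell
# ping_loss: read ping output on stdin and print the packet loss
# percentage from the summary line, without the percent sign.
ping_loss() {
    sed -n 's/.*[ ,]\([0-9][0-9.]*\)% packet loss.*/\1/p'
}
```

It would be used as: ping -c 4 www.baidu.com | ping_loss. Keep the caveats above in mind: 100% loss may only mean that ICMP is filtered somewhere along the path.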
# traceroute www.baidu.com
traceroute to www.baidu.com (180.101.50.242), 30 hops max, 60 byte packets
1 * * *
2 10.6.1.1 (10.6.1.1) 0.729 ms 1.493 ms 0.682 ms
3 171.84.1.145 (171.84.1.145) 5.554 ms 6.539 ms 4.377 ms
4 10.250.0.9 (10.250.0.9) 7.654 ms 9.507 ms 8.480 ms
5 * * 10.34.0.2 (10.34.0.2) 1.435 ms
6 * * *
7 * * *
8 36.110.63.237 (36.110.63.237) 3.988 ms 4.979 ms 3.848 ms
9 219.141.140.73 (219.141.140.73) 3.992 ms 2.958 ms 3.919 ms
10 * 36.112.241.89 (36.112.241.89) 2.862 ms bj141-152-73.bjtelecom.net (219.141.152.73) 3.952 ms
11 202.97.92.198 (202.97.92.198) 24.938 ms 202.97.98.2 (202.97.98.2) 20.983 ms *
12 180.110.207.14 (180.110.207.14) 20.995 ms 180.110.207.2 (180.110.207.2) 21.953 ms 180.110.207.26 (180.110.207.26) 18.918 ms
13 58.213.95.210 (58.213.95.210) 21.881 ms 23.892 ms 19.861 ms
14 180.101.50.242 (58.213.96.50) 35.925 ms 34.898 ms 30.966 ms
Traceroute may seem like a great tool, but it's important to understand its limitations. As with ping, intermediate routers may filter the packets that traceroute relies on, such as ICMP Time Exceeded messages. More importantly, the paths that traffic takes to and from a destination are not necessarily symmetrical and are not always the same. Traceroute may fool you into thinking that your traffic follows a nice linear path both to and from its destination, but this is rarely the case. Traffic may follow different return paths, and paths can change dynamically for many reasons. While traceroute may provide an accurate representation of the path in a small business network, it is generally not accurate when tracing across a large network or the Internet.
Another common problem is a missing upstream gateway for a particular route, or a missing default route. When an IP packet is sent to a different network, it must be sent to a gateway for further processing. The gateway should know how to route the packet to its final destination. The list of gateways for different routes is stored in the routing table, which can be inspected and manipulated using the ip route command:
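When reading a long trace, it can be handy to list the hops that never answered. The sketch below assumes the default three probes per hop, so a fully silent hop is printed as the hop number followed by three asterisks; silent_hops is a hypothetical helper name.

```shell
# silent_hops: read traceroute output on stdin and print the number of
# every hop where all three probes went unanswered ("N * * *").
silent_hops() {
    awk 'NF == 4 && $2 == "*" && $3 == "*" && $4 == "*" { print $1 }'
}
```

It would be used as: traceroute www.baidu.com | silent_hops. Silent hops in the middle of a trace that still reaches its destination usually indicate filtering or rate limiting on that router rather than a broken path.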
# ip route show
default via 10.6.80.1 dev ens192 proto static metric 100
10.6.80.0/24 dev ens192 proto kernel scope link src 10.6.80.202 metric 100
Simple topologies usually have only one default gateway configured, indicated by the "default" entry at the top of the table. A missing or incorrect default gateway is a common problem.
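A quick scripted check for the default gateway might look like the following sketch, which assumes the "default via <gateway>" form shown above; default_gw is a hypothetical helper name.

```shell
# default_gw: read `ip route show` output on stdin and print the gateway
# address of every default route. Prints nothing if there is none.
default_gw() {
    awk '$1 == "default" && $2 == "via" { print $3 }'
}
```

It would be used as: ip route show | default_gw. An empty result means there is no default route through a gateway, so traffic to remote networks has nowhere to go.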
If our topology is more complex and we need to set up different routes for different networks, we can check the routes for a specific prefix:
# ip route show 10.6.80.0/24
10.6.80.0/24 dev ens192 proto kernel scope link src 10.6.80.202 metric 100
Note: DNS is not a Layer 3 protocol, but it is worth mentioning when talking about IP addresses.
A clear sign of a DNS problem is being able to connect to a remote host by its IP address, but not by its hostname. A quick nslookup query on the hostname can tell us a lot (nslookup is part of the bind-utils package on Red Hat Enterprise Linux systems):
# nslookup www.baidu.com
Server: 114.114.114.114
Address: 114.114.114.114#53
Non-authoritative answer:
www.baidu.com canonical name = www.a.shifen.com.
Name: www.a.shifen.com
Address: 180.101.50.242
Name: www.a.shifen.com
Address: 180.101.50.188
Name: www.a.shifen.com
Address: 240e:e9:6002:15a:0:ff:b05c:1278
Name: www.a.shifen.com
Address: 240e:e9:6002:15c:0:ff:b015:146f
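To feed the resolved addresses into another command, they can be extracted from the answer section. This sketch assumes the layout shown above, where each Name: line is followed by its Address: line; resolved_addrs is a hypothetical helper name, and nslookup output can differ between versions.

```shell
# resolved_addrs: read nslookup output on stdin and print each address
# from the answer section. The resolver's own address near the top is
# skipped because it is not preceded by a Name: line.
resolved_addrs() {
    awk '$1 == "Name:" { if ((getline) > 0 && $1 == "Address:") print $2 }'
}
```

It would be used as: nslookup www.baidu.com | resolved_addrs. No output at all suggests that the name did not resolve.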
Linux Network Troubleshooting: Transport Layer
The transport layer consists of TCP and UDP protocols, where TCP is a connection-oriented protocol and UDP is connectionless. Applications listen on sockets, which consist of an IP address and a port. Traffic sent to an IP address on a specific port will be routed by the kernel to the listening application.
It is often useful to see which ports are listening on the local host, for instance when you are unable to connect to a specific service on the machine, such as a web or SSH server. Another common problem is a daemon or service that fails to start because something else is already listening on its port. The ss command is very valuable for these checks:
# ss -tunlp4
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
udp UNCONN 0 0 *:1194 *:* users:(("openvpn",pid=1023,fd=8))
udp UNCONN 0 0 *:69 *:* users:(("xinetd",pid=3232,fd=5))
tcp LISTEN 0 128 *:80 *:* users:(("nginx",pid=16373,fd=6),("nginx",pid=16372,fd=6))
tcp LISTEN 0 128 *:22 *:* users:(("sshd",pid=1008,fd=3))
The flags mean:
- -t: Display TCP sockets.
- -u: Display UDP sockets.
- -n: Show numeric ports instead of resolving service names.
- -l: Show only listening sockets.
- -p: Show the process using each socket.
- -4: Show only IPv4 sockets.
Looking at the output, we can see several listening services. The sshd application is listening on port 22 on all IP addresses, indicated by the *:22 output. To test whether a TCP port accepts connections, you can use Telnet or Netcat. The examples below connect to a MySQL server on port 3306:
# telnet 127.0.0.1 3306
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
J
-=3t24d?
!?.S=`pmysql_native_password^CConnection closed by foreign host.
# nc -v 127.0.0.1 3306
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:3306.
J
j89=;Y3x//.mysql_native_password
To test UDP, you can use Netcat with the -u flag:
# nc 127.0.0.1 -u 80
Ncat: Connection refused.
The netstat command can show the same information:
# netstat -tunlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 16372/nginx: master
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1008/sshd
tcp6 0 0 :::3306 :::* LISTEN 1473/mysqld
tcp6 0 0 :::8080 :::* LISTEN 12503/httpd
tcp6 0 0 :::80 :::* LISTEN 16372/nginx: master
tcp6 0 0 :::22 :::* LISTEN 1008/sshd
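To check from a script whether anything is listening on a given port, the ss listing shown earlier can be filtered. This sketch assumes the ss column layout (Netid, State, Recv-Q, Send-Q, Local Address:Port, ...), not the netstat layout; listeners_on is a hypothetical helper name.

```shell
# listeners_on: read `ss -tunl` output on stdin and print the protocol and
# local address of every socket bound to the port given as first argument.
# Exit status is 0 if at least one socket matched, 1 otherwise.
listeners_on() {
    awk -v port="$1" '
        $1 == "tcp" || $1 == "udp" {
            # The port is whatever follows the last colon in column 5,
            # which also works for IPv6 addresses such as [::]:22.
            n = split($5, part, ":")
            if (part[n] == port) { print $1, $5; found = 1 }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

It would be used as: ss -tunl | listeners_on 80. The exit status makes it usable in a condition, for example before starting a service that needs the port.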
Conclusion
These are the basic tools commonly used for Linux network troubleshooting; I hope they are helpful to you.