
Quick Recipes for Finding Intranetwork Issues
Sooner or later you will find issues in your networked hosts. Sometimes these issues are trivial to diagnose (a machine is really down), but sometimes it takes everything you know, and then some, to figure out what is wrong.
In this short piece I’ll show you a few techniques you can use to troubleshoot issues with your hosts. Hopefully they will shorten the time you need to figure out what is wrong and how to fix it.
Keep in mind that on a small intranet you have much more control over the topology and access to the network equipment. In a more complex setup (like an inter-data-center private cloud) you will most likely have to work with the networking, security, and even application teams.
Also, this piece doesn’t cover issues with advanced containerized applications that create their own virtual networks.
Finally, you should know some networking basics that are not covered here, such as what an IP address is, what network masks are, and what DNS is.
Time to move on with the scenarios.
Unknown host
This one is pretty common, and most of the time it points to a simple DNS problem (DNS being the translation of host names to IP addresses):
[josevnz@dmaf5 ~]$ ssh raspberrypi.holmes
ssh: Could not resolve hostname raspberrypi.holmes: Name or service not known
How do you confirm this is indeed a bad host name? You can use nslookup and a little experimentation:
[josevnz@dmaf5 ~]$ nslookup raspberrypi.holmes
Server: 127.0.0.53
Address: 127.0.0.53#53
** server can't find raspberrypi.holmes: NXDOMAIN
[josevnz@dmaf5 ~]$ nslookup raspberrypi
Server: 127.0.0.53
Address: 127.0.0.53#53
Non-authoritative answer:
Name: raspberrypi.home
Address: 192.168.1.27
Name: raspberrypi.home
Address: fd22:4e39:e630:1:dea6:32ff:fef9:4748
You can see here that by removing the domain name we got the right host: the machine is registered as raspberrypi.home, not raspberrypi.holmes. Sometimes the DNS entry really is missing (and then you need to fix the corresponding A or CNAME record on the name server). If you are in a hurry, a quick workaround is to add the entry to the /etc/hosts file (but this hack only ‘fixes’ the machine where you are currently running the test):
sudo -i
echo '192.168.1.27 raspberrypi.holmes raspberrypi' >> /etc/hosts
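If you end up applying this workaround more than once, it is easy to leave duplicate lines behind. Here is a minimal sketch of an append that checks first. The function name add_host_entry is made up for this example, and it only looks at the first name you pass in:

```shell
# add_host_entry FILE IP NAME [ALIAS...]   (hypothetical helper, not a standard tool)
# Appends "IP NAME ALIAS..." to FILE unless NAME already appears as a host
# name or alias on a non-comment line. Pass /etc/hosts (as root) for real use.
add_host_entry() {
    file=$1; ip=$2; shift 2
    # Scan every field after the address column, stopping at inline comments.
    if awk -v name="$1" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) { if ($i ~ /^#/) break; if ($i == name) found = 1 } }
        END { exit !found }' "$file" 2>/dev/null; then
        echo "already present: $1"
    else
        printf '%s %s\n' "$ip" "$*" >> "$file"
        echo "added: $1"
    fi
}
```

For example, as root: add_host_entry /etc/hosts 192.168.1.27 raspberrypi.holmes raspberrypi. Running it a second time should leave the file untouched.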
What if you want to inspect the whole zone? If you have an intranet DNS server you may be allowed to run a ‘Zone transfer’ and check the contents of the domain yourself for errors:
dig AXFR yourdomain.com @$nameserver
You can find out which one is your ‘$nameserver’ by checking your /etc/resolv.conf. If it lists 127.0.0.53, you are probably going through the local stub resolver, so a little bit of digging is required:
[josevnz@dmaf5 ~]$ sudo systemd-resolve --status
[sudo] password for josevnz:
Global
    Protocols: LLMNR=resolve -mDNS -DNSOverTLS DNSSEC=no/unsupported
    resolv.conf mode: stub
Link 2 (enp2s0)
    Current Scopes: none
    Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Link 3 (eno1)
    Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
    Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
    Current DNS Server: fd22:4e39:e630:1:3a5b:5eff:fe1d:5272
    DNS Servers: 192.168.1.1 fd22:4e39:e630:1:3a5b:5eff:fe1d:5272
    DNS Domain: home
Link 4 (wlp4s0)
    Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
    Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
    Current DNS Server: 192.168.1.1
    DNS Servers: 192.168.1.1 fd22:4e39:e630:1:3a5b:5eff:fe1d:5272
    DNS Domain: home
Link 5 (docker0)
    Current Scopes: LLMNR/IPv4 LLMNR/IPv6
    Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Link 7 (vethc560c19)
    Current Scopes: LLMNR/IPv6
    Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
You can see that every link with DNS configured (eno1 and wlp4s0) lists the same servers, 192.168.1.1 among them. That is the real DNS server you are using.
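If you would rather script this lookup, here is a small sketch that pulls the non-loopback nameserver entries out of a resolv.conf-style file. The function name real_nameservers is just for illustration. On hosts running systemd-resolved, the upstream servers are normally also written to /run/systemd/resolve/resolv.conf, but treat that path as an assumption to verify on your distribution:

```shell
# real_nameservers [FILE]   (hypothetical helper)
# Prints each "nameserver" address from a resolv.conf-style file, skipping
# loopback stubs such as 127.0.0.53. FILE defaults to /etc/resolv.conf.
real_nameservers() {
    awk '$1 == "nameserver" && $2 !~ /^127\./ && $2 != "::1" { print $2 }' \
        "${1:-/etc/resolv.conf}"
}
```

An empty result from real_nameservers on /etc/resolv.conf tells you that only the stub is listed there, so try real_nameservers /run/systemd/resolve/resolv.conf next.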
Keep in mind zone transfers may not be allowed or supported, so you may need to reach out to the DNS administrator:
[josevnz@dmaf5 ~]$ dig AXFR home @192.168.1.1
;; Connection to 192.168.1.1#53(192.168.1.1) for home failed: connection refused.
;; Connection to 192.168.1.1#53(192.168.1.1) for home failed: connection refused.
;; Connection to 192.168.1.1#53(192.168.1.1) for home failed: connection refused.
Bad cable problem
It is an oversimplification, but you can use iperf3 to test bandwidth capacity, or to flood your network with packets so that any errors become more obvious. For example, you can start a server on your remote host and then point a client at it from your local machine:
[root@macmini2 ~]# iperf3 --server
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
[josevnz@dmaf5 SuricataLog]$ iperf3 --client macmini2 --time 30 --zerocopy
Connecting to host macmini2, port 5201
[  5] local 192.168.1.26 port 35778 connected to 192.168.1.16 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.08 MBytes  17.5 Mbits/sec    0    121 KBytes
[  5]   1.00-2.00   sec  2.04 MBytes  17.1 Mbits/sec    0    208 KBytes
[  5]   2.00-3.00   sec  2.10 MBytes  17.6 Mbits/sec    0    302 KBytes
[  5]  12.00-13.00  sec  2.50 MBytes  21.0 Mbits/sec  109    622 KBytes
[  5]  13.00-14.00  sec  2.50 MBytes  21.0 Mbits/sec  510    451 KBytes
# ... Omitting output
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  71.6 MBytes  20.0 Mbits/sec  657             sender
[  5]   0.00-30.02  sec  69.1 MBytes  19.3 Mbits/sec                  receiver
Here the retransmission count (Retr) should be as close to zero as possible; the 657 retransmissions in this run are a sign of trouble.
You can also ‘reverse’ the traffic flow by passing -R, so that the server sends the data back to you.
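If you run this test often, you may want to pull the retransmission total out of the output automatically. This is a rough sketch that assumes the plain-text TCP summary format shown above, where the line ending in "sender" carries the Retr total. The function name iperf3_retransmits is invented for the example:

```shell
# iperf3_retransmits   (hypothetical helper)
# Reads iperf3 client text output on stdin and prints the total Retr count
# taken from the last summary line that ends in "sender".
iperf3_retransmits() {
    awk '$NF == "sender" { retr = $(NF - 1) } END { if (retr != "") print retr }'
}
```

You could use it as: iperf3 --client macmini2 --time 30 | iperf3_retransmits. For anything more serious, iperf3 also has a --json flag whose output is less fragile to parse than the text table.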
So how do you test for a bad cable problem? You need to look for CRC errors. Ethtool can help:
josevnz@raspberrypi:~$ sudo ethtool -S eth0|grep -P -i 'crc|error'
rx_errors: 0
tx_errors: 0
rx_crc: 0
rxq0_errors: 0
rxq1_errors: 0
rxq2_errors: 0
rxq3_errors: 0
rxq16_errors: 0
These counters should be zero, especially rx_crc. A counter higher than zero could indicate a bad cable, a bad network card, or a bad port on the switch.
One way to confirm whether you have an issue goes like this:
- Take a snapshot of the interface counters: sudo ethtool -S eth0 | grep -P -i 'crc|error' > start.txt
- Run iperf3 for at least 5 minutes using a server/client pair. The client can run on the machine where you saw the non-zero rx_crc counter.
- Take another ethtool snapshot and compare the counters. Did they increase after you ran iperf3?
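The comparison in the last step can be done by eye, but here is a small sketch that does it for you. It assumes both snapshots are "name: value" lines as saved by the ethtool command above. The function name changed_counters and the file name end.txt are made up for the example:

```shell
# changed_counters START END   (hypothetical helper)
# Compares two "name: value" snapshots (as saved from ethtool -S) and prints
# every counter whose value grew, with the before and after values.
changed_counters() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }            # drop leading indentation
        NR == FNR { before[$1] = $2; next }    # first file: remember the values
        ($1 in before) && $2 + 0 > before[$1] + 0 {
            printf "%s: %s -> %s\n", $1, before[$1], $2
        }' "$1" "$2"
}
```

After the iperf3 run, save a second snapshot to end.txt and run changed_counters start.txt end.txt. No output means none of the counters grew during the test.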
It is always useful to check whether your switch is showing any errors on its ports. Depending on your switch vendor (Arista, Cisco) you may need to try different things.
Be ready to work with your network administrators.
How to fix
Like all network issues, this one requires trial and error:
- Always start by replacing the cheapest component: the cables.
- If the issue persists, swap the network cards.
- If nothing has worked so far, move the connection to a different physical port on the switch. If switching ports fixes the issue, do you see CRC errors on other servers that are connected to the same switch? How do the stats look on the switch port?
Slow connectivity
Using ping, traceroute and tracepath
One of the easiest ways to test connectivity is ping:
ping uses the ICMP protocol’s mandatory ECHO_REQUEST datagram to elicit an ICMP ECHO_RESPONSE from a host or gateway.
It also comes in especially handy for testing connectivity issues if you know the average round-trip time between two hosts in your intranet:
[josevnz@dmaf5 ~]$ ping -c 50 macmini2
PING macmini2 (192.168.1.16) 56(84) bytes of data.
64 bytes from macmini2 (192.168.1.16): icmp_seq=1 ttl=64 time=4.74 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=2 ttl=64 time=0.528 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=3 ttl=64 time=0.655 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=5 ttl=64 time=0.665 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=6 ttl=64 time=1.29 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=7 ttl=64 time=0.494 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=8 ttl=64 time=0.795 ms
# ... Omitting output for clarity
--- macmini2 ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50158ms
rtt min/avg/max/mdev = 0.417/0.786/4.026/0.676 ms
An average round trip of 0.786 ms doesn’t look bad. Let’s overload the network on purpose to illustrate how the round trip of ICMP packets can be affected by heavy traffic.
On the server we run iperf3:
[root@macmini2 ~]# iperf3 --server
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
And on the client we will run it for 5 minutes (300 seconds):
[josevnz@dmaf5 ~]$ iperf3 --client macmini2 --time 300
And in a new terminal on the client we re-run our ping command:
[josevnz@dmaf5 ~]$ ping -c 50 macmini2
PING macmini2 (192.168.1.16) 56(84) bytes of data.
64 bytes from macmini2 (192.168.1.16): icmp_seq=1 ttl=64 time=187 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=3 ttl=64 time=91.5 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=5 ttl=64 time=166 ms
# ... Omitting output
--- macmini2 ping statistics ---
50 packets transmitted, 32 received, 36% packet loss, time 49672ms
rtt min/avg/max/mdev = 30.867/156.345/439.803/107.634 ms
Look at those times jump. The average is now 156.345 ms, and on top of that we lost 36% of our packets. Not good.
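If you want to keep a baseline and compare it with later runs, the two numbers that matter can be extracted from the ping summary. This is a minimal sketch that assumes the Linux iputils output format shown above; other ping implementations label the last line differently. The function name ping_summary is made up:

```shell
# ping_summary   (hypothetical helper)
# Reads Linux iputils ping output on stdin; prints packet loss and average RTT.
ping_summary() {
    awk '
        /packet loss/ { for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i }
        /^rtt /       { split($4, t, "/"); avg = t[2] }
        END           { printf "loss=%s avg_ms=%s\n", loss, avg }'
}
```

For example: ping -c 50 macmini2 | ping_summary. Save the result from a quiet network, and you have something to compare against the next time users complain about slowness.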
Ping won’t tell you whether the issue is caused by an extra hop between you and your destination.
Another tool we can use is traceroute, which is more useful as you can see how many hops are between your client and your destination:
traceroute tracks the route packets taken from an IP network on their way to a given host. It utilizes the IP protocol’s time to live (TTL) field and attempts to elicit an ICMP TIME_EXCEEDED response from each gateway along the path to the host.
Let’s run traceroute without DNS resolution (-n) to speed up the results (showing an unloaded network below):
[josevnz@dmaf5 ~]$ traceroute -n macmini2
traceroute to macmini2 (192.168.1.16), 30 hops max, 60 byte packets
1 192.168.1.16 1.182 ms 1.176 ms 1.272 ms
A single hop, around 1.176 ms. Running traceroute again while iperf3 is flooding the network:
traceroute to macmini2 (192.168.1.16), 30 hops max, 60 byte packets
1 192.168.1.16 73.463 ms 73.505 ms 73.580 ms
All the way up to 73.505 ms. Notice that traceroute doesn’t give you a packet-loss summary the way ping does.
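On a single-hop path like this one there is not much to count, but on longer routes it can be handy to record the number of hops and notice when it changes. Here is a tiny sketch that assumes the traceroute -n layout shown above, where each hop line starts with its hop number. The function name hop_count is invented:

```shell
# hop_count   (hypothetical helper)
# Reads "traceroute -n" output on stdin and prints the number of hops,
# counting the lines whose first field is a hop number.
hop_count() {
    awk '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}
```

For example: traceroute -n macmini2 | hop_count. A number that suddenly grows compared to yesterday is a hint that the route changed.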
An alternative to traceroute is tracepath. It has fewer options than traceroute and, unlike some traceroute flags, doesn’t require elevated privileges to run.
[josevnz@dmaf5 ~]$ tracepath -n macmini2
1?: [LOCALHOST] pmtu 1500
1: 192.168.1.16 1.300ms reached
1: 192.168.1.16 2.116ms reached
Resume: pmtu 1500 hops 1 back 1
How to fix
This one is tricky, as it may be an issue with intermediate routers or even with the destination itself.
Things you can check:
- How much bandwidth is being used? You may need to measure at the switch level to see whether the path is saturated.
- If you see lost packets, then maybe replacing the cable is all you need.
- You can also check for errors on the network interfaces. If you find any, you may need to replace the network card; before trying that, just switch ports on the destination switch.
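For the bandwidth question, you can get a rough idea from the host itself before asking for switch statistics. The sketch below samples the kernel’s received-bytes counter under /sys/class/net twice and converts the difference to Mbit/s. It only sees traffic on that one interface, so it is no substitute for measuring at the switch. The function names are made up for this example:

```shell
# mbits_per_sec BYTES_START BYTES_END SECONDS   (hypothetical helper)
# Converts a byte-counter difference over an interval into Mbit/s.
mbits_per_sec() {
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (b - a) * 8 / s / 1000000 }'
}

# iface_rx_rate IFACE [SECONDS]   (hypothetical helper, Linux only)
# Samples the received-bytes counter of IFACE twice and prints Mbit/s.
iface_rx_rate() {
    stats=/sys/class/net/$1/statistics/rx_bytes
    secs=${2:-5}
    start=$(cat "$stats") || return 1
    sleep "$secs"
    end=$(cat "$stats") || return 1
    mbits_per_sec "$start" "$end" "$secs"
}
```

For example, iface_rx_rate eno1 10 prints the average receive rate over ten seconds. Swap rx_bytes for tx_bytes if you care about the outgoing direction.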
Service is not reachable on destination port
There are many ways to check for this error. The first one is to use a client that speaks the proper protocol. For example, say it is a web server that is not responding:
[josevnz@dmaf5 ~]$ curl --verbose --fail http://raspberrypi.home:8000/
* Trying fd22:4e39:e630:1:dea6:32ff:fef9:4748:8000...
* connect to fd22:4e39:e630:1:dea6:32ff:fef9:4748 port 8000 failed: Connection refused
* Trying 192.168.1.27:8000...
* connect to 192.168.1.27 port 8000 failed: Connection refused
* Failed to connect to raspberrypi.home port 8000 after 4 ms: Connection refused
* Closing connection 0
curl: (7) Failed to connect to raspberrypi.home port 8000 after 4 ms: Connection refused
A more generic test can be done with Netcat:
[josevnz@dmaf5 ~]$ nc raspberrypi.home 8000
Ncat: Connection refused.
In this case the server refused our TCP handshake. But what if your nc command just hangs and eventually times out?
[josevnz@dmaf5 ~]$ nc raspberrypi.home 8000
Ncat: TIMEOUT.
There is a chance that a port filter (firewall) sits in the way and your request is being silently dropped. Fixing this requires opening the port, either with iptables on the host itself or on the network firewall if the host is behind one. If you have nmap you can confirm whether the port is being filtered:
[josevnz@dmaf5 ~]$ sudo nmap raspberrypi.home -p 8000
Starting Nmap 7.93 ( https://nmap.org ) at 2023-01-08 05:59 EST
Nmap scan report for raspberrypi.home (192.168.1.16)
Host is up (0.024s latency).
PORT STATE SERVICE
8000/tcp filtered http-proxy
Nmap done: 1 IP address (1 host up) scanned in 0.72 seconds
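If neither nc nor nmap is installed on the machine you are testing from, bash itself may be enough. The sketch below assumes your bash was built with /dev/tcp support (most distribution builds are) and that the coreutils timeout command is available. The function names are made up, and the three labels are a simplification of what nmap reports:

```shell
# describe_exit CODE   (hypothetical helper)
# Maps the exit status of the connection attempt to a short label.
describe_exit() {
    case "$1" in
        0)   echo "open" ;;
        124) echo "timeout (possibly filtered)" ;;   # 124 is what timeout(1) returns
        *)   echo "refused or unreachable" ;;
    esac
}

# check_port HOST PORT [SECONDS]   (hypothetical helper)
# Tries a TCP connection through the bash /dev/tcp pseudo-device, giving up
# after SECONDS (default 3), and prints the label for the outcome.
check_port() {
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
    describe_exit "$?"
}
```

For example: check_port raspberrypi.home 8000. A name that doesn’t resolve also ends up as "refused or unreachable", so rule out DNS first.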
Wrapping up
This is just a quick introduction to some common techniques used to troubleshoot common network problems. There are many more protocols, like UDP, that were not covered here and that require slightly different tools and approaches. For those, it is always good to become familiar with the protocol and the service that has the problem, and with their expected behavior.