Quick Recipes for Finding Intranetwork Issues

Sooner or later you will run into issues with your networked hosts. Sometimes these issues are trivial to diagnose (a machine is really down), but sometimes it takes everything you know, and then some, to figure out what is wrong.
In this short piece I’ll show you a few techniques you can use to troubleshoot issues with your hosts. Hopefully they will shorten the time you need to figure out what is wrong and how to fix it.
Keep in mind that on a small intranet you have a lot more control over the topology and access to the network equipment. If you are dealing with a more complex setup (like an inter-data-center private cloud), you will most likely have to work with the networking, security and even application teams.
Also, this doesn’t cover issues with advanced containerized applications that create their own virtual networks.
Finally, you should know some networking basics that will not be covered here, like what an IP address, a network mask and DNS are.
Time to move on with the scenarios.

Unknown host

This one is pretty common and most of the time points to a simple DNS problem (DNS is what translates a host name into an IP address):

[josevnz@dmaf5 ~]$ ssh raspberrypi.holmes
ssh: Could not resolve hostname raspberrypi.holmes: Name or service not known

How do you confirm this is indeed a bad host name? You can use nslookup and a little experimentation:

[josevnz@dmaf5 ~]$ nslookup raspberrypi.holmes
Server: 127.0.0.53
Address: 127.0.0.53#53
** server can't find raspberrypi.holmes: NXDOMAIN

[josevnz@dmaf5 ~]$ nslookup raspberrypi
Server: 127.0.0.53
Address: 127.0.0.53#53
Non-authoritative answer:
Name: raspberrypi.home
Address: 192.168.1.27
Name: raspberrypi.home
Address: fd22:4e39:e630:1:dea6:32ff:fef9:4748

You can see here that by removing the domain name we got the right host: the machine is registered as raspberrypi.home, not raspberrypi.holmes. Sometimes the DNS entry is indeed missing (and then you need to fix the corresponding A or CNAME record on the name server). If you are in a hurry, a quick workaround is to add the entry to the /etc/hosts file (but this hack only ‘fixes’ the machine where you are running the test):

sudo -i

echo '192.168.1.27 raspberrypi.holmes raspberrypi' >> /etc/hosts
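
If you apply this workaround more than once, it is easy to end up with duplicate lines in /etc/hosts. Below is a minimal sketch of a guard, assuming the usual hosts-file layout (address first, names after it). The names host_in_file and add_host_entry are hypothetical helpers made up for this example, not standard commands:

```shell
# Hypothetical helper: succeeds if NAME already appears as a host name
# (any column after the address) in a hosts-format FILE.
host_in_file() {
    name="$1"
    file="$2"
    awk -v name="$name" '
        { sub(/#.*/, "") }                                   # ignore comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }                                  # 0 when found
    ' "$file"
}

# Hypothetical helper: append "ADDRESS NAME..." only if the first NAME
# is not in FILE yet.
add_host_entry() {
    file="$1"; addr="$2"; shift 2
    if host_in_file "$1" "$file"; then
        echo "$1 is already present in $file"
    else
        echo "$addr $*" >> "$file"
    fi
}

# Example (as root):
#   add_host_entry /etc/hosts 192.168.1.27 raspberrypi.holmes raspberrypi
```

Remember to remove the entry once the real DNS record is fixed, otherwise the stale line will keep overriding DNS on that machine.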

What if you want to inspect the whole zone? If you have an intranet DNS server you may be allowed to run a ‘Zone transfer’ and check the contents of the domain yourself for errors:

dig AXFR yourdomain.com @$nameserver

You can find out which one is your ‘$nameserver’ by checking /etc/resolv.conf. If it lists 127.0.0.53, you are most likely using the local stub resolver (systemd-resolved), so a little more digging is required:

[josevnz@dmaf5 ~]$ sudo systemd-resolve --status
[sudo] password for josevnz:
Global
Protocols: LLMNR=resolve -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 2 (enp2s0)
Current Scopes: none
Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Link 3 (eno1)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: fd22:4e39:e630:1:3a5b:5eff:fe1d:5272
DNS Servers: 192.168.1.1 fd22:4e39:e630:1:3a5b:5eff:fe1d:5272
DNS Domain: home
Link 4 (wlp4s0)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.1.1
DNS Servers: 192.168.1.1 fd22:4e39:e630:1:3a5b:5eff:fe1d:5272
DNS Domain: home
Link 5 (docker0)
Current Scopes: LLMNR/IPv4 LLMNR/IPv6
Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Link 7 (vethc560c19)
Current Scopes: LLMNR/IPv6
Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported

You can see that the links with a DNS scope (eno1 and wlp4s0) list the same DNS servers, and 192.168.1.1 is among them. That is the real DNS server you are using.
Keep in mind zone transfers may not be allowed or supported, so you may need to reach out to the DNS administrator:

[josevnz@dmaf5 ~]$ dig AXFR home @192.168.1.1
;; Connection to 192.168.1.1#53(192.168.1.1) for home failed: connection refused.
;; Connection to 192.168.1.1#53(192.168.1.1) for home failed: connection refused.
;; Connection to 192.168.1.1#53(192.168.1.1) for home failed: connection refused.
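
Picking the name servers out of the resolver status by eye gets tedious when there are many links. The sketch below pulls them out with awk, assuming the servers sit on the same line as the ‘DNS Servers:’ label, as in the output shown earlier (other systemd versions may wrap them over several lines). list_dns_servers is a hypothetical helper name:

```shell
# Hypothetical helper: read resolver status text on stdin and print each
# distinct address found on "DNS Servers:" lines, one per line.
# "Current DNS Server:" lines are deliberately not matched.
list_dns_servers() {
    awk '
        /DNS Servers:/ {
            sub(/.*DNS Servers:[ \t]*/, "")        # keep only the addresses
            for (i = 1; i <= NF; i++)
                if (!seen[$i]++) print $i          # print each address once
        }
    '
}

# Example: try a zone transfer of the "home" domain against every server.
#   systemd-resolve --status | list_dns_servers | while read -r ns; do
#       dig AXFR home @"$ns"
#   done
```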

Bad cable problem

It is an oversimplification, but you can use iperf3 to test bandwidth capacity, or to flood your network with packets so that any errors on the link become more obvious. For example, you can start a server on your remote host and then run the client against it from your machine:

[root@macmini2 ~]# iperf3 --server
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------

[josevnz@dmaf5 SuricataLog]$ iperf3 --client macmini2 --time 30 --zerocopy
Connecting to host macmini2, port 5201
[ 5] local 192.168.1.26 port 35778 connected to 192.168.1.16 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.08 MBytes 17.5 Mbits/sec 0 121 KBytes
[ 5] 1.00-2.00 sec 2.04 MBytes 17.1 Mbits/sec 0 208 KBytes
[ 5] 2.00-3.00 sec 2.10 MBytes 17.6 Mbits/sec 0 302 KBytes
[ 5] 12.00-13.00 sec 2.50 MBytes 21.0 Mbits/sec 109 622 KBytes
[ 5] 13.00-14.00 sec 2.50 MBytes 21.0 Mbits/sec 510 451 KBytes
# ... Omitting output
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 71.6 MBytes 20.0 Mbits/sec 657 sender
[ 5] 0.00-30.02 sec 69.1 MBytes 19.3 Mbits/sec receiver

Here the retransmission count (Retr) should be as close to zero as possible.
You can also ‘reverse’ the traffic flow by passing -R, so that the server sends the data back to you.
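
If you run this test regularly, you may want to pull the total retransmission count out of the report instead of reading it by eye. Here is a small sketch that assumes the plain-text TCP summary format shown above, where the Retr value is the field right before the word ‘sender’. retr_total is a hypothetical helper name:

```shell
# Hypothetical helper: read iperf3 client output on stdin and print the
# Retr value from the "sender" summary line.
retr_total() {
    awk '$NF == "sender" { print $(NF - 1) }'
}

# Example:
#   iperf3 --client macmini2 --time 30 | retr_total
```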
So how do you test for a bad cable problem? You need to look for CRC errors. Ethtool can help:

josevnz@raspberrypi:~$ sudo ethtool -S eth0|grep -P -i 'crc|error'
rx_errors: 0
tx_errors: 0
rx_crc: 0
rxq0_errors: 0
rxq1_errors: 0
rxq2_errors: 0
rxq3_errors: 0
rxq16_errors: 0

These counters should be zero, especially rx_crc. A counter higher than zero could indicate a bad cable, a bad network card or a bad port on the switch.
One way to confirm whether you have an issue goes like this:

Take a snapshot of the interface counters: sudo ethtool -S eth0|grep -P -i 'crc|error' > start.txt
Run iperf3 for at least 5 minutes using a server/client pair. The client can be on the machine where you saw the rx_crc counter.
Take another ethtool snapshot and compare the counters. Did they increase after you ran iperf3?
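
The comparison in the last step can be sketched as a small awk script. It assumes both snapshots use the ‘name: value’ layout that ethtool -S prints and that the first snapshot is not empty. counters_increased is a hypothetical helper name, and end.txt is just an assumed name for the second snapshot:

```shell
# Hypothetical helper: compare two "name: value" snapshots and print
# every counter whose value grew between the first and the second file.
counters_increased() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }            # ethtool indents the names
        NR == FNR { before[$1] = $2; next }    # first file: remember values
        ($1 in before) && ($2 + 0 > before[$1] + 0) {
            printf "%s: %s -> %s\n", $1, before[$1], $2
        }
    ' "$1" "$2"
}

# Example:
#   sudo ethtool -S eth0 | grep -P -i 'crc|error' > start.txt
#   ... run iperf3 for a few minutes ...
#   sudo ethtool -S eth0 | grep -P -i 'crc|error' > end.txt
#   counters_increased start.txt end.txt
```

No output means none of the captured counters moved during the test.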

It is always useful to check whether your switch is showing any errors on its ports. Depending on your switch vendor (Arista, Cisco) you may need to try different things.
Be ready to work with your network administrators.

How to fix

Like all network issues, this one requires trial and error:

Always start by replacing the cheapest component: the cables.
If the issue persists, swap the network cards.
If nothing has worked so far, move the connection to a different physical port on the switch. If that fixes the issue, do you see CRC errors on other servers connected to the same switch? How do the stats look on the original switch port?

Slow connectivity

Using ping, traceroute and tracepath

One of the easiest ways to test connectivity is ping:
ping uses the ICMP protocol’s mandatory ECHO_REQUEST datagram to elicit an ICMP ECHO_RESPONSE from a host or gateway.
It comes in especially handy for spotting connectivity issues if you know how long the average round trip between two hosts in your intranet normally takes:

[josevnz@dmaf5 ~]$ ping -c 50 macmini2
PING macmini2 (192.168.1.16) 56(84) bytes of data.
64 bytes from macmini2 (192.168.1.16): icmp_seq=1 ttl=64 time=4.74 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=2 ttl=64 time=0.528 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=3 ttl=64 time=0.655 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=5 ttl=64 time=0.665 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=6 ttl=64 time=1.29 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=7 ttl=64 time=0.494 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=8 ttl=64 time=0.795 ms
# ... Omitting output for clarity
--- macmini2 ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50158ms
rtt min/avg/max/mdev = 0.417/0.786/4.026/0.676 ms

An average round trip of 0.786 ms doesn’t look bad. Let’s overload the network on purpose to illustrate how the round trip of ICMP packets can be affected by heavy traffic.
On the server we run iperf3:

[root@macmini2 ~]# iperf3 --server
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------

And on the client we will run it for 5 minutes (300 seconds):

[josevnz@dmaf5 ~]$ iperf3 --client macmini2 --time 300

And in a new terminal on the client we re-run our ping command:

[josevnz@dmaf5 ~]$ ping -c 50 macmini2
PING macmini2 (192.168.1.16) 56(84) bytes of data.
64 bytes from macmini2 (192.168.1.16): icmp_seq=1 ttl=64 time=187 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=3 ttl=64 time=91.5 ms
64 bytes from macmini2 (192.168.1.16): icmp_seq=5 ttl=64 time=166 ms
# ... Omitting output
--- macmini2 ping statistics ---
50 packets transmitted, 32 received, 36% packet loss, time 49672ms
rtt min/avg/max/mdev = 30.867/156.345/439.803/107.634 ms

Look at those times jump. The average is now 156.345 ms, and on top of that we lost 36% of our packets. Not good.
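
To compare a baseline run against a run under load without scrolling through the output, you can reduce each run to its two key numbers. This is a minimal sketch that assumes the summary format printed by the Linux ping shown above (a ‘packet loss’ line followed by an ‘rtt min/avg/max/mdev’ line). ping_summary is a hypothetical helper name:

```shell
# Hypothetical helper: read ping output on stdin and print the packet
# loss percentage and the average round trip time on one line.
ping_summary() {
    awk '
        /packet loss/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /^rtt / {
            split($4, t, "/")      # $4 holds min/avg/max/mdev
            avg = t[2]
        }
        END { printf "loss=%s avg_ms=%s\n", loss, avg }
    '
}

# Example:
#   ping -c 50 macmini2 | ping_summary
```

If every packet is lost, ping prints no rtt line and avg_ms comes out empty, which is a signal in itself.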
Ping won’t tell you whether the issue is caused by an extra hop between you and your destination.
Another tool we can use is traceroute, which is more useful here because it shows how many hops there are between your client and your destination:
traceroute tracks the route packets take across an IP network on their way to a given host. It utilizes the IP protocol’s time to live (TTL) field and attempts to elicit an ICMP TIME_EXCEEDED response from each gateway along the path to the host. Let’s run traceroute without DNS resolution to speed up the results (showing an unloaded network below):

[josevnz@dmaf5 ~]$ traceroute -n macmini2
traceroute to macmini2 (192.168.1.16), 30 hops max, 60 byte packets
1 192.168.1.16 1.182 ms 1.176 ms 1.272 ms

A single hop, 1.176 ms. Running traceroute while iperf3 is flooding the network:

traceroute to macmini2 (192.168.1.16), 30 hops max, 60 byte packets
1 192.168.1.16 73.463 ms 73.505 ms 73.580 ms

All the way up to 73.505 ms. Notice that traceroute doesn’t report a packet loss percentage the way ping does.
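
On longer paths it helps to boil the three probe times of every hop down to one number, so that the slow hop stands out. The sketch below assumes the numeric (-n) output format shown above, where each probe time is followed by the unit ‘ms’. hop_averages is a hypothetical helper name:

```shell
# Hypothetical helper: read "traceroute -n" output on stdin and print
# the average probe time for each hop.
hop_averages() {
    awk '
        $1 ~ /^[0-9]+$/ {                       # hop lines start with a number
            sum = 0; n = 0
            for (i = 2; i < NF; i++)
                if ($(i + 1) == "ms") { sum += $i; n++ }
            if (n) printf "hop %d avg %.3f ms\n", $1, sum / n
            else   printf "hop %d no reply\n", $1
        }
    '
}

# Example:
#   traceroute -n macmini2 | hop_averages
```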
An alternative to traceroute is tracepath; it has fewer options than traceroute and, unlike some traceroute flags, doesn’t require elevated privileges to run:

[josevnz@dmaf5 ~]$ tracepath -n macmini2
1?: [LOCALHOST] pmtu 1500
1: 192.168.1.16 1.300ms reached
1: 192.168.1.16 2.116ms reached
Resume: pmtu 1500 hops 1 back 1

How to fix

This one is tricky, as it may be an issue with intermediate routers or even with the destination itself.
Things you can check:

How much bandwidth is being used? You may need to measure at the switch level to see whether the path is saturated.
If you see lost packets, then maybe replacing the cable is all you need.
You can also check for errors on the network interfaces. If you find any, you may need to replace the network card, but before trying that just switch ports on the destination switch.

Service is not reachable on destination port

There are many ways to check for this error; the first one is to use a client that speaks the proper protocol. For example, say a web server is not responding:

[josevnz@dmaf5 ~]$ curl --verbose --fail http://raspberrypi.home:8000/
* Trying fd22:4e39:e630:1:dea6:32ff:fef9:4748:8000...
* connect to fd22:4e39:e630:1:dea6:32ff:fef9:4748 port 8000 failed: Connection refused
* Trying 192.168.1.27:8000...
* connect to 192.168.1.27 port 8000 failed: Connection refused
* Failed to connect to raspberrypi.home port 8000 after 4 ms: Connection refused
* Closing connection 0
curl: (7) Failed to connect to raspberrypi.home port 8000 after 4 ms: Connection refused

A more generic test can be done with Netcat:

[josevnz@dmaf5 ~]$ nc raspberrypi.home 8000
Ncat: Connection refused.

In this case the server refused our TCP handshake. But what if your nc command just hangs and eventually times out?

[josevnz@dmaf5 ~]$ nc raspberrypi.home 8000
Ncat: TIMEOUT.

There is a chance that a port filter (firewall) is in the way and your request is being silently dropped. Fixing this requires opening the port (on the host itself if it is using iptables, or at the firewall level if the host sits behind a firewall). If you have nmap you can confirm whether the port is being filtered:

[josevnz@dmaf5 ~]$ sudo nmap raspberrypi.home -p 8000
Starting Nmap 7.93 ( https://nmap.org ) at 2023-01-08 05:59 EST
Nmap scan report for raspberrypi.home (192.168.1.16)
Host is up (0.024s latency).
PORT STATE SERVICE
8000/tcp filtered http-proxy
Nmap done: 1 IP address (1 host up) scanned in 0.72 seconds
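
If neither nc nor nmap is installed, the difference between a refused connection and a timeout can still be observed with bash alone. This is a rough sketch that assumes bash (for its /dev/tcp redirection) and the timeout command from GNU coreutils, which exits with status 124 when the time limit is hit. The names probe_port and classify_probe and the 3 second limit are choices made for this example:

```shell
# Map the exit status of the probe below to a label (assumed mapping).
classify_probe() {
    case "$1" in
        0)   echo "open" ;;
        124) echo "timeout (possibly filtered)" ;;
        *)   echo "refused or unreachable" ;;
    esac
}

# Hypothetical helper: attempt a TCP connection to HOST PORT using the
# /dev/tcp redirection built into bash, giving up after 3 seconds.
probe_port() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
    classify_probe "$?"
}

# Example:
#   probe_port raspberrypi.home 8000
```

A timeout only hints at filtering; a slow or overloaded host can produce the same result, so treat the label as a starting point and not as proof.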

Wrapping up

This is just a quick introduction to some common techniques used to troubleshoot common network problems. There are many more protocols, like UDP, that were not covered here and that require slightly different tools and approaches. For those, it is always good to become familiar with the protocol and the service that has the problem, and with their expected behavior.