As I’m delivering my Linux Troubleshooting training soon, I am going to blog about some typical issues and techniques we’ll troubleshoot in the class too.
I’ll start from a relatively simple problem - logging in to a server via SSH always takes 10 seconds. The delay seems to be pretty constant, there don’t seem to be major network problems and the server is not overloaded. Yet, remote logins always take 10 seconds.
If you’ve been around, you probably already know a couple of likely causes for this, but I want to approach the problem systematically and show how to troubleshoot such issues without relying on lucky guesses or previous experience with the usual suspects. You never know, next time the root cause may be different - or you may have to troubleshoot a completely different application.
Anyway, let’s get started. I’ll use the time command to measure how long it takes to SSH to my server and run the hostname command there:
$ time ssh demo@linux01 hostname
linux01.localdomain

real    0m10.282s
user    0m0.013s
sys     0m0.009s
The command above takes a bit over 10 seconds every time I execute it. We can’t immediately assume that this must be a server-side thing (clients can also have problems), so let’s drill down deeper.
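If you want to confirm that the delay is consistent rather than a one-off hiccup, a quick-and-dirty sketch is to simply repeat the measurement a few times:

$ for i in 1 2 3; do time ssh demo@linux01 hostname; done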
I’ll use “application instrumentation” built into the SSH client itself, namely the -vvv “very verbose” flag. As there’s a lot of output, I’ve removed some irrelevant sections and replaced them with “…”; my inline comments are marked with “<--”:
$ ssh -vvv demo@linux01 hostname
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug2: resolving "linux01" port 22
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to linux01 [192.168.0.50] port 22.    <-- looks like the TCP connection
debug1: Connection established.                          <-- was established OK
debug1: identity file /home/demo/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
...
debug2: key: /home/demo/.ssh/id_rsa (0x55a289ec2360)
debug3: send packet: type 5
debug3: receive packet: type 7
debug1: SSH2_MSG_EXT_INFO received                       <-- But the SSH "app level"
...
debug1: kex_input_ext_info: server-sig-algs=<rsa-sha2-256,rsa-sha2-512>
debug3: receive packet: type 6                           <-- ... handshake continues
debug2: service_accept: ssh-userauth
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug3: send packet: type 50                             <-- after this line we get
                                                         <-- the 10 second hang!
debug3: receive packet: type 51
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password
debug3: start over, passed a different list publickey,gssapi-keyex,gssapi-with-mic,password
debug3: preferred gssapi-keyex,gssapi-with-mic,publickey,keyboard-interactive,password
debug3: authmethod_lookup gssapi-keyex
...
So, the SSH verbose mode log is telling us that:
- The TCP-level connection is successfully established
- The SSH application-level handshake continues
- The SSH client sends a packet to the server (send packet: type 50) and hangs for 10 seconds
- The server response arrives (receive packet: type 51) and everything proceeds normally
So, we are waiting for 10 seconds for the server to get back to us with some kind of response (after successfully establishing a TCP connection). This doesn’t seem to be a network problem, but rather a server-side issue with sshd.
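As a quick cross-check of the network layer, you could time a bare TCP connection to port 22 from the client, without any SSH protocol on top. A rough sketch (netcat option syntax varies between distributions and netcat flavors):

$ time nc -z -w 5 linux01 22

If this returns in milliseconds while the full SSH login still takes 10 seconds, it points further away from the network and toward something the server does after the connection is established.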
One might raise the question of whether to trust the SSH client logs at all - maybe there’s something going on under the hood that we don’t see from the client side and it’s not a server problem after all. This is a valid question, but in the interest of brevity, I choose to trust the application logs for now and take a look into the server next.
So let’s log in to the server and look into the SSH daemon there:
$ ps -ef --sort start_time | grep [s]shd
root     10635     1  0 Mar05 ?      00:00:00 /usr/sbin/sshd -D     <-- the daemon that
root      2628 10635  0 12:24 ?      00:00:00 sshd: tanel [priv]        listens on port 22
tanel     2766  2628  0 12:24 ?      00:00:00 sshd: tanel@pts/0
root     16695 10635  0 14:40 ?      00:00:00 sshd: tanel [priv]
tanel    16796 16695  0 14:40 ?      00:00:00 sshd: tanel@pts/1
The above output is the “normal” state for my server, where no-one is waiting to log in. Now let’s run the earlier SSH command from the client side again and see what happens:
$ ps -ef --sort start_time | grep [s]shd
root     10635     1  0 Mar05 ?      00:00:00 /usr/sbin/sshd -D
root      2628 10635  0 12:24 ?      00:00:00 sshd: tanel [priv]
tanel     2766  2628  0 12:24 ?      00:00:00 sshd: tanel@pts/0
root     16695 10635  0 14:40 ?      00:00:00 sshd: tanel [priv]
tanel    16796 16695  0 14:40 ?      00:00:00 sshd: tanel@pts/1
root     21515 10635  0 14:48 ?      00:00:00 sshd: [accepted]
sshd     21516 21515  0 14:48 ?      00:00:00 sshd: [net]
A couple of new processes have appeared. Looks like sshd creates a new process for the incoming connection and helpfully renames the process “comment” to sshd: [accepted], so we can see what stage this process is currently in (PostgreSQL does this too, and it makes getting visibility into the app from the OS level much easier). However, we still don’t know why one (or both) of these processes hangs for 10 seconds. Let’s drill even deeper and measure what these server-side processes are doing during a login operation.
Now is the time to say that there are multiple ways to dig deeper and multiple tools that can be used. I will start with one of the oldest Linux tracing tools available, strace. It is a great tool for quickly tracing the system calls issued by a process and, optionally, its child processes too.
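If you have never used strace before, a harmless way to get a feel for its output format is to trace a short-lived, trivial command first, for example:

$ strace -T cat /etc/hostname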
Note that strace is not a 100% safe tool to use on critical processes in production. It uses the Linux ptrace() system call under the hood, which allows peeking inside other processes and even modifying their memory and registers. There’s a small chance of messing up the signal handling of, or crashing, a busy process while strace runs. Additionally, strace will greatly slow down the traced process while enabled. Nevertheless, it is often the quickest and most practical tool for troubleshooting what a process is up to, especially outside production. It is also one of the few options you have when you don’t have the root privileges needed for installing or accessing more advanced tools. I will write about other techniques in upcoming articles.
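As an aside, if you already suspect a particular class of system calls, you can reduce the amount of output with strace’s -e filtering. A sketch (the exact -e syntax accepted depends on your strace version):

$ sudo strace -f -r -T -e trace=network -p <PID>

In our case we didn’t know in advance which syscalls would take the time, so I’ll trace everything.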
I will run the following command on the main SSH daemon process (pid 10635). I have to run it as root, as the target process is owned by root. You can run man strace to learn more about the tool, so here I will explain only the command line options that I’ll use below:
- -r : Print the elapsed time between system call starts (the 2nd column in the output below)
- -T : Print the elapsed times (durations) of syscalls (the rightmost fields like “<0.000364>”)
- -f : Follow and trace child processes too
- -p : Attach to PID 10635 and trace it
Ok, let’s do some tracing!
$ sudo strace -r -T -f -p 10635
strace: Process 10635 attached
     0.000000 select(8, [3 4], NULL, NULL, NULL) = 1 (in [3]) <3.491450>
     3.491638 accept(3, {sa_family=AF_INET, sin_port=htons(65534), sin_addr=inet_addr("192.168.0.179")}, [16]) = 5 <0.000364>
...
The select syscall above (waiting for data from file descriptors) took 3.49 seconds to complete simply because it took me over 3 seconds to run the SSH command from my other terminal window after starting tracing. Then accept accepted a connection on an AF_INET type (TCP or UDP) socket from the IP 192.168.0.179.
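As a side note, while a process is running you can check what its file descriptors currently point to under /proc. For example, using the daemon PID from the ps output above:

$ sudo ls -l /proc/10635/fd

Here file descriptor 3 would show up as the listening socket, and the accept() call above returned a new file descriptor 5 for the incoming connection.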
...
     0.000470 fcntl(5, F_GETFL) = 0x2 (flags O_RDWR) <0.000013>
     0.000039 pipe([6, 7]) = 0 <0.000018>
     0.000042 socketpair(AF_LOCAL, SOCK_STREAM, 0, [8, 9]) = 0 <0.000391>
     0.000415 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f720cf74b90) = 18978 <0.000809>
The clone() syscall above essentially forks a new child process. clone() can also create threads within an existing process address space, but in this case there’s no CLONE_VM flag specified, so we get a whole new process with its own virtual address space. Since there’s more than one process/task to trace now, strace starts prefixing each output line with the corresponding PID (to be precise, it’s the Linux task ID, but let’s talk about that some other day).
strace: Process 18978 attached
[pid 10635]      0.000871 close(7 <unfinished ...>
[pid 10635]      0.000012 <... close resumed> ) = 0 <0.000030>
[pid 10635]      0.000027 write(8, "\0\0\2\276\0", 5) = 5 <0.000018>
[pid 10635]      0.000045 write(8, "\0\0\2\271\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHostKey"..., 701 <unfinished ...>
[pid 18978]      0.000022 open("/proc/self/oom_score_adj", O_WRONLY|O_CREAT|O_TRUNC, 0666 <unfinished ...>
[pid 10635]      0.000031 <... write resumed> ) = 701 <0.000042>
...
[pid 18978]      0.000034 execve("/usr/sbin/sshd", ["/usr/sbin/sshd", "-D", "-R"], [/* 4 vars */]) = 0 <0.000604>
[pid 18978]      0.000666 brk(NULL) = 0x5556214a6000 <0.000013>
[pid 18978]      0.000053 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f549d63b000 <0.000014>
...
The execve() call above is where the new child process starts executing the /usr/sbin/sshd binary and allocates some memory too (brk(), mmap() with the MAP_PRIVATE|MAP_ANONYMOUS flags). And now we are finally getting to the interesting part, given that we are troubleshooting an SSH login delay. All the other system calls have executed very quickly, but if you look at the timing columns below, you’ll see that there are 2 poll() syscalls that each took around 5 seconds before timing out:
[pid 18978]      0.000067 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 4 <0.000027>
[pid 18978]      0.000050 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("1.2.3.4")}, 16) = 0 <0.000019>
[pid 18978]      0.000047 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}]) <0.000013>
[pid 18978]      0.000042 sendto(4, "\316\342\1\0\0\1\0\0\0\0\0\0\003179\0010\003168\003192\7in-ad"..., 44, MSG_NOSIGNAL, NULL, 0) = 44 <0.000030>
[pid 18978]      0.000055 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005176>
[pid 18978]      5.005277 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}]) <0.000039>
[pid 18978]      0.000137 sendto(4, "\316\342\1\0\0\1\0\0\0\0\0\0\003179\0010\003168\003192\7in-ad"..., 44, MSG_NOSIGNAL, NULL, 0) = 44 <0.000075>
[pid 18978]      0.000151 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005136>
[pid 18978]      5.005220 close(4) = 0 <0.000023>
[pid 18978]      0.000074 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4 <0.000167>
[pid 18978]      0.000204 fstat(4, {st_mode=S_IFREG|0644, st_size=97160, ...}) = 0 <0.000014>
[pid 18978]      0.000044 mmap(NULL, 97160, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7f549d623000 <0.000019>
[pid 18978]      0.000044 close(4) = 0 <0.000012>
...
You’ll need to look at the rightmost numbers (like “<5.005176>”) to see the duration of a given system call, as the timing column on the left shows the elapsed time since the start of the previous syscall, attributed to the line it is printed on. So in both cases the poll([{fd=4, events=POLLIN}], 1, 5000) syscalls were waiting for 5 seconds each before timing out. POLLIN means that they were trying to read something from somewhere, and luckily the first argument, fd=4, tells us where we tried to read from - file descriptor number 4.
All we need to do now is walk backward in the trace to find out what file descriptor #4 actually pointed to in that process at the time. You’d just scroll upwards in the file and eyeball & search for syscalls like open(), socket(), dup() that return file descriptors. Or, with bigger trace files, just search backward for “= 4” to find the immediately preceding syscall that created and returned file descriptor 4 as its result. You’d need to make sure you find the closest preceding syscall that returns 4 as a file descriptor number (not some other random syscall that happens to return the number 4 with a different meaning). man syscalls and man <syscall_name> provide all the info you need.
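If you have the trace written to a file (see the appendix below), a rough sketch of that backward search, with line numbers included so you can jump around in an editor, would be:

$ grep -n "= 4 <" /tmp/x.trc

You’d then pick the match closest above the slow poll() lines.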
I’m pasting the top 2 lines from the previous output here again for reference (formatted to fit on screen):
[pid 18978]      0.000067 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 4 <0.000027>
[pid 18978]      0.000050 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("1.2.3.4")}, 16) = 0 <0.000019>
When I searched backward for “= 4”, starting from the slow poll() syscall in the trace, I found the socket() syscall shown above. It has created an “internet” socket (AF_INET), and it’s not a TCP socket but a UDP one (due to the SOCK_DGRAM flag). The next connect() system call associates this socket with IP address 1.2.3.4 & port 53. Because of the stateless nature of UDP “connections”, just running connect() on a UDP socket will not actually check whether the target host exists and is responding. That’s why the connect() syscall “succeeded” in just 19 microseconds.
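You can see the same behaviour without sshd or strace. As a rough illustration (assuming your bash build has the /dev/udp pseudo-device enabled), “connecting” a UDP socket to an unreachable address still succeeds immediately:

$ bash -c 'exec 3<>/dev/udp/1.2.3.4/53 && echo "UDP connect OK"'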
So I could just ping the host 1.2.3.4, but that wouldn’t be fully systematic, as our SSH daemon is not timing out on a ping attempt. Additionally, some servers (or firewalls) block ICMP ping traffic, so it may look like the whole server is down when it’s really just the ping packets that don’t get through.
We could just google what port 53 typically means, but there’s plenty of information stored within Linux itself:
$ grep " 53/" /etc/services domain 53/tcp # name-domain server domain 53/udp
It’s the DNS service port! It’s a DNS lookup! Apparently, the SSH Daemon is trying to talk to a DNS server (on IP 1.2.3.4) whenever my test user is trying to log in. Many of you probably knew from the start that a usual suspect causing such problems is a reverse DNS lookup where the daemon wants to find the fully qualified domain name of whatever client IP address is connected to it. This is useful for logging purposes (but less useful for security & whitelisting as DNS could be hijacked/spoofed).
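To reproduce what the daemon is apparently doing, you could issue the reverse lookup of the client IP against that nameserver yourself. A sketch, assuming the dig utility is installed:

$ dig -x 192.168.0.179 @1.2.3.4

Against a dead nameserver like this one, the lookup just sits there and eventually times out, much like our logins do.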
Let’s use nslookup to do a DNS lookup using IP 1.2.3.4:
$ nslookup
> server 1.2.3.4
Default server: 1.2.3.4
Address: 1.2.3.4#53
>
> blog.tanelpoder.com
;; connection timed out; no servers could be reached
>
Indeed, the request hung for 10 seconds and then timed out. So this server is down (or there’s some network or configuration issue going on). Let’s check what the default nameserver is for my server where the SSH daemon runs:
$ cat /etc/resolv.conf
search localdomain
nameserver 1.2.3.4
Ok, “someone” has configured this server’s default nameserver to be IP 1.2.3.4. And since it’s not responding whenever the SSH daemon attempts its reverse lookup, we have this problem. So all I need to do is configure my server to use a functioning nameserver. In my case I changed the nameserver value to 192.168.0.1 and the problem was solved:
$ time ssh demo@linux01 hostname
linux01.localdomain

real    0m0.273s
user    0m0.013s
sys     0m0.010s
You may have seen such problems in your environment in the past; typically they happen when the enterprise networking folks change some firewall rules and a server in one region can no longer reach a DNS server (in another region). Or when your computer’s VPN/DHCP client messes around with your nameserver settings on connect (but doesn’t change them back when disconnecting from the VPN).
Appendix: Analyzing Large Trace Files
If you experiment with strace, you’ll quickly see that it writes a lot of output. You can ask strace to trace only specific system calls of interest; however, in our scenario we didn’t know in advance which syscall type would take time. So, full tracing with some postprocessing may be needed in order to find the longest-taking system calls in a big tracefile. Here are a couple of examples of writing the trace output to a file (-o /tmp/x.trc) and extracting the top syscall durations from that tracefile:
$ sudo strace -o /tmp/x.trc -r -T -f -p 10635
Now we have the trace output written to a file. The 2nd column shows the elapsed time between syscall starts (i.e. between consecutive trace lines being written):
$ tail /tmp/x.trc
5206       0.000037 close(5) = 0 <0.000013>
5206       0.000035 exit_group(255) = ?
5206       0.000771 +++ exited with 255 +++
10635      0.000017 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted if no handler) <0.361501>
So we could just use sort -k 2 to sort this text based on column 2, and also treat the values as numeric when sorting (-n):
$ sort -nrk 2 /tmp/x.trc | head
5206       5.005217 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}]) <0.000026>
5206       5.005187 close(4) = 0 <0.000021>
10635      3.178587 accept(3, {sa_family=AF_INET, sin_port=htons(55183), sin_addr=inet_addr("192.168.0.179")}, [16]) = 5 <0.000428>
5206       0.048381 read(4, "14\0unconfined_u:system_r:oddjob_"..., 4095) = 756 <0.000012>
5207       0.040639 read(3, "$\365a\306Zqa>\240\31\10\203\306\367\2201o?\345\\\22\7r\226K;\215`\330i\342P"..., 8192) = 44 <0.000013>
5206       0.014828 recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1`\0\0\0\215%\0\0=\0\0\0\5\1u\0\2\0\0\0", 24}], msg_controllen=24, [{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, [5]}], msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24 <0.000142>
5207       0.005040 write(4, "\0\0\0@\6", 5) = 5 <0.000022>
5207       0.003654 read(3, "\0\0\0\f\n\25\0\0\0\0\0\0\0\0\0\0", 8192) = 16 <0.000015>
5207       0.002423 read(3, "B\245z_+\315tb\241\220$\\\35f\250\30\274\266p\312\201\335\20\243A\377\374\17\177V6*"..., 8192) = 652 <0.000015>
5207       0.001710 read(3, "\0\0\0,\6\36\0\0\0 ,u\201\32$\26\27\20b\214&\246\265T\267u\243\242\2625\227\231"..., 8192) = 48 <0.000018>
The above output shows the trace lines that had the biggest time gap since the start of the previous syscall (thanks to the strace -r option). So the close(4) syscall itself didn’t take 5 seconds; rather, the syscall before it (plus whatever userland processing, scheduling delays etc.) took 5 seconds. You would need to open the .trc file in an editor and search for the time value (or use grep -B 1) to see what came just before these lines in the trace.
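For example, something along the lines of the following sketch (using the exact elapsed-time value from the output above as the search string) also prints the line that came just before it:

$ grep -B 1 "5.005187 close(4)" /tmp/x.trc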
Another approach, which uses the actual syscall duration (the rightmost field in the trace output, written thanks to the strace -T option), requires a bit of text processing with AWK:
$ awk '{ gsub("<", "", $NF) ; print $NF,$0 }' /tmp/x.trc | sort -nr | head
10.256167> 10635      0.000032 <... select resumed> ) = 1 (in [6]) 10.256167>
10.014919> 5207       0.000034 <... read resumed> "\0\0j%", 4) = 4 10.014919>
5.005120> 5206       0.000051 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout) 5.005120>
5.005108> 5206       0.000049 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout) 5.005108>
3.178382> 10635      0.000000 select(8, [3 4], NULL, NULL, NULL) = 1 (in [3]) 3.178382>
0.361501> 10635      0.000017 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted if no handler) 0.361501>
0.250243> 5206       0.000041 <... poll resumed> ) = 1 ([{fd=7, revents=POLLIN}]) 0.250243>
0.248214> 5274       0.000039 <... select resumed> ) = 1 (in [13]) 0.248214>
0.181922> 5275       0.000059 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 5284 0.181922>
0.179580> 5284       0.000016 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 5285 0.179580>
The command above sorts the trace by the last field of each line, the individual system call duration. Again, you could use the precise duration value (like 5.005108) to find the location of interest in the full tracefile for detailed analysis. Or make AWK or grep print out the tracefile line number somewhere too. Note that the 10+ second syscalls at the top of the output are due to me taking 10 seconds before issuing the demo SSH command in another terminal window (sshd had just been idle, waiting for anyone to connect and talk to it when I started tracing).
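A rough sketch of that line-number idea: have AWK print the tracefile line number (NR) next to each duration, so you can jump straight to the interesting spot in an editor:

$ awk '{ gsub("<", "", $NF) ; print $NF, NR, $0 }' /tmp/x.trc | sort -nr | head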
Summary
This concludes the systematic troubleshooting example in this article. strace is just one of the potential tools you could use; I will cover some other options in the future. Often it is “cheaper” to start by checking a couple of usual suspects (from experience) before starting to trace things. However, not all problems fall into the “usual suspects” category, so you need a systematic troubleshooting approach to be ready to deal with any issue.
If you want to get notified of new troubleshooting articles, you can follow me on Twitter - or sign up for weekly updates by email on the Connect page.