Over the years of troubleshooting performance problems in the Unix/Linux world, I have seen multiple cases where a regularly used command line tool on a customer server just stops working for some reason. The tool returns immediately, doing absolutely nothing: no output printed, no core dumps, and the exit code is zero (success!).
This article walks you through a couple of such incidents, and at the end I explain how I avoid accidentally doing bad stuff in production in general.
- The mysterious case of a broken application binary
- The mysterious case of a broken OS binary
- How to avoid the file clobbering problem?
- How to stay safe in shell?
- Limiting damage when accidentally pasting / typing in bad stuff
- Reducing the chance of accidentally pasting / typing in bad stuff
- Summary
The mysterious case of a broken application binary
Here’s a (manually reproduced) example of such a problem on a Linux server. The expdp command is the Oracle database’s high-speed data export tool, but this can happen to any file. Normally the output is this:

oracle@oel7l bin> pwd
/u01/app/oracle/product/18.0.0/dbhome_1/bin
oracle@oel7l bin>
oracle@oel7l bin> expdp help=y

Export: Release 18.0.0.0.0 - Production on Tue Mar 16 17:55:35 2021
Version 18.3.0.0.0

Copyright (c) 1982, 2018, Oracle and/or its affiliates. All rights reserved.

The Data Pump export utility provides a mechanism for transferring data objects between Oracle databases. The utility is invoked with the following command:

Example: expdp scott/tiger DIRECTORY=dmpdir DUMPFILE=scott.dmp

... lots of output removed...
However, when a customer tried to run that command in their environment one morning, this happened:
oracle@oel7l bin> expdp help=y
oracle@oel7l bin>
oracle@oel7l bin> echo $?
0
The expdp command had just stopped working overnight! It immediately returned, doing nothing, yet there were no error messages and even the shell command return code $? showed 0 - success.
Time to dig deeper! Let’s make sure that we are trying to execute the right command in its correct location:
oracle@oel7l bin> pwd
/u01/app/oracle/product/18.0.0/dbhome_1/bin
oracle@oel7l bin> which expdp
/u01/app/oracle/product/18.0.0/dbhome_1/bin/expdp
oracle@oel7l bin>
All looks correct so far, so let’s take a look at the binary itself:
oracle@oel7l bin> file expdp
expdp: empty
oracle@oel7l bin> ls -l expdp
-rwxr-x--x. 1 oracle oinstall 0 Mar 16 17:55 expdp
Boom! Something has truncated the file to zero bytes! The file’s last modification time may give some extra clues about what or who did it (was there some manual database software patching or release activity going on at that time?).
An easy quick check is to take a look into the shell history files (like .bash_history in users’ home directories) on that server. I’ll query my own user history with fc -l:
oracle@oel7l bin> fc -l
1044     rm ls
1045     cd bin/
1046     pwd
1047     expdp help=y
1048     oracle@oel7l bin> pwd
1049     /u01/app/oracle/product/18.0.0/dbhome_1/bin
1050     oracle@oel7l bin>
1051     oracle@oel7l bin> expdp help=y
1052     Export: Release 18.0.0.0.0 - Production on Tue Mar 16 17:55:35 2021
1053     Version 18.3.0.0.0
1054     Copyright (c) 1982, 2018, Oracle and/or its affiliates. All rights reserved.
1055     expdp help=y
1056     which expdp
1057     file expdp
1058     ls -l expdp
1059     pwd
oracle@oel7l bin>
Whoa, the history entries from 1048 to 1054 don’t look like proper shell commands at all! They look like someone accidentally pasted some random junk from their terminal screen back into the shell as commands!
Now, all of these pasted “commands” would have just errored out, as they are random terminal output and not valid shell commands, right? For example, trying to execute this command would give you an error…
$ oracle@oel7l bin
-bash: oracle@oel7l: command not found
… BUT, some of our commands have “>” signs in them too!
$ oracle@oel7l bin> expdp help=y
The above command itself did not succeed, because of the abovementioned “command not found” shell error. But the shell sets up the output redirection before it even tries to run the command, so the “output” of the failed command (zero bytes) still ends up in whatever file name comes after that “>” redirection character. Now, if you happen to be in your home directory or /tmp, you’d just accidentally create a new empty file called expdp there. However, if you happen to be in your application binary directory (like I was in this case), or your previous commands used the full executable path, you will end up clobbering the existing binary file. You’ll truncate it and replace it with a zero-byte file. And from then on, your shell happily executes the zero-byte “shell” script and returns success in doing nothing.
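You can reproduce this whole failure mode safely in a scratch directory. Here’s a minimal sketch with made-up file and command names - the final exit code 0 comes from bash falling back to interpreting the zero-byte file as a shell script (after the kernel refuses to execute it as a binary), and an empty script “succeeds”:

$ cd /tmp
$ echo '#!/bin/sh' > victim && chmod +x victim   # stand-in for a real binary/script
$ nosuchcmd> victim                              # pasted junk: the command itself fails...
-bash: nosuchcmd: command not found
$ wc -c victim                                   # ...but the redirect already truncated the file
0 victim
$ ./victim; echo $?                              # the empty "script" runs and reports success
0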
From my terminal scroll-back history of this experiment, you’ll see me accidentally pasting lots of junk from the screen back as commands - and not all of it is harmless:
oracle@oel7l bin>
oracle@oel7l bin> oracle@oel7l bin> oracle@oel7l bin> pwd
-bash: oracle@oel7l: command not found
oracle@oel7l bin> /u01/app/oracle/product/18.0.0/dbhome_1/bin
-bash: /u01/app/oracle/product/18.0.0/dbhome_1/bin: Is a directory
oracle@oel7l bin> oracle@oel7l bin>
-bash: syntax error near unexpected token `newline'
oracle@oel7l bin> oracle@oel7l bin>
-bash: syntax error near unexpected token `newline'
oracle@oel7l bin> oracle@oel7l bin> expdp help=y
-bash: oracle@oel7l: command not found
oracle@oel7l bin>
oracle@oel7l bin> Export: Release 18.0.0.0.0 - Production on Tue Mar 16 17:55:35 2021
-bash: Export:: command not found
oracle@oel7l bin> Version 18.3.0.0.0
-bash: Version: command not found
oracle@oel7l bin>
oracle@oel7l bin> Copyright (c) 1982, 2018, Oracle and/or its affiliates. All rights reserved.
-bash: syntax error near unexpected token `c'
oracle@oel7l bin>
It can actually get even worse! :-)
The mysterious case of a broken OS binary
On one system I once looked at, someone had managed to truncate /bin/ls as root! It started as an exciting “OMG did they delete the entire filesystem?!” exercise:
root@oel7l bin> ls /home
root@oel7l bin>
Ok, the home dir is gone? What about root?
root@oel7l bin> ls /
root@oel7l bin>
What, the root directory is gone too? How could I even log in if root is gone? Is someone still deleting stuff? Did we get hacked?!
Luckily, you don’t have to rely only on ls to list file & directory names. Besides find, you could use the shell’s built-in wildcard expansion:
root@oel7l bin> echo /*
/bin /boot /dev /etc /home /lib /lib64 /media /mnt /opt /proc /root /run /sbin /srv /sys /tmp /u01 /u02 /u03 /u04 /usr /var
root@oel7l bin>
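If you’d rather see one entry per line, the printf shell builtin can expand the same glob - a small variation of the trick above:

root@oel7l bin> printf '%s\n' /*
/bin
/boot
/dev
... and so on ...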
The files are still there! We can further examine file metadata with the file or stat commands (and glob wildcard expansion works with them too):
root@oel7l bin> file /
/: directory
root@oel7l bin>
root@oel7l bin> stat /
  File: ‘/’
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fc00h/64512d    Inode: 128         Links: 21
Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:root_t:s0
Access: 2021-03-16 19:14:49.953914433 -0400
Modify: 2018-09-14 19:01:29.695056064 -0400
Change: 2018-09-14 19:01:29.695056064 -0400
 Birth: -
We even see the last modification timestamp with the stat command (ls gets its info from the same place as stat under the hood).
So, there’s something wrong specifically with the ls binary. The next logical command to run is which ls. It shows from which directory in your PATH the shell picked up a file with that name that is accessible to you and has an “x” bit set. It is less known that which can also show whether someone has aliased your ls command to shutdown as a prank.
Let’s see where my ls binary is:
root@oel7l bin> which ls
/usr/bin/ls
Hmm, I had always thought that ls was in /bin, not /usr/bin. Let’s check whether the /bin directory is just a symlink:
root@oel7l bin> ls -ld /bin
root@oel7l bin>
Oops, I had already forgotten that our ls command is broken!
root@oel7l bin> file /bin
/bin: symbolic link to `usr/bin'
root@oel7l bin>
root@oel7l bin> stat /bin
  File: ‘/bin’ -> ‘usr/bin’
  Size: 7               Blocks: 0          IO Block: 4096   symbolic link
Device: fc00h/64512d    Inode: 773         Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:bin_t:s0
Access: 2021-03-16 17:38:36.389289020 -0400
Modify: 2018-08-19 11:29:23.860005567 -0400
Change: 2018-08-19 11:29:23.860005567 -0400
 Birth: -
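For completeness, readlink would have answered the symlink question directly too - another option that doesn’t depend on a working ls:

root@oel7l bin> readlink -f /bin
/usr/bin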
Both file and stat can tell us if we are dealing with a link and where it points to. Using the echo * pattern trick, I can see what other files we still have in /usr/bin:
root@oel7l bin> echo /usr/bin/ls*
/usr/bin/ls /usr/bin/lsattr /usr/bin/lsblk /usr/bin/lscpu /usr/bin/lsinitrd /usr/bin/lsipc /usr/bin/lslocks /usr/bin/lslogins /usr/bin/lsmem /usr/bin/lsns /usr/bin/lsscsi
root@oel7l bin> file /usr/bin/ls
/usr/bin/ls: empty
Indeed, the /usr/bin/ls file has been truncated to zero bytes by someone!
root@oel7l bin> stat /usr/bin/ls
  File: ‘/usr/bin/ls’
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: fc00h/64512d    Inode: 104356      Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:bin_t:s0
Access: 2021-03-16 17:57:58.944003009 -0400
Modify: 2021-03-16 17:57:54.594064500 -0400
Change: 2021-03-16 17:57:54.594064500 -0400
 Birth: -
The file’s last modification date may give me extra insight into what/who might have accidentally overwritten it. The .bash_history shows a similar pattern of “someone” pasting in junk from the terminal screen:
root@oel7l bin> fc -l
1005     ls /var/run
1006     ls /var/log
1007     ls /var/local
1008     root@oel7l bin> ls /var/local
1009     root@oel7l bin>
1010     ls /var/local
1011     ls /
1012     which ls
1013     pwd
1014     ls -ld /bin
root@oel7l bin>
You would now need to restore your ls binary from a backup or reinstall it from its install package. Or, if not having a working ls command drives you nuts while doing the restore operation, you could create a temporary shell script based on shell wildcard expansion, or even an alias that just uses echo * :-)
root@oel7l bin> alias ls=echo
root@oel7l bin> ls /*
/bin /boot /dev /etc /home /lib /lib64 /media /mnt /opt /proc /root /run /sbin /srv /sys /tmp /u01 /u02 /u03 /u04 /usr /var
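As for the proper fix, here’s a minimal sketch of the reinstall route on a RHEL/OEL-style system. The package name is an assumption based on a typical EL7 install, where /usr/bin/ls belongs to coreutils - that’s what the rpm -qf check is for:

root@oel7l bin> rpm -qf /usr/bin/ls        # which package owns the truncated binary?
root@oel7l bin> yum reinstall -y coreutils # put the packaged files back
root@oel7l bin> rpm -V coreutils           # verify files against the package database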
How to avoid the file clobbering problem?
In bash you can just set -o noclobber! This tells the shell not to clobber (overwrite the contents of) an existing file with the redirection operator.
Let’s check the current value, create a test file and enable noclobber, to see how it helps:
$ set -o | grep clob
noclobber       off
$
$ echo hello > a
$
$ cat a
hello
$
$ set -o noclobber
$ set -o | grep clob
noclobber       on
$
Ok, now let’s try to overwrite that file with redirection:
$ echo hello > a
-bash: a: cannot overwrite existing file
$
Nice! Bash doesn’t allow overwriting the file a. Similarly, accidentally pasting in bad terminal output doesn’t overwrite the file:
$ badprompt> a
-bash: a: cannot overwrite existing file
$
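If you like this behavior, you can make it permanent in your shell startup file (assuming bash and a standard ~/.bashrc setup):

$ echo 'set -o noclobber' >> ~/.bashrc   # takes effect in new shells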
However, the noclobber option is not very fool-proof if your goal is to avoid accidental file modifications. For example, bash allows you to override the general noclobber setting and use >| to say that you really do want to clobber that file:
$ badprompt>| a
bash: badprompt: command not found...
$
$ cat a
$
It’s less likely that someone’s prompt ends with >|, but when you accidentally paste in random junk from the terminal, such character sequences may well happen. What’s more, noclobber does not prevent you from appending to the end of a file with >>:
$ set -o noclobber
$ cat a
$
$ echo hello > a
-bash: a: cannot overwrite existing file
$
$ cat a
$
We couldn’t overwrite the existing empty file with anything so far, thanks to noclobber. But let’s try to append:
$ echo hello >> a
$ echo hello >> a
$
$ cat a
hello
hello
$
No clobber, no problem!
This requires slightly more exotic circumstances: the accidentally executed command must actually exist and print something to its standard output, for anything to be appended to whatever filename comes after the >> redirection. Some examples from the Linux world would be the mysql or pcp users (in typical installations the executable command name matches the username), so some people may have prompts looking like mysql> or pcp> when logged in as these users. Nevertheless, pasting in unlucky enough junk from the terminal with >> in it may cause you to append random stuff to your existing binaries and scripts, instead of truncating them to zero. (Which one is better? 🤔)
What about the # in a typical root prompt? Everything that comes after a # character is gonna be a comment, right? I have allowed clobbering again and am using just echo commands here to keep things simpler:
$ cat a
hello
hello
$
$ echo zzz #> a
zzz
$
$ cat a
hello
hello
The above example didn’t try to overwrite my file, as the #> a part was treated as a comment. The repeated “zzz” you see above was just the standard output of the echo command, displayed on the terminal screen (the redirection to a file did not kick in, thanks to the comment # character). My file’s contents still say “hello”.
Now let’s make one last tiny change as I want my awesome-shell-prompt to be more compact. “I just trimmed some whitespace, I don’t think we need to test that”:
$ cat a
hello
hello
$ echo zzz#> a
$
$ cat a
zzz#
$
Oh crap, I should have tested that! A tiny change to whitespace changed the meaning of the echo command. Without the space between “zzz” and “#”, the shell treats it all as a single argument passed to the echo command (echo zzz#), and the output of that command (zzz#) was then redirected into my file a.
Hopefully by now it is pretty evident that, for the sake of the sanity of our production systems (and ourselves), you shouldn’t use > in your shell prompts. However, this alone will not guarantee avoiding problems related to accidentally pasting in other bad commands. For example, note the unintended space between the desired directory/filename prefix and the * in this example - don’t run it!:
# rm -rif /some/app/dir/oldlog_ *
The above command will try to erase a single file /some/app/dir/oldlog_ and anything that matches * in the current working directory!
How to stay safe in shell?
This stuff is complex! Small mistakes can come back to bite you in a variety of ways. We are running critical production systems, after all - how do we avoid these problems reliably, so that we don’t depend on human errors never happening and on always being lucky?
I am not addressing any higher-level solutions here, like the various immutable infrastructure-as-code platforms - they eliminate much of the “manual everyday typing” human error risk and shift the remaining risk to different layers.
If you actually need to log in to servers manually, then the best solution I know is to not use privileged access when you don’t need it. This reduces convenience, but increases safety.
Limiting damage when accidentally pasting / typing in bad stuff
- Do not log in as root or sudo into an interactive root shell, even on development machines - they’re someone’s production too!
- In production, don’t even log in as the database/application owner at the OS level (so that the malformed rm command above can’t erase important files)
- In production, don’t configure universal passwordless sudo for yourself
- In production, you can enable typical (diagnostic) commands via passwordless sudo, but everything else still needs a password (see the sketch after this list)
- In production, disable sudo credential caching (via /etc/sudoers)
- When applying OS changes in production, write them all into a script, test it and run that exact script with sudo + password
- sudo and /etc/sudoers aren’t meant only for gaining selective access to the root user - the target can be any other user too (dba, app owner)
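To illustrate the sudo-related points above, here’s a minimal /etc/sudoers.d sketch. The user name and command list are made up for this example, and sudoers files should always be edited with visudo:

# /etc/sudoers.d/diag
# allow a few diagnostic commands without a password...
myuser ALL=(root) NOPASSWD: /usr/bin/iostat, /usr/bin/vmstat, /usr/bin/strace
# ...and disable credential caching, so everything else re-authenticates every time
Defaults:myuser timestamp_timeout=0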
This way, even if you do paste in some accidental junk from your clipboard, you cannot mess things up under the other OS users, unless you are being deliberate or get extremely unlucky.
Reducing the chance of accidentally pasting / typing in bad stuff
Going back to the original topic of this article - pasting in random stuff from your computer’s clipboard is bad!
- For me, the first step of avoiding some of the paste horror is to not use a terminal which immediately pastes the clipboard on just a right mouse click. I accidentally right-click my mouse multiple times per day! I’ve used my current terminal for the last 13 years, as it gives me horizontal scrolling and doesn’t have the insane right-mouse-click pasting. I paste with a deliberate CMD+V, nothing else
- I use various notes.txt style files. When working interactively in production servers (having live performance troubleshooting fun!), I tend to write any non-trivial command into an editor in a separate window first (often testing the commands out in a similar test environment)
- I don’t copy&paste commands. I typically cut&paste back into the same text editor window, to make sure that the latest command was definitely put into the clipboard (I have had many occurrences of some browser or MS Word window just silently ignoring my copy commands)
- Once I am sure that the clipboard contains what I intended, I will immediately paste it into the production terminal window. Sometimes with a manually typed # prefix, just to double-check it before hitting “go”
- In extra paranoid mode, I double-check my clipboard contents even after visiting any browser windows (you know, because of pastejacking)
- You might think that you’d just mitigate all risk by always pasting stuff into a vim editor on the production server side, but what if your clipboard buffer contains an ^ESC:q!\nsomeverybadcommand\n sequence or an unfortunate vim macro? So I tend to cut & paste the clipboard into my local editor immediately before pasting it to the server
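One more shell-side guard worth adding to the list above: modern bash/readline supports bracketed paste, which keeps a pasted multi-line blob from executing until you actually press Enter. Assuming readline 7+ / bash 4.4+ (in bash 5.1 this is on by default):

$ echo 'set enable-bracketed-paste on' >> ~/.inputrc   # takes effect in new shells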
Summary
I hope that this was an entertaining read… and maybe it helps to explain that old mysterious incident where you had to restore only the mysql or sqlplus binary from a backup, while everything else seemed to be fine (other than your shell prompt having a > suffix ;-)
This file clobbering problem is just one example of how accidental input can mess things up on your servers, even if you don’t hit one of the worst-case rm -rf or shutdown commands. There are multiple reliable options for avoiding trouble and greatly reducing the local blast radius. Good command prompt hygiene, especially on important systems, reduces the amount of headaches as well. And after all of that, you’ll still need good backups.
If you want further reading, one more way to look into misbehaving binaries is explained in my earlier blog entry about troubleshooting sudden SSH logon delays via system call tracing with strace. While strace doesn’t trace the application’s internal user-space logic directly, it can still be very useful in cases like immediate, abrupt exits or “my config file changes are not picked up by the application” scenarios.