Work the Shell

Understanding Shell Script Shorthand

Dave Taylor

Issue #167, March 2008

Wherein we delve into the mysterious shell script authoring style of system scripts, deciphering common shorthand notations and exploring why they are a part of scripting. If you ever dig about in system scripts, you'll definitely want to read this column!

Oh happy day! I got an e-mail from a reader with a shell script question that didn't appear to be homework from a programming class or anything to do with hacking passwords. The reader wrote:

I am reading the scripts in the /etc/init.d directory. I am very new to such scripts and don't understand how they're written. In every script, there are statements like:
[ -x /usr/sbin/halt ] || exit 0
What is the meaning of this? Why is || used here?
Also, in the “stop” case of the halt dæmon init script, there is this sentence:
[ $RETVAL -eq 0 ] && touch /var/lock/subsys/$sname
I don't understand what these do. Can you explain?

With apologies to my old friend Larry Wall, this is what I call the “Perl syndrome” (though if we really want to go back in time, I saw this same problem with Algol-68 and PL/I, among others, and even worse in Ada)—obfuscated code because of the ability of programmers to abbreviate their code to make it shorter and, sometimes, more efficient.

Looking at the filesystem explains one of these structures. Check this out:

$ ls -l /bin/[

-r-xr-xr-x  2 root  wheel  46704 Sep 23 20:35 /bin/[*

$ ls -l /bin/test

-r-xr-xr-x  2 root  wheel  46704 Sep 23 20:35 /bin/test*

It may seem odd, but there's actually a file in the /bin directory in Linux that is called [, and it's synonymous with the test utility. You can learn about it by typing man test in a terminal window, but it's actually more complicated than that, because modern shells (such as Bash) have test built in to the shell code itself for performance reasons. So, there are actually three different versions of test: /bin/test, /bin/[ and the shell's own built-in.

If you do opt to use the [ version, the program requires that you have a matching ] for syntactic cleanliness (e-hygiene?). If you omit it, you'll get -bash: [: missing `]' as an error.
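If you want to convince yourself that the two spellings really are interchangeable, run the same check both ways. This is only a minimal sketch: the variable names are mine, and the root directory is used simply because it exists everywhere.

```shell
# The same directory test, spelled two ways.
if test -d /; then
    test_result="true"
else
    test_result="false"
fi

# With [ the closing ] is required as the final argument.
if [ -d / ]; then
    bracket_result="true"
else
    bracket_result="false"
fi

echo "test: $test_result  [: $bracket_result"

# Ask the shell which version it will actually run (wording varies by shell).
type [
type test
```

In Bash, the last two lines should report shell built-ins rather than the files in /bin, which is the performance shortcut mentioned above.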

So, that first statement, [ -x /usr/sbin/halt ] || exit 0, can be unwrapped initially as a test, and a quick glance at man test reveals that the -x test is for checking whether the named file exists and is executable. Basically, this statement ensures that /usr/sbin/halt is present and executable before the rest of the script tries to run it, which avoids errors later on. This is a portability test. If you are missing that program, you have some serious problems, but a lot of system scripts are written this way.
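To see what -x reports, here is a small sketch that tries it on one path that should be executable and one that should not exist. The helper name check_exec and the /nonexistent path are hypothetical, and I'm assuming /bin/sh is present on your system.

```shell
# Report whether a path passes the -x test.
# check_exec is a hypothetical helper name used only for this sketch.
check_exec() {
    if [ -x "$1" ]; then
        echo "executable"
    else
        echo "missing or not executable"
    fi
}

present=$(check_exec /bin/sh)
absent=$(check_exec /nonexistent/sbin/halt)

echo "/bin/sh: $present"
echo "/nonexistent/sbin/halt: $absent"
```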

Now, on to the || notation. Along with its partner &&, these two notations cause a lot of confusion for people delving into scripts, so let's start by reading what the Bash man page says about them (man bash):

command1 && command2 

command2 is executed if, and only if, command1 returns 
an exit status of zero. 

command1 || command2 

command2 is executed if and only if command1 returns 
a non-zero exit status.  

The return status of AND and OR lists is the exit 
status of the last command executed in the list.

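Clear as mud, right? This will become clearer when we go back to the test man page and find out that “The test utility exits with one of the following values: 0 = expression evaluated to true, 1 = expression evaluated to false or expression was missing.” In other words, the shell's convention is the reverse of most programming languages: zero means success or true, and anything else means failure or false.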
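A tiny experiment makes the connection between exit statuses and the two operators easier to see. This is just a sketch with made-up variable names; the comparisons are chosen so one test is obviously true and the other obviously false.

```shell
# A test that succeeds exits 0, so the command after && runs
# and records that status.
[ 1 -eq 1 ] && status_when_true=$?

# A test that fails exits 1, so the command after || runs
# and records that status.
[ 1 -eq 2 ] || status_when_false=$?

echo "true test exited with $status_when_true"     # prints 0
echo "false test exited with $status_when_false"   # prints 1
```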

So, the logic here is that the [ ] test is performed to see whether the program exists and is executable, and if the test fails, the exit 0 is performed and the script quits quietly. How do you know if it fails? The test statement would return an exit value of 1, which is non-zero, so the command after the || runs.
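Written out longhand, the guard is an ordinary if statement. The sketch below puts both forms side by side; the function names and the /nonexistent path are hypothetical, and each function body runs in a subshell so that exit 0 ends only that subshell rather than your terminal session.

```shell
# Shorthand form, as it appears in the init scripts.
guard_short() (
    [ -x "$1" ] || exit 0
    echo "continuing"
)

# The same logic spelled out with if.
guard_long() (
    if [ ! -x "$1" ]; then
        exit 0
    fi
    echo "continuing"
)

# With an executable present, both forms carry on.
guard_short /bin/sh
guard_long /bin/sh

# With the program missing, both forms exit silently.
guard_short /nonexistent/sbin/halt
guard_long /nonexistent/sbin/halt
```

Assuming /bin/sh exists, only the first two calls print anything.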

Now, let's look at the second statement with this in mind. You asked about this statement:

[ $RETVAL -eq 0 ] && touch /var/lock/subsys/$sname

Again, the [ is a shorthand notation for the test command. RETVAL is an ordinary variable that the script sets for itself earlier on (the name is a convention, not something the shell provides), and the -eq is a numeric test for equality. In this case, the exit status of the test again determines whether it is true or false. If it's true (a zero exit status), the touch command is used to set what's called a semaphore: a lock file to indicate to other scripts that the $sname subsystem is locked up and unavailable to modify.
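The word “numeric” matters here, because test also has a string comparison, =, that behaves differently. A quick sketch (the variable names are mine) shows the two disagreeing about the same pair of values:

```shell
# -eq compares integers, so 00 and 0 are equal.
if [ 00 -eq 0 ]; then
    numeric="equal"
else
    numeric="different"
fi

# = compares strings, so 00 and 0 are different.
if [ 00 = 0 ]; then
    string="equal"
else
    string="different"
fi

echo "-eq says $numeric, = says $string"
```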

This is actually a pretty sloppy way to set a semaphore because it's not atomic. There is a real chance that in the interim between the RETVAL test and the touch command, the script will be swapped out for a few milliseconds and another script run. This means that two scripts possibly could both believe they've locked the file, something called a race condition in computer science theory, and something that is obviously not a good thing.
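For comparison, one common way to get an atomic lock in a shell script is mkdir, which checks and creates in a single step. To be clear, this is a swapped-in technique for illustration, not what the init script does. The lock name is hypothetical, and I'm assuming mktemp is available (it isn't part of POSIX) so the sketch can run without root access.

```shell
lockroot=$(mktemp -d)          # scratch area standing in for /var/lock/subsys
lock="$lockroot/demo.lock"     # hypothetical lock name

# mkdir either creates the directory or fails; there is no gap
# between checking and creating for another script to slip into.
if mkdir "$lock" 2>/dev/null; then
    first="acquired"
else
    first="busy"
fi

# A second attempt while the lock is held must fail.
if mkdir "$lock" 2>/dev/null; then
    second="acquired"
else
    second="busy"
fi

echo "first attempt: $first, second attempt: $second"

# Release the lock and clean up the scratch area.
rmdir "$lock"
rmdir "$lockroot"
```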

Anyway, I'm not supposed to be debugging system scripts. So, suffice it to say that the purpose of the statement is to test the exit status of a previous command (there's probably a statement like RETVAL=$? on the previous line, as $? is shorthand for the exit status of the most recently executed command). If the test is true, the file is “touched” (that is, it's created if it doesn't already exist, and its access and modification times are set to the current date and time).
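Put together, the whole sequence probably looks something like the sketch below. I'm guessing at the surrounding lines, so treat it as an illustration: true stands in for whatever command the real script runs, the subsystem name is hypothetical, and a scratch directory from mktemp (assumed to be installed) replaces /var/lock/subsys so you can try it without being root.

```shell
lockdir=$(mktemp -d)     # stand-in for /var/lock/subsys
sname=demo               # hypothetical subsystem name

true                     # stand-in for a command that succeeds
RETVAL=$?                # save the status before another command overwrites $?

# Create the lock file only if the saved status was zero.
[ "$RETVAL" -eq 0 ] && touch "$lockdir/$sname"

if [ -e "$lockdir/$sname" ]; then
    lock_state="created"
else
    lock_state="absent"
fi
echo "RETVAL=$RETVAL, lock file $lock_state"

# Clean up the scratch area.
rm -f "$lockdir/$sname"
rmdir "$lockdir"
```

Quoting "$RETVAL" is a defensive habit of mine; the system script leaves it unquoted, which works as long as the variable is never empty.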

Later in the script, there is undoubtedly a statement like rm -f /var/lock/subsys/$sname, and in fact, a cleaner way to write it would be to trap exit conditions and make sure that the lock file isn't left around, even if the script errors out. This is done with the trap shell command. Condition 0 isn't a real signal; it stands for the shell exiting (it also can be written as EXIT), so one clean way to write this is as follows:

trap "/bin/rm -f /var/lock/subsys/$sname" 0

This provides a lot of flexibility, because trap also can catch most of the dozens of possible signals, like SIGINT (interrupt) or SIGHUP (hangup). The exceptions are SIGKILL and SIGSTOP, which can't be trapped.
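Here's a sketch of how that cleanup might look in practice. The lock file name is hypothetical, mktemp is assumed to be installed, and the parentheses run the “script” in a subshell so you can watch its exit trap fire without leaving your own shell.

```shell
lockdir=$(mktemp -d)          # stand-in for /var/lock/subsys
lockfile="$lockdir/demo"      # hypothetical lock file name

(
    # Remove the lock whenever this subshell exits.
    trap 'rm -f "$lockfile"' 0
    # Turn hangup, interrupt and terminate into a normal exit,
    # so the cleanup above still runs.
    trap 'exit 1' HUP INT TERM

    touch "$lockfile"
    echo "working with the lock held"
)

if [ -e "$lockfile" ]; then
    after="still present"
else
    after="removed"
fi
echo "lock file after exit: $after"

rmdir "$lockdir"
```

Note the single quotes around the trap command: they delay expansion of $lockfile until the trap fires, whereas double quotes expand the variable at the moment the trap is set. Either works as long as the variable already has its final value.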

Anyway, you're not the first to be baffled by system scripts, but as you can see, a bit of persistence reveals all.