Filters: Doing It Your Way

Malcolm Murphy

Issue #27, July 1996

A look at several of the more flexible filters, programs that read some input, perform some operation on it, and write the altered data as output.

One of the basic philosophies of Linux (as with all flavours of Unix) is that each program does one particular task, and does it well. Often you combine several programs to achieve something, either at the shell prompt or in a script, by piping the output of one program into the next. I'm talking about things like

ls -l | more

and

ps -auxw | \
  grep netscape >> people.who.should.be.working

But what if the output of one program isn't in the format needed for the next? We need some way of processing the output of one program so that it is ready for the next.

Fortunately, there are many Linux programs that do this job: read some input, perform some operations on it, and write the altered data as the output. These programs are called filters. Some filters do quite limited tasks, such as head, grep and sort, whereas others are more flexible, such as sed and awk. In this article, we're going to look at several of these more flexible filters, and give several examples of what can be done with them.

The name “sed” is a contraction of stream editor; sed applies editing commands to a stream of data. A common use for sed is to replace one text pattern with another, as in

sed 's/Fred/Barney/g' foo

This command takes the file foo, changes every occurrence of Fred to Barney, and writes the modified version to standard output.

Note that in this example we have placed the actual sed commands inside single quotes. Sed doesn't require that commands be quoted this way, but you will need to use quotes if the sed command includes characters that are special to the shell, such as $ or *. This example doesn't have any special characters, so we could just as easily have left out the quotes. Try it and see.
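On the other hand, if the pattern itself contains a space, the quotes are essential: they keep the whole expression together as a single argument to sed. The names here are just for illustration:

sed 's/Fred Flintstone/Barney Rubble/g' foo

Without the quotes, the shell would split the command at the space, and sed would be handed an incomplete expression.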

Without the input file foo, sed reads from standard input, so we could achieve the same result with the command

sed 's/Fred/Barney/g' < foo

or

cat foo | sed 's/Fred/Barney/g'

Note that the first two versions are generally preferred to the third. Using cat just to send input into a pipe creates an extra process which can often be avoided.

We also have to consider the output. By default, the results appear on standard output, but this isn't always what we want. One option is to pipe the output through a pager, for example

sed 's/Fred/Barney/g' foo | more

or to redirect it to a file

sed 's/Fred/Barney/g' foo > bar

While it is often tempting to write

sed 's/Fred/Barney/g' foo > foo

the only thing this achieves is to delete the contents of the file foo! Why? Because the first thing the shell does with this command is to open the file foo for output, destroying what was there already. When sed tries to read from foo, there is nothing left to read. The result is an empty file. This is an easy mistake to make when redirecting output in this way, so do be careful.
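If you do want to replace a file with an edited version of itself, one safe approach is to write the output to a temporary file (foo.tmp here is just an illustrative name) and only move it over the original once sed has finished:

sed 's/Fred/Barney/g' foo > foo.tmp && mv foo.tmp foo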

Awk is a bit more flexible than sed; it is a full-fledged programming language in its own right. However, don't let that put you off. Writing simple programs in awk is surprisingly easy, and it often doesn't feel like a programming language [See page 46 of Linux Journal issue 25, May 1996—ED]. For example, the command

awk '{print NR, $0}' foo

prints the file foo, numbering each line as it goes. Awk can also read its input from a pipe or from standard input, exactly like sed, and also writes on standard output, unless you redirect it. The bit between the quotes (which are necessary, since the {} characters are also special characters to the shell) is the awk program. I said they can be simple, didn't I? An awk program is simply a sequence of one or more pattern-action statements, in the form

pattern { action }

Each input line is tested against each pattern in turn. When an input line matches a pattern, the corresponding action is performed. Either the pattern may be empty, in which case every line matches, or the action may be empty, in which case the default action is to print the line.

In the example above, the pattern was empty, so every line matched. The action was to print NR, which is a built-in awk variable containing the number of lines read so far, and then print $0, which is the current line.
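To see this in action, suppose foo contains just the two lines “apple” and “banana”. The command above would then print:

1 apple
2 banana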

Going On

Now that we've seen the basic idea behind sed and awk, we're going to look at some examples. The best way to learn something is to actually do it, and I recommend that you try out some of these examples yourself as you go along, possibly even with one eye on the man pages. We certainly aren't going to cover everything that sed and awk can do, but you will, it is hoped, have more confidence to try things out yourself once you've finished reading this article.

Our first example is to remove all the spaces from a document. This is easily achieved using sed:

sed 's/ *//g' foo

This is like the earlier example with Fred and Barney, only here we have used a regular expression: ' *' (the quotes are included so that you can see the space that is part of the regular expression). Sed's s (for substitute) command uses regular expressions just like grep. The regexp ' *' matches a run of spaces, which is replaced with nothing—the spaces are deleted. This command doesn't deal with tabs as it stands, but you could modify it to match one or more occurrences of either a tab or a space, where {tab} stands for a literal tab character:

sed 's/[ {tab}][ {tab}]*//g' foo
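Another filter, tr, is worth knowing about for this particular job. Its -d option deletes every occurrence of the listed characters, and it understands \t as an escape for the tab character. Note that tr only reads standard input, so we redirect the file into it:

tr -d ' \t' < foo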

Double Spacing

Next, we'll think about doublespacing a text file. We can do this using sed's substitute command by replacing $ (the regexp for the end of a line) with a newline character (which we have to quote with a backslash)

sed 's/$/\
/' foo

Note that in this example, there isn't a g before the second quote, unlike all the earlier examples. The g is used to tell sed that the substitution applies to all matches on each line, not just the first match on each line, which is the default behaviour. In this case, since each line only has one end, we don't need the g.

Another way of doing this in sed would be:

sed G foo

If you look at the man page for sed, it says that G “appends a newline character followed by the contents of the hold space to the pattern space”. The pattern space is the sed term for the line currently being read, and we don't need to worry about the hold space for now (trust me, it will be empty), so this command does exactly what we want.

It's quite easy to doublespace in awk, using the print statement we saw earlier:

awk '{print $0; print ""}' foo

Here, the pattern is empty again, matching every line, and the action is to print the entire line, $0, then to print nothing, "". Each print statement starts a new line, so the combined effect of the two commands is to doublespace the file.

Awk actions can (and often do) involve more than one command in this way, but it isn't strictly necessary here. Awk provides a formatted print statement that gives more control over the output than the basic print statement. So we could get the same result with:

awk '{printf("%s\n\n",$0)}' foo

The first argument to the printf statement is the format, a description of how the output should appear. The format can contain characters to be printed literally (none in this example), escape sequences (such as \n for a newline), and specifications. A specification is a sequence of characters beginning with a % that controls how the rest of the arguments are printed; there must be one specification for each of the second and subsequent arguments. In this example, there is one specification, %s, which prints a character string. The value associated with that specification is $0, the entire line. Unlike print, printf doesn't automatically start a new line, so two \n's are needed: one to end the original line and one to insert a blank line.
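printf comes into its own when there are several things to print on one line. For instance, a hypothetical variation on our earlier line-numbering example uses two specifications, %d for a decimal number and %s for a string:

awk '{printf("%d: %s\n", NR, $0)}' foo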

For this seemingly simple example—doublespacing a file—we came up with four different solutions. There is always more than one way of solving a problem, and it normally doesn't matter which one you take. The point is that you usually write an awk or sed program to do a particular task as the need arises, then discard it. You don't necessarily want the “best” solution (whatever that means), you just want something that works, and you want it quickly.

Being Selective

Another quite common task is to select just part of the input. Suppose we want the fifth line of the file foo. In awk, this would be

awk 'NR==5' foo

which prints the line when NR, the number of lines read so far, equals 5. The sed equivalent is

sed -n 5p foo

By default, sed prints every line of input after all commands have been applied. The -n option suppresses this behaviour, so we only get the line we specifically ask for with the p command. In this case, we asked for the fifth line, but we could just as easily have specified a range of lines, say the third to the fifth, with:

sed -n 3,5p foo

or, in awk

awk 'NR>=3 && NR<=5' foo

In the awk version, the && means “and”, so we want the lines where NR>=3 and NR<=5, that is, the third through the fifth lines.

Yet another approach would be to combine head and tail

head -5 foo | tail -3

which uses the head program to get the first 5 lines of the file, and the tail program to only pass the last three lines through.
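Awk patterns aren't limited to simple comparisons; any expression will do. As a hypothetical example, a one-liner to print only the even-numbered lines of a file could use the modulo operator:

awk 'NR % 2 == 0' foo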

Another common problem is removing just the first line. Remember how the $ character means the end of the line when it is used in a regular expression? Well, when you use it to specify a line number, it means the last line:

sed -n '2,$p' foo

In awk, you can use != or > to get the same result from either of these commands:

awk 'NR>1' foo
awk 'NR!=1' foo
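Alternatively, sed's d (delete) command, which we will meet shortly, gives an even terser version:

sed 1d foo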

When Line Numbers Are Not Enough

Selecting part of a file using line numbers is easy enough to do, but often you don't know the line numbers you want. Instead, you want to select lines based on their contents. In awk, we can easily select a line matching a pattern, with

awk '/regexp/' foo

which causes all lines containing regexp to be printed. There is a direct sed equivalent of this:

sed -n '/regexp/p' foo

Of course, we can also use grep to do this kind of thing:

grep 'regexp' foo

but sed can also handle ranges easily. For example, to get all lines of a file up to and including the first line matching a regexp, you would type:

sed -n '1,/regexp/p' foo

or to get all lines including and after the first line matching regexp:

sed -n '/regexp/,$p' foo

Remember that $ means the last line in a file. You can also specify a range based on two regexps. Try

sed -n '/regexp1/,/regexp2/p' foo

Note that this prints all blocks starting with lines containing regexp1 through lines containing regexp2, not just the first one. If there isn't a matching regexp2 for a line containing regexp1, then we get all lines through to the end of the file.

Now that we can select part of the input based on a regular expression, we can do other things with it besides printing. For example, we might want to delete lines that contain a certain pattern. The d command does just that:

sed '/regexp/d' foo

deletes all lines that match the regexp. Or, we might want to delete a block of text:

sed '/regexp1/,/regexp2/d' foo

deletes everything from a line that contains regexp1, up to and including a line that matches regexp2. Again, sed will select all blocks of text delimited by regexp1 and regexp2, so there is a danger we could delete more than we want to.

Inserting text at a given point is possible, too. The command

sed '/regexp/r bar' foo

inserts the contents of the file bar after any line that matches the regexp in the file foo.

Now, we can combine these last two commands to replace a block of text in a file with the contents of another file. We do it like this:

sed -e '/START/r bar' -e '/START/,/END/d' foo

This finds a line containing START, deletes through to a line containing END, then reads in the contents of the file bar. Because the r command doesn't actually read in the file until the next input line is read, the d command is executed before the new text arrives, so the new text isn't deleted, as one might otherwise expect from looking at this command. The -e option tells sed that the next argument is a command rather than an input file. Although it is optional when there is only one command, if we have multiple commands, each must be preceded with -e.
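To illustrate, suppose foo contains the five lines “before”, “START”, “old text”, “END” and “after”, and bar contains the single line “new text”. The command above would produce:

before
new text
after

Note that the START and END marker lines are deleted along with everything between them, so if you want to keep the markers, the replacement file bar should contain its own copies.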

Columns

These examples have mostly been line oriented, but we are just as likely to want to deal with columns of data. The filter cut can select columns of data. For example, to list the real names of all the users on your system, you could type

cut -f5 -d: /etc/passwd

The 5 after -f tells cut to list the fifth column (where real names are stored), and the -d flag is used to tell cut which character delimits the fields—in the case of the password file, it's a colon. To get both the username (which is in the first column) and the real name, we could use

cut -f1,5 -d: /etc/passwd
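cut can also select by character position rather than by field, using the -c flag. As a hypothetical example, this pulls the permission bits (characters 2 through 10) out of a long directory listing:

ls -l | cut -c2-10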

Awk is also good at getting at columns of data; we could repeat the password-file examples with the following awk commands:

awk -F: '{print $5}' /etc/passwd

and

awk -F: '{print $1,$5}' /etc/passwd

where the -F flag tells awk what character the fields are delimited by. (Do you see the difference between using cut and using awk for printing more than one field? If not, try running the commands again and looking more closely.)

One advantage of using awk is that we can perform operations on the columns.

For example, if we want to find out how much disk space the files in the current directory take up, we could total up the fifth column of the output of ls -l:

ls -l | grep -v '^d' | \
  awk '{s += $5} END {print s}'

In this command, we use grep to remove any lines that begin with d, so we don't count directories. We chose grep, but we could just as easily have used awk or sed to do this. One pure awk solution could be:

ls -l | awk '! /^d/ {s += $5} END {print s}'

where the awk program only totals the fifth column of lines that don't begin with a d—the exclamation mark before the pattern tells awk to select lines which don't match the regular expression /^d/.
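Once the data is inside awk, further calculations come almost for free. A hypothetical variation that prints the average file size instead of the total might be (the NF > 4 test skips the “total” header line that ls -l prints):

ls -l | awk '! /^d/ && NF > 4 {s += $5; n++} END {if (n) print s/n}'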

Working with Filenames

Often, many files have the same basic name, but different extensions. For example, suppose we have a TeX file foo.tex. Then we could very well have associated files foo.aux, foo.bib, foo.dvi, foo.ps, foo.idx, foo.log, etc. You might want a script to be able to process these files, given the name of the file foo.tex. The basename utility:

basename foo.tex .tex

will give you the basic name foo. If we have a shell variable containing the name of the TeX file, we might use

basename ${TEXFILE} .tex

Again, there is more than one way of getting the basename of a file: you could do this in sed using:

echo ${TEXFILE} | sed 's/\.tex$//'

(Note the backslash: an unescaped . in a regular expression would match any character, not just a literal dot.)

Whichever approach we take, we can construct the name of the other files once we know the basic name. For example, we can get the name of the log file by:

LOGFILE=`basename ${TEXFILE} .tex`.log
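A minimal sketch of how these pieces might fit together in a script (the variable names are just for illustration):

TEXFILE=foo.tex
BASE=`basename ${TEXFILE} .tex`
DVIFILE=${BASE}.dvi
LOGFILE=${BASE}.log
echo "dvi file: ${DVIFILE}, log file: ${LOGFILE}"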

This is very useful: I use vi for most of my editing, and it allows you to get at the name of the file currently being edited in a macro; % is replaced with the filename. If I'm editing a TeX file foo.tex, and I want to preview the dvi file using xdvi, I can transform % (let's call it foo.tex) into foo.dvi automatically in a macro

:!xdvi `basename % .tex`.dvi &

I can bind this to a function key and never worry about the name of the dvi file when I want to view it, by adding this line to my .exrc file:

map ^R :!xdvi `basename % .tex`.dvi &^M^M

The ^R and ^M characters are added by typing Control-V Control-R and Control-V Control-M, respectively, assuming you are editing your .exrc file with vi.

Conclusion

In this article, we have looked at some of the tools available in Linux for filtering text. We have seen how these filters can manipulate the output of one command so that it is in a more convenient form to be used as the input to another program, or to be read by a human. This kind of task arises naturally in a lot of shell-based work, both in scripts and on the command line, so it is a handy skill to have. Although the man pages for sed and awk can be a little cryptic, solutions to problems can often be very simple. With a little practice, you can do quite a lot.
