Manipulating PDF files

Docile Documents


We review a variety of tools for the command line that work on PDF files.

By Bruce Byfield

Portable Document Format (PDF) is one of the standards of the Internet. Although it's now less popular than in the late 1990s, thanks to improved browser support for Microsoft Office and Open Document formats, it remains in widespread use, especially in business.

Under these circumstances, sooner or later, many users are going to encounter a PDF file that they will want to edit or from which they will need to extract information.

Fortunately, PDF is an open standard and is closely related to PostScript, which free software has long supported. Consequently, you can choose from an enormous arsenal of tools for manipulating PDFs, especially at the command line. If these tools sometimes lack the functionality or polished interfaces of proprietary tools like Adobe Acrobat, they can boast unique features almost as often.

Should you need to edit a PDF on the desktop, you can install the Sun PDF Import Extension for OpenOffice.org [1] or PDFedit [2]. For many users, these desktop tools might be all they need to handle most aspects of PDFs. However, the functionality of the desktop tools almost seems impoverished compared with the wealth of tools available at the command line.

Although generating a desktop format from the command line might seem strange, PDF command-line tools have some advantages. For example, they can process large numbers of files at the same time and offer a wider range of options than their desktop counterparts. You might need to keep the man pages open while using them, but whether you are converting files to PDF or creating or editing PDFs, you will find at least two dozen tools for the job at the command line.

The most complete of these tools are PDFjam [3] and pdftk [4], although you might prefer to use one of the easier to learn tools that are dedicated to a single purpose instead.

Converting Files and Extracting Information

If you are a purist, you can work with PDFs via Ghostscript or command-line applications like LaTeX. However, because both take time to learn, most users prefer to use the numerous shell scripts available for manipulating PDF files instead.

Many PDF tools are small, single-purpose command-line utilities. In Ubuntu and Debian, a half dozen of the most popular are available in the poppler-utils package. However, many others are packaged separately, so use apt-cache search or yum search (depending on whether your distro uses .deb or .rpm packages) to see what else is available.

A large number of these tools are for converting between PDF and another format. Their names are self-explanatory - for example, pdftops, pdftohtml, pdftotext, pdf2svg, chm2pdf, ps2pdf, and wkhtmltopdf. These tools are especially useful when you want to extract the contents of a PDF file for editing. After conversion, you can edit much more freely than you can with Sun's PDF Import Extension and then use another tool to create a new PDF.
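As a rough illustration of the extract-for-editing workflow, the following sketch converts every PDF in a directory to a plain text file of the same name with pdftotext. The function name pdfs_to_text and the PDFTOTEXT variable are hypothetical conveniences rather than part of poppler-utils; the variable exists only so that you can substitute echo and preview the commands before running them.

```shell
# Sketch: convert each PDF in a directory to a .txt file beside it.
# PDFTOTEXT is an illustrative override; set it to "echo" to preview.
PDFTOTEXT=${PDFTOTEXT:-pdftotext}

pdfs_to_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # the glob matched nothing
        "$PDFTOTEXT" "$pdf" "${pdf%.pdf}.txt"
    done
}
```

Calling pdfs_to_text with a directory name processes that directory; with no argument, it works on the current one.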

Other basic tools give you information about PDF files. For example, the command pdfinfo file gives you basic information about a file (Figure 1). If the file is encrypted, you can supply its user password with the -upw option or its owner password with -opw so that pdfinfo can open it; neither option sets or changes a password. You can use pdfresurrect -i to receive similar information.

Figure 1: The pdfinfo script gives you a summary of the attributes for a PDF file.
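Because pdfinfo prints one labeled attribute per line, its output is easy to pick apart in a script. The sketch below is one possible way to pull out the page count from the "Pages:" line; the function name page_count is made up for this example, and it reads pdfinfo's output on standard input.

```shell
# Sketch: print the number that follows "Pages:" in pdfinfo output.
# Reads from standard input so it can sit at the end of a pipe.
page_count() {
    awk '$1 == "Pages:" { print $2; exit }'
}
```

A typical use would be to pipe the output of pdfinfo file into page_count.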

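Should you want more information, you can use pdffonts to see a list of PostScript and TrueType fonts used by a PDF, as well as whether they are embedded (Figure 2). Also, you can extract images from a file with the command pdfimages -j pdf-file image-root, where image-root is the prefix (optionally including a directory) for the names of the extracted files. With -j, images stored in the PDF as JPEGs are saved in JPG format, whereas other images are written as PPM or PBM files.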

Figure 2: A summary of the fonts used in a file, thanks to pdffonts.

Editing PDFs with PDFjam

Many of the available PDF tools have a single purpose. By contrast, PDFjam [3] is a front end for pdfpages, a popular LaTeX package [5]. PDFjam requires a large number of dependencies, including many for LaTeX, but it is ideal if you need to edit either the file attributes or the format of an existing PDF file.

PDFjam uses an unusual command format:

pdfjam [options] file1 'pages' file2 'pages'

Any page reference refers to all files to its left and to the right of a previous page reference, so that file1 file2 '2,3,4' is the same as file1 '2,3,4' file2 '2,3,4', and both would print pages 2 through 4 of both files. Furthermore, page references are generally listed with a comma between pages. However, you can specify all pages with a hyphen (-) or an empty page with curly braces ({}).

Moreover, the default behavior is to take all the input files and produce a single output file from them. You can specify the exact location of the output file with the option --outfile path. If you do not want to merge the input files, you must add the --batch option, which will produce a separate output file for each input file.
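To make the command format concrete, here is a minimal sketch that takes pages 2 through 4 of two input files and merges them into one output file. The function name join_pages_2_to_4 and the PDFJAM variable are invented for illustration; setting PDFJAM to echo prints the arguments instead of running PDFjam.

```shell
# Sketch: merge pages 2-4 of two PDFs into a single output file.
# PDFJAM is an illustrative override; set it to "echo" to preview.
PDFJAM=${PDFJAM:-pdfjam}

join_pages_2_to_4() {
    # $1 and $2 are the input files, $3 is the output file.
    # The single page reference applies to both files to its left.
    "$PDFJAM" --outfile "$3" "$1" "$2" '2,3,4'
}
```

Replacing --outfile and its path with --batch should instead yield one output file per input file, as described above.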

To further customize your results, use --suffix string to add the same suffix at the end of each output file. Additionally, you can use --keepinfo so that output files retain the same PDF file attributes as the input files. Alternatively, you can use --pdftitle, --pdfauthor, --pdfsubject, and --pdfkeywords, each followed by the string of your choice, to change the file attributes. If all you want is to remove the PDF file attributes, then use the --no-keepinfo option.

PDFjam also gives control over the appearance of the output file. For instance, you can set up PDFs for two-sided printing with --twoside or turn that behavior off with --no-twoside. Similarly, you can use --paper papername to specify a named page size (such as a4paper), use --papersize '{width,height}' to set a custom size, or --scale decimal to specify a scaling factor, such as 0.8 for 80 percent. You can even set an RGB color value for the background with --pagecolor red-value,green-value,blue-value.

If you find yourself using PDFjam repeatedly, you can save time by writing a configuration file called pdfjam.conf like the one on the project's homepage [6]. A global configuration file can be placed in /etc, /usr/share/etc, or /usr/local/etc, whereas a personal configuration file can be put in .pdfjam.conf in an account's home directory. After you have created a configuration file, you can tell PDFjam to ignore it for a single run with the option --vanilla.

PDFjam can be complicated to learn, so you should run it at first in its usual verbose mode to see if you are making mistakes. Later, if you prefer, you can use the --quiet option to spare yourself seeing what the script is doing. The command in Figure 3 has PDFjam take all the pages of cats.pdf and create a new output file called cats-AD.pdf, change the author attribute, and make the background color lime green.

Figure 3: PDFjam can be a complicated program to use, so running it in its default verbose mode is a good idea.
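A command along the lines of the one described for Figure 3 might look like the following sketch. It is a reconstruction from the description, not a copy of the figure: the author string is whatever you pass in, the RGB triple 50,205,50 is only one plausible rendering of lime green, and the function name tag_cats and the PDFJAM variable are made up for the example.

```shell
# Sketch of a command matching the description of Figure 3.
# Assumptions: the RGB value for lime green and the author string
# are illustrative. Set PDFJAM to "echo" to preview the command.
PDFJAM=${PDFJAM:-pdfjam}

tag_cats() {
    # $1 is the new author attribute; '-' selects all pages.
    "$PDFJAM" --outfile cats-AD.pdf \
              --pdfauthor "$1" \
              --pagecolor 50,205,50 \
              cats.pdf '-'
}
```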

Editing PDFs with pdftk

If PDFjam lacks the features you need, then another all-in-one script to try is the PDF Tool Kit, or pdftk, as it is better known [4]. The project homepage explains its function by saying that, "If PDF is electronic paper, then pdftk is an electronic staple remover, hole-punch, binder, secret decoder ring, and X-ray glasses."

To run pdftk, you need a version of Java, but if you are concerned about software freedom, the version can be any free version, from the basic GCJ [7] to OpenJava [8]. PDFjam and pdftk overlap to a certain extent - both with each other and with smaller scripts - but each has features that the other lacks.

Like PDFjam, pdftk has its own unique format. The most noticeable difference is that options are placed at the end of the command as completions, rather than immediately after the basic command. For example, to merge several PDF files into one, pdftk's format is pdftk file1 file2 cat output output-file. Here, cat is an abbreviation of concatenate, so the output file will contain the input files in the order you list them.

To complicate matters even more, pdftk also supports a system of what it calls "handles" - labels for each file that enable you to specify different options for each file within the same command. Handles are defined when a file is entered, so the start of a command might be pdftk A=file1 B=file2. Once handles are defined, you can then use them later in the command. For example, if you wanted to combine pages 1 through 3 from file A with pages 4 through 9 from file B, then the entire command would be pdftk A=file1 B=file2 cat A1-3 B4-9 output output-file. Alternatively, if you wanted to include all the pages in file A, then you could simply type A instead of A1-3.
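Spelled out in full, the handle example of pages 1 through 3 from file A followed by pages 4 through 9 from file B could be wrapped as in this sketch. The function name combine_ranges and the PDFTK variable are hypothetical; setting PDFTK to echo shows the arguments pdftk would receive without running it.

```shell
# Sketch: pages 1-3 of the first file, then pages 4-9 of the second.
# PDFTK is an illustrative override; set it to "echo" to preview.
PDFTK=${PDFTK:-pdftk}

combine_ranges() {
    # $1 becomes handle A, $2 becomes handle B, $3 is the output file.
    "$PDFTK" A="$1" B="$2" cat A1-3 B4-9 output "$3"
}
```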

Handles can be complicated, but their advantage is that you can treat each input file differently within a single command. As the examples on the homepage demonstrate, you can, for example, decrypt a file on the fly as you combine it with another, or rotate one file but not another.

What else can pdftk do? You can end a command with attach_files file to add a file to a PDF as an attachment, or with unpack_files to extract attachments into the current folder. If you want to split a PDF into a series of single-page PDFs, you can simply add burst. To view the information about a single file, you can use dump_data.
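As a small example of these operations, the sketch below splits a PDF into single-page files with burst. The function name split_pages and the PDFTK variable are invented for illustration, and the output pattern pg_%04d.pdf is written out explicitly here so that the file names are predictable.

```shell
# Sketch: split a PDF into one file per page.
# PDFTK is an illustrative override; set it to "echo" to preview.
PDFTK=${PDFTK:-pdftk}

split_pages() {
    # $1 is the input file; $2 optionally overrides the printf-style
    # pattern that names the single-page output files.
    "$PDFTK" "$1" burst output "${2:-pg_%04d.pdf}"
}
```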

If you are concerned about security, you can encrypt the output file with encrypt_40bit or encrypt_128bit. Similarly, you can add a user password with user_pw password or an owner password with owner_pw password. To identify a file, you can use stamp or multistamp to use another PDF as a watermark, either specifying the file or waiting to be prompted. If you want to further control how the PDF is used, you can add allow permissions. The permission will be one of a number of features, including AllFeatures, Printing (for top-quality printing), DegradedPrinting, ModifyContents, or ModifyAnnotations.
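Put together, the security options might be combined as in the following sketch, which encrypts a copy of a file with 128-bit encryption, sets both passwords, and permits printing. The function name protect_pdf and the PDFTK variable are hypothetical, and the passwords are whatever you pass in.

```shell
# Sketch: write an encrypted copy of a PDF that still allows printing.
# PDFTK is an illustrative override; set it to "echo" to preview.
PDFTK=${PDFTK:-pdftk}

protect_pdf() {
    # $1 input, $2 output, $3 owner password, $4 user password
    "$PDFTK" "$1" output "$2" encrypt_128bit \
             owner_pw "$3" user_pw "$4" allow Printing
}
```

Keep in mind that passwords typed on the command line can end up in your shell history, so a wrapper like this is better suited to experimenting than to protecting sensitive files.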

By far the most useful feature of pdftk is that you can use it to fill in PDF forms, which have always been a weak spot in free software's ability to manipulate PDFs. Using the completion generate_fdf, you can create a Forms Data Format (FDF) file to edit. Then, you can run pdftk a second time, using fill_form FDFfile. Another possibility is to give fill_form the keyword PROMPT in place of a file name, in which case pdftk asks you for the name of the data file when it runs. Neither alternative is as convenient as typing directly into the form fields, but both are considerably better than nothing.
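The two-step form workflow can be sketched as a pair of small wrappers: one to export the form fields to an FDF file and one to merge the edited file back into the form. The operation is spelled generate_fdf in pdftk's own documentation. The function names export_fields and fill_fields and the PDFTK variable are made up for this example.

```shell
# Sketch: export form fields to FDF, then fill the form from an FDF file.
# PDFTK is an illustrative override; set it to "echo" to preview.
PDFTK=${PDFTK:-pdftk}

export_fields() {
    # $1 is the PDF form, $2 is the FDF file to create.
    "$PDFTK" "$1" generate_fdf output "$2"
}

fill_fields() {
    # $1 is the PDF form, $2 the edited FDF file, $3 the filled-in PDF.
    "$PDFTK" "$1" fill_form "$2" output "$3"
}
```

Between the two calls, you would open the FDF file in a text editor and enter a value for each field.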

Unlike PDFjam, pdftk does not describe its operations by default as it runs, except for reporting errors. If you want more feedback, then the last completion you should add to a command is verbose (Figure 4). Even then, pdftk does not print any notification that it has completed all operations - it simply returns you to the command prompt. You might want to add the do_ask completion, so that pdftk does not do anything such as automatically overwrite an existing file when it creates the output file.

Figure 4: If you want to see what pdftk is doing, put it in verbose mode (top); otherwise, you only get the error messages (bottom).

As an application, pdftk can be complicated to learn. If you are a diehard desktop user, you might want to try its graphical version, PDF Chain (Figure 5). Be warned, however, that it can be as hard to learn as the command-line pdftk - although it does give you a better view at a glance of what pdftk can do.

Figure 5: The graphical version of pdftk can be as difficult to learn as the command-line version.

Learning the Tools

The command-line PDF tools are numerous enough to keep you busy learning for several hours. Even then, you might have to refresh your memory, unless you use the commands all the time. I suggest choosing PDFjam or pdftk to concentrate on or - as far as possible - using the smaller, more dedicated tools. The dedicated tools, especially those bundled as poppler-utils, tend to have similar options. Even when they don't, however, the options are fewer than those in PDFjam and pdftk, to say nothing of being more conventional in format. For these reasons, the dedicated tools should be easier to use.

And what about actually creating PDFs from the command line? That is a topic for another column.

INFO
[1] PDF Import Extension: http://extensions.services.openoffice.org/project/pdfimport
[2] PDFedit: http://pdfedit.petricek.net/en/index.html
[3] PDFjam: http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic/firth/software/pdfjam/
[4] pdftk: http://www.accesspdf.com/pdftk/
[5] pdfpages: http://www.tex.ac.uk/CTAN/macros/latex/contrib/pdfpages/pdfpages.pdf
[6] Sample pdfjam.conf: http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic/firth/software/pdfjam/pdfjam.conf
[7] GCJ: http://gcc.gnu.org/java/
[8] OpenJava: http://www.csg.is.titech.ac.jp/openjava/