By Daniel Stender
Many tools exist for creating quality e-books from documents or book scans on Linux. One popular approach is to use Sane and its front ends to read the documents. With a little help from ImageMagick, you can then conveniently compile the results in a batch process. The unpaper processor provides a useful service, especially in automated post-processing and upgrading of book scans.
After this, it makes sense to package the images in a DjVu (pronounced "déjà vu") or PDF file and add bookmarks. DjVu is a container format for raster graphics developed by AT&T. This potential alternative to PDFs has a more effective and faster compression algorithm. Viewers are available for all major platforms. The command line provides a convenient approach.
As the pièce de résistance, you can add an OCR (optical character recognition) layer to a book, if needed, to add the ability to search the text. Linux also has open source tools for doing this. The free software in this field is so professional that it compares well with commercial products.
The Sane [1] scanner suite is extremely popular and very much part of the regular lineup for most distributions. Thanks to the XSane [2] front end, you can easily create a book scan from a series of individual scans. For optimum OCR results, it is a good idea to scan text with a resolution of no less than 300 dpi.
Grayscale scans in, say, the PGM ("Portable Graymap") format must be translated into monochrome (PBM, "Portable Bitmap"). XSane automatically numbers the scans; if needed, it can also scan just a section of the page and store the result rotated through 90 degrees. ImageMagick [3] is useful for rotating, cropping, and editing, too. This collection of programs includes a variety of tools (the following example uses convert) that are easily integrated in shell scripts:
$ for i in *pgm; do convert $i -rotate 90 -verbose ${i%pgm}pbm; done
This one-liner rotates all the PGM files in the working directory through 90 degrees and converts them into PBM at the same time. You can reduce the size of the thick, black stripe in the middle of the scan, which is caused by the book's binding, by setting a lower -white-threshold value for conversion. In production use, values between 25 and 35 percent have worked well.
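Rotation and thresholding can be combined in a single pass. The following is a minimal sketch, assuming ImageMagick's convert is installed; the function name rotate_scans, the DRY_RUN switch, and the 30% default are illustrative choices and not part of the one-liner above.

```shell
# Sketch: rotate every PGM scan and convert it to PBM in one pass.
# rotate_scans and DRY_RUN are illustrative names, not part of ImageMagick.
rotate_scans() {
    threshold=${1:-30%}          # assumed default, inside the 25-35% range
    for i in *.pgm; do
        [ -e "$i" ] || continue  # no PGM files: the glob stays unexpanded
        out=${i%pgm}pbm
        if [ -n "$DRY_RUN" ]; then
            # Print the command instead of running it.
            echo "convert $i -rotate 90 -white-threshold $threshold $out"
        else
            convert "$i" -rotate 90 -white-threshold "$threshold" "$out"
        fi
    done
}
```

Calling rotate_scans 25% with DRY_RUN=1 set first shows what would be executed, which is a cheap way to try different thresholds on a handful of pages before converting the whole book.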
After converting, it makes sense to check the scans with a graphic viewer. You can use Geeqie [4] for this. The program shares with many other image viewers the ability to right-click (or press a shortcut) to load an image into GIMP [5]. When you get there, you can remove fly dirt, manual notes, or the like.
Unpaper [6] by Jens Gulden is an intelligent post-processing tool for document scans. The current 0.3 version is available with Debian 5.0 "Lenny," Ubuntu 9.10 "Karmic Koala," openSUSE 11.2, or Fedora Version 10 or newer. The tool removes dirt and soiling captured by the scan along with the black line in the middle. If needed, it will split double pages into single pages, rotate, deskew, and center blocks of text. This makes unpaper the perfect choice for processing scans made from quick-and-dirty photocopies (Figure 1).
To be prepared for any eventuality, you can fine-tune unpaper's functions by setting a number of options. That said, the defaults are normally fine for good results (Figure 2). As a genuine command-line tool, unpaper naturally has the ability to batch process files. To keep the output from overwriting the input files, your best approach is to redirect the output to a different directory:
$ unpaper --layout double --output-pages 2 %04d.pbm out/%04d.pbm
In this example, the input is a double-page book scan. The output comprises two individual pages. The format string %04d gives the files names comprising four digits (0001.pbm). If it does not already exist, you will need to create the ./out directory before you launch the command. If the tool skips lines during processing, try setting the mask scan size (-ms) to a higher value, such as 175,175.
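The directory creation and the unpaper call can be wrapped together. This is a minimal sketch under the assumptions of the example above (double-page PBM input named with four digits); split_pages and DRY_RUN are hypothetical names, and any extra arguments are passed through to unpaper unchanged.

```shell
# Sketch: split double-page scans into single pages in a separate directory.
# split_pages and DRY_RUN are illustrative names, not part of unpaper.
split_pages() {
    outdir=${1:-out}
    [ $# -gt 0 ] && shift        # remaining arguments are extra unpaper options
    mkdir -p "$outdir"           # the output directory must exist beforehand
    set -- unpaper --layout double --output-pages 2 "$@" \
        %04d.pbm "$outdir/%04d.pbm"
    if [ -n "$DRY_RUN" ]; then
        echo "$*"                # print the command instead of running it
    else
        "$@"
    fi
}
```

For example, split_pages out -ms 175,175 applies the larger mask scan size mentioned above, while split_pages alone uses unpaper's defaults.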
A package of Djvulibre [7] tools (djvulibre-bin on Debian and Ubuntu, djvulibre on openSUSE and Fedora) helps users manipulate DjVu files. With a shell loop, you can run the processed book scans through the Djvulibre monochrome encoder, cjb2 (Listing 1, line 1). Then, use djvm to merge the containers (line 2). After that, you can view your new e-book with a viewer such as Djview. Debian-based systems and Fedora call the package djview4; openSUSE calls it djvulibre-djview4.
Listing 1: Processing Book Scans
01 $ for i in *pbm; do echo $i; cjb2 $i ${i%pbm}djvu; done
02 $ djvm -c myebook.djvu *djvu
Next, you can add bookmarks to the DjVu file as needed (in DjVu speak, this is referred to as an outline). To do so, create an outline file in text format using an editor. The file should look something like Listing 2. Then, use the djvused tool to package the file in the DjVu container you created previously:
$ djvused myebook.djvu -e 'set-outline myebook.outline' -s
It doesn't get much easier.
Listing 2: Creating a Bookmarks File
(bookmarks
 ("Title" "#1")
 ("Body" "#5"
  ("Chapter 1" "#5")
  ("Chapter 2" "#10")
  ("Chapter 3" "#15")
 )
)
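For longer books, typing the outline by hand gets tedious. The following is a minimal sketch that generates a flat outline (no nested chapters) from lines of the form title|page; the helper make_outline and the input format are assumptions made for illustration, not part of Djvulibre, and titles containing double quotes or backslashes would need escaping first.

```shell
# Sketch: build a flat djvused outline from "title|page" lines on stdin.
# make_outline is an illustrative helper, not part of DjVuLibre.
make_outline() {
    printf '(bookmarks\n'
    while IFS='|' read -r title page; do
        [ -n "$title" ] || continue      # skip blank lines
        printf ' ("%s" "#%s")\n' "$title" "$page"
    done
    printf ')\n'
}
```

With a hypothetical table of contents in toc.txt, make_outline < toc.txt > myebook.outline produces a file that the djvused set-outline command shown above can read.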
To create PDF files from scans on Linux, you first need to convert the PBM files to TIFFs. Again, a small loop is all it takes (Listing 3, line 1). Then, create a multi-page TIFF from the individual images (line 2) and convert the TIFF into a PDF with a final command (line 3).
The two tools required for doing this are part of the standard libtiff-tools package (libtiff3 on openSUSE, libtiff on Fedora). The tiff2pdf command has -j and -z options that let you use the JPEG or ZIP compression algorithms.
Listing 3: Converting PBM Files to TIFFs
01 $ for i in *pbm; do convert $i -verbose ${i%pbm}tif; done
02 $ tiffcp *tif myebook.tif
03 $ tiff2pdf -o myebook.pdf myebook.tif
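The three steps in Listing 3 can also be wrapped in one function that accepts a compression switch. This is a sketch only: build_pdf, run, and DRY_RUN are hypothetical names, and the page list is collected explicitly so that a multi-page TIFF left over from an earlier run is not merged into itself.

```shell
# Sketch: PBM -> TIFF -> multi-page TIFF -> PDF, optionally compressed.
# build_pdf, run, and DRY_RUN are illustrative names; pass -z or -j as $2.
run() {
    if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

build_pdf() {
    name=${1:-myebook}
    flag=$2                      # optional tiff2pdf compression switch
    set --                       # reuse "$@" as the list of page TIFFs
    for i in *.pbm; do
        [ -e "$i" ] || continue
        run convert "$i" "${i%pbm}tif"
        set -- "$@" "${i%pbm}tif"
    done
    [ $# -gt 0 ] || return 1     # nothing to do without pages
    run tiffcp "$@" "$name.tif"
    run tiff2pdf $flag -o "$name.pdf" "$name.tif"
}
```

Calling build_pdf myebook -z would request ZIP compression; leaving out the second argument matches the plain tiff2pdf call in Listing 3.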
The Tesseract [8] OCR engine and the OCRopus [9] wrapper let users create high-quality book scans with an OCR layer on Linux. Besides the main Tesseract program (tesseract-ocr, tesseract on openSUSE and Fedora), you might need to set up some language packs (such as tesseract-ocr-deu, tesseract-data-deu on openSUSE, tesseract-langpack on Fedora). The package manager will show you the dependencies and offer to resolve them during the install.
Although you could basically use Tesseract without any other tools for the text scans, it makes sense to control the engine via the OCRopus wrapper (0.3.1 on Debian Squeeze and Ubuntu "Lucid Lynx"; 0.2 on Ubuntu Karmic; not included for openSUSE and Fedora) to perform layout analyses of the scans. During the OCR process, the software captures not only the text but also its position on the page. This makes it possible to highlight the word for which you are searching in the e-book.
OCRopus is also useful for scripting. Use ocroscript for this and pass in the Tesseract language file for optimum results:
$ for i in *pbm; do echo $i; ocroscript recognize --tesslanguage=deu --charboxes $i > ${i%pbm}hocr; done
Both OCRopus and Tesseract can handle grayscale scans and images in the files. The output takes the form of OCR files in hOCR [10] format, which is an XHTML-based data format that also includes the layout of the identified text. You can edit the output with a text editor if needed. The next step is to integrate the hOCR file in the e-book as an OCR layer.
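OCR on a whole book can take a while, so it can help to make the loop resumable. The following is a minimal sketch that skips pages whose hOCR file already exists and is not empty; ocr_pages and the OCR_CMD override are hypothetical names introduced here, and the ocroscript options are simply the ones from the loop above.

```shell
# Sketch: OCR each page once, skipping pages that already have hOCR output.
# ocr_pages and OCR_CMD are illustrative names; OCR_CMD defaults to ocroscript.
ocr_pages() {
    lang=${1:-deu}
    for i in *.pbm; do
        [ -e "$i" ] || continue
        out=${i%pbm}hocr
        if [ -s "$out" ]; then
            echo "skip $i" >&2   # non-empty result from an earlier run
            continue
        fi
        echo "ocr $i" >&2
        # Write to a temporary file so a failed run leaves no partial result.
        if "${OCR_CMD:-ocroscript}" recognize --tesslanguage="$lang" \
                --charboxes "$i" > "$out.tmp"; then
            mv "$out.tmp" "$out"
        else
            rm -f "$out.tmp"
        fi
    done
}
```

Running ocr_pages deu a second time after an interruption then only processes the pages that are still missing.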
In the case of PDF-based e-books, you can use the hocr2pdf tool from the ExactImage suite [11] (Debian "Squeeze" and Ubuntu Karmic) to add the hOCR metadata created by OCRopus to the OCR layer. To do so, you need to convert the PBM files back into TIFFs after the OCR process. Then, combine the data with the matching hOCR files in individual PDFs:
$ for i in *tif; do hocr2pdf -i $i -o ${i%tif}pdf < ${i%tif}hocr; done
The layout files here have some difficulties (the highlighting is offset), but people are aware of the problem [12]. Finally, after the PDFs are created, you can package the individual PDFs in a container using pdftk [13]:
$ pdftk *pdf cat output myebook.pdf
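Because hocr2pdf reads each page's hOCR file from standard input, a missing or empty file would derail the loop. The following is a small sketch of a pre-flight check; missing_hocr is a hypothetical helper, not part of ExactImage or pdftk.

```shell
# Sketch: list TIFF pages that lack a matching, non-empty hOCR file.
# missing_hocr is an illustrative helper, not part of ExactImage.
missing_hocr() {
    status=0
    for i in *.tif; do
        [ -e "$i" ] || continue
        if [ ! -s "${i%tif}hocr" ]; then
            echo "$i"            # report the page without OCR data
            status=1
        fi
    done
    return $status
}
```

The function returns success only when every page has OCR data, so it can guard the conversion, for example by running missing_hocr first and starting the hocr2pdf loop only if it prints nothing.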
If you are creating e-books in DjVu format, there is a useful helper called ocrodjvu by Jakub Wilk [14] (Debian Squeeze and Ubuntu Lucid). Ocrodjvu takes over control of OCRopus and automates the whole process of text extraction and insertion into an existing DjVu-based e-book; as an alternative to Tesseract/OCRopus, you can also use Cuneiform (see the "Open Source OCR" box).
As of this writing, grayscale scans are not supported; however, there are plans to change this. The ocrodjvu package also includes the hocr2djvu converter that lets you convert hOCR files into the DjVu metadata format. Then, you can use Djvused to insert the results into the individual files, as with the PDFs, before you run Djvm to create a container.
Today's OCR engines will output fairly usable text given optimum conditions such as clean scans and a sufficient resolution. Of course, for a perfect finish, you could always edit the data gleaned by OCRopus or Tesseract using a text editor to remove the final bugs, but this can be a very time-consuming process.
What does make sense is to run a spell checker against the final results. GNU Aspell [15] (currently version 0.6.60) gives you the ability to exclude all the XHTML tags from the spell check. To check all the hOCR files in a directory, you can launch Aspell in a loop:
$ for i in *hocr; do aspell --lang=de --mode=html -c $i; done
Of course, this assumes that you installed the correct Aspell language package and specified the use of this package.
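Before starting an interactive session, it can be useful to get an overview of how many questionable words there are. The following is a minimal sketch that uses Aspell's non-interactive list command instead of -c; misspelled_words and the ASPELL override are hypothetical names introduced for illustration.

```shell
# Sketch: collect misspelled words from all hOCR files without editing them.
# misspelled_words and ASPELL are illustrative names; ASPELL defaults to aspell.
misspelled_words() {
    lang=${1:-de}
    for i in *.hocr; do
        [ -e "$i" ] || continue
        # "list" reads text on stdin and prints unknown words, one per line.
        "${ASPELL:-aspell}" --lang="$lang" --mode=html list < "$i"
    done | sort -u
}
```

Piping the result through wc -l gives a rough idea of how much manual correction a scan would need.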
Open Source OCR
Currently, four OCR engines are freely available: Cuneiform by Cognitive Technologies from Russia (currently version 0.9), GOCR by Jörg Schulenburg (0.48), Ocrad as part of the GNU project (0.19), and Tesseract. In contrast to the other programs, Cuneiform and Tesseract include recognition data for a variety of languages, thus increasing the accuracy.
Tesseract (currently version 2.04) was originally programmed by Hewlett-Packard. Google is currently developing the software under the Apache License 2.0. It is a powerful OCR engine used for Google Books. However, Tesseract is an engine without a statistical language module or layout analysis, both of which, along with the ability to recognize Chinese, are on the roadmap for the next major release.
To complete the OCR process with layout analysis and language modules, staff at the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI, German Research Center for Artificial Intelligence) in Kaiserslautern programmed the OCRopus wrapper for use with Tesseract (latest stable: 0.4.3).
Front ends are now available to handle the procedures described in this article for creating e-books through to the finished DjVu or PDF. As a powerful GUI-based application, gscan2pdf [16] combines all the components from the scanning programs, through unpaper, to OCR in a single interface (version 0.9.29 on Debian Stable, Ubuntu Karmic and openSUSE 11.2). The latest version, gscan2pdf 0.9.30 (Debian Squeeze and Fedora 12), also includes a port for OCRopus.
Scan Tailor [17] is another candidate, although it is not quite as luxurious. For larger projects that include archiving, it might be worth your while to investigate the free, major e-document servers such as ArchivistaBox [18] or openDias [19]. These tools are best suited to large document volumes. Having said this, development is tightly tied in with the latest free OCR software that was used in our workshop.
THE AUTHOR
Daniel Stender is a PhD student of Classical Indology and has used Debian on the desktop to the exclusion of all other operating systems for years. He is interested in the use of open source applications in Sanskrit philology. You will find his blog at http://www.danielstender.com/granthinam.