Finding files with Recoll

Digging In


Whether you're looking for a letter to the Internal Revenue Service or an email from an online trader, the Recoll desktop search engine will help you find it with just a few mouse clicks.

By Tim Schürmann

JCOLL, Fotolia

Even if you keep to a strict system of filing data and documents in a well-thought-out directory structure, you're bound to lose track of a file sooner or later, probably when you need it most. The file manager's search function might help here, but unfortunately, it checks only file names. If you're lucky, you also might be able to check the content of text files, but that's not much help if your OpenOffice file with the letter to the IRS is stored as 12112005fa. After the search has ground away at your hard disk for ages, the results are likely to be disappointing.

Enter Recoll [1], your personal full-text search engine (not to be confused with the Rekall database). Recoll searches for the keys you type, both in external attributes such as the file name and in the documents themselves. Like other desktop search engines, such as Beagle [2], Recoll creates an index to do this. The program has an impressive arsenal of utilities that help search through document content and relies on the Xapian [3] index engine.

Under the Hood

If you take a closer look, you will notice that Recoll is simply a front-end for the lean, but excellent, Xapian search engine. Xapian uses a fairly sophisticated approach to acquiring information. The library is responsible for creating the index, and it uses a database to remember which word occurs at what position in each document.
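To make the idea of a positional index more concrete, the following shell sketch records which word occurs at which position in which file. It only illustrates the concept and says nothing about Xapian's actual storage format; the two sample files and their contents are made up.

```shell
# Create two tiny sample documents (hypothetical content) in a scratch directory.
docs_dir=$(mktemp -d)
printf 'letter to the tax office\n' > "$docs_dir/a.txt"
printf 'the tax letter\n' > "$docs_dir/b.txt"

# Emit one "word file position" line per word, then sort by word.
# pos[FILENAME] counts words per file, so positions restart at 1 for each document.
index=$(cd "$docs_dir" && awk '{
  for (i = 1; i <= NF; i++) print tolower($i), FILENAME, ++pos[FILENAME]
}' a.txt b.txt | sort)

# Looking up a word now means reading the sorted list, not the documents.
printf '%s\n' "$index" | grep '^tax '

# Remove the scratch directory again.
rm -rf "$docs_dir"
```

The lookup prints one line per occurrence of tax, each naming the file and the word position, which is the kind of information a search engine needs for phrase searches.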

Installation

Users with Fedora Core 5; Mandriva 2005, 2006, or 2007; SUSE 10.1; or Ubuntu 6.10 can relax at this stage. The Recoll homepage has a version for your distribution that you can download and install using your standard package manager. The search engine will "automagically" appear in your start menu afterward.

Users of all other distributions will need to download the source code archive and build their own executables. To do so, surf to the Xapian project homepage [3] and download the xapian-core package. In a directory of your choice, you should unpack the package, open a terminal window, become root, and run the following commands:

./configure
make install
ldconfig

Make sure you have installed the Qt development library. Some distributions install it by default, whereas on others you will need to install it manually. The package is usually identifiable by its suffix of -devel or -dev. Recoll requires Qt version 3.3.5 or newer.

After completing this preparatory work, it's time to install the search engine itself. To do so, first download the archive from the Recoll homepage [1] and unpack it, then type ./configure and make to build the application, and make install to drop the application into the right paths on your system. You also will need to manually add Recoll to your start menu, but you can always launch the application by entering recoll in a terminal window if you prefer.

Index for a Brain

When first launched, Recoll starts to poke around in your home directory, investigating any files it finds there and storing their characteristics in an index (Figure 1). Thanks to the information stored in the index, Recoll finds documents containing your search keys much faster, so it makes sense to click on Ok when prompted. The main window then appears with a status bar at the bottom edge and a progress indicator for the indexing process.

Figure 1: Recoll creates an index on initial launch.

Unfortunately, the search engine relies completely on the index file, so there is no way around updating the index at regular intervals by selecting File | Update Index.

If you prefer to automate this process, you will need the recollindex command-line program, which you can launch regularly via your crontab. Most distributions have special configuration programs that help you create a crontab entry for this.
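If you would rather write the crontab entry by hand, the following sketch shows one way to do it. The schedule (every night at 03:00) is just an example, and the entry assumes that recollindex is in the search path that cron uses; if it is not, use the full path to the program instead.

```shell
# Hypothetical schedule: run the indexer every night at 03:00.
# Assumes recollindex is on the PATH that cron uses.
cron_line='0 3 * * * recollindex'

# Print the existing crontab (if there is one) followed by the new entry.
# Nothing is installed at this point; the command only shows the result.
{ crontab -l 2>/dev/null || true; printf '%s\n' "$cron_line"; }
```

Once you are happy with the output, you can pipe it into crontab - to install it, or paste the line into the editor that crontab -e opens.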

Another issue arises because the index can grow quite dramatically. You can expect the index to grow to the size of your own home directory, but it might get even bigger.

It's hard to say how long Recoll will take to create the index because it depends on the volume of data and on how fast your computer is, so expect your mileage to vary.

As a rough guide, the tool takes about half an hour to investigate 15GB. After investigating a document, Recoll will not try to read it again - not even if you update the index manually - unless you modify the document. Between updates, searches will miss documents that you have deleted, that you have moved to another directory, or that Recoll has not seen yet.

Finders Keepers

After creating the index, you can start entering search keys. To do so, type a word or a couple of words in the input box above the results box, and click Search. Recoll then checks the index and presents a list of documents with the key you searched for (Figure 2). Like the Google Internet search engine, the software ranks the documents it finds by relevance to the search keys.

Figure 2: Recoll found several documents with the term "qt development overview."

Assuming the search engine guessed right, the top entry should be the document you are looking for. To launch a matching viewer, click Edit. To modify your choice of viewing applications, use the settings in Preferences | Query configuration below Manage in the User Interface tab.

By default, Recoll lists all documents that contain at least one of your search keys. If you only want to see results that contain all of your search keys, just select All terms in the list to the left of the input box. Quotes allow you to search for phrases.

Tools | Advanced Search takes you to the input form in Figure 3, in which you can specify more details for the search. This form lets you specify words that the document must not contain and, at the bottom of the form, restrict the search to specific file formats.

Figure 3: In the advanced search form, you can restrict the search. In this case, Recoll will search documents that contain the word qt and are PDF documents.

A Question of Formats

By default, Recoll supports searching in text, HTML, and OpenOffice format files as well as the Maildir and Mbox mailbox formats. To search in files in other formats, you need a few additional tools, including:

  • PDF: pdftotext (part of the Xpdf package)
  • PostScript: pstotext
  • Word: Antiword
  • Excel and PowerPoint: catdoc
  • RTF: UnRTF
  • DVI: dvips
  • DjVu: DjVuLibre
  • MP3: If id3lib is installed, Recoll will parse ID3 tags.
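To see which of these helpers are already on your system, you could run a short loop like the one below. The command names are assumptions based on the programs these packages usually install; in particular, djvutxt stands in for DjVuLibre, and your distribution might use different names.

```shell
# Command names are assumptions based on the usual upstream packages;
# djvutxt is the text extractor shipped with DjVuLibre.
report=""
for tool in pdftotext pstotext antiword catdoc unrtf dvips djvutxt; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="found"
  else
    status="missing"
  fi
  # Collect one "name: status" line per helper.
  report="${report}${tool}: ${status}
"
done
printf '%s' "$report"
```

Any helper reported as missing can then be installed through your package manager before you rebuild the index.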

Box of Tricks

Sometimes you can't exactly recall the word you are looking for. "Money" is a good choice for a letter to the Internal Revenue Service, but it might just as easily have been "monetary." If you install the Aspell dictionary, Recoll will automatically take care of this by finding all words with the same word stem.

The program will also conjugate verbs. To make sure that Recoll uses the correct language, you need to specify the language in Preferences | Query configuration by clicking the Search parameters tab and setting the Stemming language.

By default, Recoll sets up stemming for English only. A quick glance at .recoll/xapiandb revealed a subdirectory called stem_english with a couple of index files. The sample configuration shows that you need an indexstemminglanguages option for any other languages that you intend to use stemming with.

Recoll relies on the Snowball engine [4] in addition to Xapian for this kind of search. The Snowball website reveals more about the supported languages. Because the software creates word-stem files for English only, you first need to change the value of the option to english spanish to create indexes for both languages. After doing so, re-index by running recollindex to create a stem_spanish subdirectory for Spanish.
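As a minimal sketch, the configuration entry for English and Spanish stemming could look like the line printed below. The snippet only prints the line and does not touch your configuration; the assumption is that the option lives in the recoll.conf file in your .recoll directory.

```shell
# Line to place in ~/.recoll/recoll.conf (printed here, not written to disk).
stem_line='indexstemminglanguages = english spanish'
printf '%s\n' "$stem_line"

# After editing the file, rebuild the index so that stem_spanish is created:
#   recollindex
```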

A typo in the search term completely throws the search algorithm (see Figure 4). In tricky cases like this, you can resort to the Term Explorer (Tools | Term explorer).

Figure 4: Phonetic mode is helpful when you have a typo in your search term.

The Term Explorer lets you use Wildcards and Regexps on the list of terms in the index, perform Stem expansion, or run Aspell against the search key (Spelling/Phonetics).

For the latter, you need to have Aspell and the required dictionaries installed. Our experiments with a new Ubuntu 6.10 installation showed that Spanish did not work out of the box, despite choosing Spanish as the default language.

Some manual work was necessary here to install aspell-es, but the feature worked after doing so.

You can browse the Term Explorer just as you would leaf through a dictionary by typing a search key and clicking on Expand. For wildcard or regular expression searches or for word-stem expansion, the program returns a list of matches and a number. Double-clicking a term drops the term into the search box. It's a bit unfortunate that the Term Explorer window hides the search bar at this point, leaving you wondering what double-clicking actually does.

The Wildcard and Regexp features both understand symbols such as * or ?. However, the way these features work is different and takes some getting used to. For example, Wildcard with *ing finds words that end in ing in the documents included in the search. In contrast, Regexp requires you to enter [A-Za-z]*ing to do something similar; of course, this regexp does not find words with digits before the ing string.
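The difference is easier to see on a small word list. The following sketch uses shell patterns and grep as stand-ins for the two Term Explorer modes; Recoll's own matching might differ in detail (for example, in how it anchors a regexp), the letter range is spelled as [A-Za-z], and the word list is made up.

```shell
# A tiny stand-in for the list of terms in the index (hypothetical words).
terms='indexing
search
string
3ding
parsing'

# Wildcard style: *ing matches any term that ends in "ing".
wildcard_hits=$(printf '%s\n' "$terms" | while read -r t; do
  case $t in
    *ing) printf '%s\n' "$t" ;;
  esac
done)

# Regexp style, anchored at both ends: only letters may precede "ing".
regexp_hits=$(printf '%s\n' "$terms" | grep -E '^[A-Za-z]*ing$')

printf 'wildcard:\n%s\n\nregexp:\n%s\n' "$wildcard_hits" "$regexp_hits"
```

Both modes find indexing, string, and parsing, but only the wildcard also returns 3ding, because the regexp allows nothing but letters before the ing.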

Keep Out

By default, Recoll will search all documents in the current user's home directory. If you want to change this behavior and keep a couple of subdirectories out of the search, you need to edit the configuration file manually with the current version of Recoll. The file is located in a hidden .recoll directory, along with the index file.

With your favorite text editor, open the recoll.conf file, add a topdirs = entry if it does not already exist, then add the names of the directories you want Recoll to index, separating the names with blanks. The search engine automatically includes all subdirectories below these directories; anything else in your home directory stays out of the search.
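A minimal sketch of such an entry follows, on the assumption that Recoll reads topdirs as the list of directories where indexing starts, as its documentation describes. Documents and Mail are hypothetical directory names, and the snippet only prints the line without changing your configuration.

```shell
# Sample line for ~/.recoll/recoll.conf (printed here, not written to disk).
# Documents and Mail are hypothetical directory names.
topdirs_line='topdirs = ~/Documents ~/Mail'
printf '%s\n' "$topdirs_line"
```

With an entry like this, everything in the home directory outside the two named directories would be left out of the index the next time it is rebuilt.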

Conclusions

Recoll is a fast, lean desktop search engine that is simple to use but still offers a useful collection of functions for performing more granular searches. The only point of criticism is the bulky index file that consumes vast amounts of valuable disk space. If you have a large collection of documents or find yourself searching manually more often than you would care to, Recoll might be just the tool you need.

INFO
[1] Recoll: http://www.recoll.org
[2] Beagle: http://www.gnome.org/projects/beagle/
[3] Xapian project: http://www.xapian.org
[4] Snowball engine: http://snowball.tartarus.org