Archiving Email Messages with Hypermail
Cleaning Up
Andrea Müller
Nearly everyone keeps postcards and letters from their loved ones. And even if you don't, you probably keep at least one file at home with letters from authorities, banks, and insurance companies. If your important documents are filed in an organized and accessible way, you are more likely to find a document you need when you come back later. Why not apply the same principle to old email messages, using Hypermail (http://www.hypermail.org)?
Hypermail converts your email messages into HTML documents (Figure 1). Each document contains links to any preceding or answering messages in the thread. The program stores attachments in a subfolder and places a link to the attachment in the HTML file. To allow you to find the messages you are looking for, Hypermail also generates a number of index pages, where the messages are sorted by subject, author (Figure 2), date, and thread. Additionally, Hypermail generates an attachment.html file with a list of mail attachments.
Some distributions - such as Suse Linux Professional, Mandrake Power Pack, or Debian - include Hypermail by default. If Hypermail is included with the distribution, you can simply run your distribution's package manager to install it. If your distribution does not include Hypermail, or if you would prefer to use the latest version of the mail archiver, you will need the source code to build Hypermail yourself. The source archive is available from the project homepage. Hypermail is unlikely to give you a hard time. Assuming that the gcc, make, and glibc-devel packages are available on your system, you can simply type ./configure, make and su -c "make install" to build and install the program. The call to make install copies the program and accompanying files to a directory below /usr/local.
Hypermail only accepts the mbox format as input (see the box titled "mbox Format"). Some mail programs - Evolution and Mozilla, for example - use the mbox file format by default to store email messages. If you use one of these applications, all you need to do is to create a separate folder for the files you want to archive. The mbox file will have the same name as the folder in the mail client. Users with Mozilla need to look at the directory tree below their profile folder ~/.mozilla/default/xxxxxxx/Mail, where xxxxxxx is an arbitrary string that Mozilla uses to identify the profile. You should find a directory with the same name as your mail account; this is where Mozilla stores mbox files without a filename extension. Evolution stores mail in a folder called ~/.evolution.
If your mail client does not use mbox format, it may still have a function for creating an mbox folder, or it may allow you to export messages into mbox format. KMail asks you to specify the format when you create a new folder (Figure 3) and stores the folder below the ~/Mail directory. Sylpheed users can use the Export to Mbox file function in the File menu.
After installing Hypermail, enter the following command
hypermail -m mailbox -d outputdirectory
to create your first mail archive. Specify the path to your mbox file with the -m parameter; the -d parameter names the directory where Hypermail will create the archive. You do not need to create the directory before running the command; Hypermail will take care of that automatically. When the program has finished, you should find the HTML-formatted messages and the index files in the target directory. The index.html file contains a thread list by default; you can click on one of the links to go to a message. If you like, you can move to one of the other indexes (subject, author, date, or attachments).
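If you would rather experiment on a throwaway mailbox before touching your real mail, a sketch along the following lines may help. The addresses, message IDs, and temporary paths are invented for the test, and the Hypermail call only runs if the program is actually installed.

```shell
# Create a scratch directory so nothing in your real mail folders is touched.
workdir=$(mktemp -d)
sample_mbox="$workdir/sample.mbox"
archive_dir="$workdir/archive"

# Two made-up messages; each one starts with a "From " separator line.
cat > "$sample_mbox" <<'EOF'
From alice@example.com Sat Jun 14 14:45:12 2003
Message-ID: <sample-1@example.com>
Date: Sat, 14 Jun 2003 14:45:12 +0200
From: alice@example.com
To: bob@example.com
Subject: Test mail

Hello!

From bob@example.com Sat Jun 14 14:48:14 2003
Message-ID: <sample-2@example.com>
In-Reply-To: <sample-1@example.com>
Date: Sat, 14 Jun 2003 14:48:14 +0200
From: bob@example.com
To: alice@example.com
Subject: Re: Test mail

Hello back!
EOF

# Count the separator lines to confirm the file holds two messages.
grep -c '^From ' "$sample_mbox"   # prints 2

# Convert the sample only if hypermail is available on this system.
if command -v hypermail >/dev/null 2>&1; then
    hypermail -m "$sample_mbox" -d "$archive_dir" ||
        echo "hypermail reported an error" >&2
else
    echo "hypermail not found; install it first" >&2
fi
```

If the conversion succeeds, open index.html in the archive directory with your browser to inspect the result.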
Hypermail supports several languages other than English. If you prefer to use a language other than English for the HTML pages, add -L, followed by the language parameter, as follows:
hypermail -L es -m mailbox -d folder
sets the language to Spanish. Besides Spanish, Hypermail can give you Italian (it), Russian (ru), and German (de), for example. The -h flag gives you a list of supported languages in the line that starts with -L.
The headings on the index pages include the archive name - Hypermail sets this to the name of the mbox file. To assign an individual name, use the -l command line switch followed by the name. It is not necessary to run Hypermail multiple times to merge multiple mailboxes to a single HTML archive; instead, simply specify the names of the mbox files you want to merge as the -m parameter value. Hypermail allows you to add messages to an archive at a later date. The -u parameter tells Hypermail to update the archive.
hypermail -u -m newbox -d folder
adds the messages in newbox to an archive in the folder directory. Of course, Hypermail will update the index files to reflect the changes.
Let's assume you want to add messages that you have composed yourself, and that are now sitting in your Outbox, to the archive, and your mail program does not generate message IDs. You cannot use the default setting here, as this just provokes an error message: Message-ID is missing, ignoring message with subject `subject'. The -o require_msgids=0 option tells Hypermail to process messages of this kind. -o is short for options, and believe me, the program has quite a few of them. You can type man hmrc for a list.
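To find out in advance whether a mailbox contains messages without a message ID, you could count them with a small awk filter. This is only a sketch: the function name count_missing_msgids is made up for this example, and the filter assumes a well-formed mbox file in which every line starting with "From " really begins a new message.

```shell
# count_missing_msgids FILE
# Prints how many messages in an mbox file have no Message-ID header.
# Assumption: lines starting with "From " occur only as message separators,
# that is, such lines inside message bodies are already quoted as ">From ".
count_missing_msgids() {
    awk '
        /^From / {                      # separator line: a new message begins
            if (seen && !has_id) missing++
            seen = 1; in_header = 1; has_id = 0
            next
        }
        in_header && /^$/ { in_header = 0; next }   # blank line ends the header
        in_header && tolower($0) ~ /^message-id:/ { has_id = 1 }
        END {
            if (seen && !has_id) missing++          # account for the last message
            print missing + 0
        }
    ' "$1"
}
```

Call the function with the path to your mbox file, for example count_missing_msgids ~/Mail/outbox (the path is just an illustration). If the result is greater than zero, add -o require_msgids=0 to the Hypermail call.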
If you are archiving mailboxes with a large number of entries, you might like to take a closer look at the monthly_index and folder_by_date options. The former option tells Hypermail to add an overview to the index.html file, which points to monthly indexes (Figure 4). This is a good thing speed-wise, as an index file for a few thousand messages can easily reach a size of 10 MB or more. In contrast to this, the browser should load a monthly index in next to no time. This option does not mean that the HTML files will be placed in separate folders, however. To distribute your files over multiple monthly folders, you need the Hypermail folder_by_date option. Let's combine this feature with a monthly index:
hypermail -m mbox -d folder -o monthly_index=1 -o folder_by_date=%y%m
The %y%m is a so-called format string, where %y stands for the year, and %m for the month when the message was created. This command tells Hypermail to create subfolders with names such as 0312 below the output directory. Messages from December 2003 would be stored in this subfolder. If you prefer to have the month first, simply switch the order of the format string: folder_by_date=%m%y
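The %y and %m placeholders look like the ones used by the date command, so you can preview a folder name on the command line. The sketch below assumes that Hypermail expands the format string in the same way, and it relies on the -d option of GNU date; the function name folder_name is invented for this example.

```shell
# folder_name DATE FORMAT
# Prints the name that FORMAT produces for DATE (requires GNU date for -d).
folder_name() {
    date -d "$1" "+$2"
}

folder_name 2003-12-14 '%y%m'   # prints 0312
folder_name 2003-12-14 '%m%y'   # prints 1203
```

Putting the year first has a practical advantage: the folders then sort chronologically in a directory listing.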
If you are planning to publish archives on a web server, for example, the -o spamprotect=1 option is a good idea. This option tells Hypermail to modify the mail addresses. Instead of name@domain, the program writes name_at_domain. This makes it more difficult for spammers to harvest target addresses. You can use the -o antispam_at=replacementcharacter option to tell Hypermail what to write instead of the @ character.
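The effect of the address rewriting can be illustrated with a one-line sed call. This is not Hypermail's own code, just a sketch of the result described above; the function name obfuscate is made up, and the replacement string must not contain characters that are special to sed, such as / or &.

```shell
# obfuscate ADDRESS [REPLACEMENT]
# Replaces the @ in ADDRESS with REPLACEMENT (default: _at_).
obfuscate() {
    printf '%s\n' "$1" | sed "s/@/${2:-_at_}/"
}

obfuscate name@domain           # prints name_at_domain
obfuscate name@domain ' AT '    # prints name AT domain
```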
mbox Format

mbox is the generic term for a family of legacy mailbox formats used by Unix hosts to store mail messages. The name reflects the approach that mbox adopts for storing multiple messages in a single mailbox file. The various mbox formats were categorized by Daniel J. Bernstein, Rahul Dhesi, and others in 1996 and include mboxo, mboxrd, mboxcl, and mboxcl2. Within an mbox file, incoming messages are appended to the file using the "From " string to identify the beginning and an empty line to identify the end of each message. Because the mbox format was never officially specified in an RFC, it leaves programmers of mail user agents some scope for designing incompatible formats. An example mbox file is shown in Listing 1.

Listing 1: An mbox file

From user@example.com Sat Jun 14 14:45:12 2003
Received: from localhost (localhost.localdomain [127.0.0.1])
    by anmen.not-for-mail (8.11.6/8.11.6) with ESMTP id h5ECjBA29295
    for ; Sat, 14 Jun 2003 14:45:11 +0200
Message-ID: <3EEB0E35.C0077C5@example.com>
Date: Sat, 14 Jun 2003 13:59:49 +0200
From: User Domain
To: a414@sedacon.com (Marc Andre Selig)
Subject: Test mail

Hello!

From a414@sedacon.com Sat Jun 14 14:48:14 2003
Date: Sat, 14 Jun 2003 14:48:14 +0200
From: a414@sedacon.com
To: a414@sedacon.com (Marc Andre Selig)
Subject: Another test

Yet another test.

If the body text of a message just happens to have an empty line followed by a line starting with "From" and then a blank space, the "From" string is replaced by ">From" so the line will not look like the start of a new message.

Stringing messages together in a single large file makes for efficient use of the inodes on a filesystem. On the downside, access to mbox files becomes slower and less responsive as they grow. Another disadvantage of the mbox format is that it requires locking, so that multiple programs will not access the file in parallel.
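The quoting rule is easy to try out with sed. Note that this sketch is simpler than the rule described above: it quotes every body line that starts with "From ", whether or not an empty line precedes it. The function name escape_from is invented for this example.

```shell
# escape_from
# Reads a message body on standard input and quotes lines starting with
# "From " so that they cannot be mistaken for a message separator.
escape_from() {
    sed 's/^From />From /'
}

printf 'See you soon.\n\nFrom now on, use the new address.\n' | escape_from
```

The last line of the output starts with ">From now on"; the other lines pass through unchanged. Header lines such as "From: user@example.com" are not affected either, as the colon follows "From" directly.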
The program has a few more useful features, such as a quote tagging option. If you want to highlight quotes, rather than just using the quote character (typically >), you could specify -o iquotes=1 to use italics for quotes. The -o linkquotes=1 option is also useful. This option tells Hypermail to generate a link from the first quote to take you to the original message.
Options are a good thing, but they do have a downside: nobody can remember all of them. This typically means checking the manpage every time you need to run the archiving tool. Once you have found a set of options that are perfect for your requirements, there is a clever way of avoiding having to enter them every time you run Hypermail. When Hypermail launches, it parses the .hmrc file in your home directory. This means you can specify the number of header lines you want to see; you can even define monthly folders and the path to the mailbox file. Options are written just like they are on the command line. Let's look at an example with the command line switch -o. The following entry
require_msgids=0
tells Hypermail to archive messages without a message ID. Command line options have priority over entries in .hmrc. You can specify a default mailbox but still change the mailbox using the -m option, followed by the input file name. Listing 2 is a sample .hmrc.
If you tidy up regularly, you can look forward to a quicker and less cluttered mail program. And you can browse older messages any time you like. The Hypermail indexes look great, even if you use a text-based browser. If your archives are taking up too much of your hard disk, you can simply swap an archive out to a CD.
GLOSSARY

Message ID: A unique identifier in the email header, composed of an arbitrary local part, the @ character, and a domain part. The message ID could be abcdefghijkl@example.com, for example. The uniqueness of the message ID is useful for Usenet, as most news servers simply reject messages with a message ID they have seen before.

Listing 2: ~/.hmrc Options

#create Spanish language page
language=es

#European date format
eurodate=1

#Standard mailbox
mbox=/home/andi/archiv

#Links to quotes
linkquotes=1

#Create monthly folders
folder_by_date=%y%m

#Display headers
showheaders=1

#Header lines that Hypermail should display
show_headers=From,To,Subject,Date,Message-ID,User-Agent,X-Mailer,X-Newsreader

#Display quotes in italics
iquotes=1

#Do not archive messages with X-Hypermail-Deleted in header
deleted=X-Hypermail-Deleted

#No mailto link
mailto=none

#Create monthly index
monthly_index=1