A Web-Based Clipping Service

Reuven M. Lerner

Issue #68, December 1999

Let LWP turn your web client into a midnight marauder.

In November, we saw how Perl's Library for Web Programming (LWP) can be used to create a simple HTTP client, retrieving one or more pages from the Web. This month, we will extend those efforts to create a program that can not only retrieve pages from the Web, but categorize them according to our preferences. In this way, we can create our own web-based clipping service, finding those articles that are of particular interest to us.

LWP consists of several modules which allow us to work with HTTP, the “hypertext transfer protocol”. HTTP works on a stateless request-response basis: a client connects to a server and submits a request. The server then generates a response, and closes the connection. (If you missed last month's column, “Working with LWP”, you should read it before continuing.)

Downloading Files

We need a program that will go to a particular URL and save the contents of that URL on disk. Furthermore, we want to follow any hyperlinks in that document to collect other news stories. However, we do not want to follow links to other sites; this not only reduces the chances that we will get sidetracked, but also keeps the program from trying to download a large chunk of the Web.

In other words, I would like to be able to point a program at a site and retrieve all of its files on to the disk. A first stab at such a program, download-recursively.pl, is similar to the simple robot program we explored last month. It uses two hashes, %already_retrieved and %to_be_retrieved, to store URLs. Rather than storing the URLs as values in the hash, we use them as keys. This ensures each URL will appear only once, avoiding infinite loops and miscounting. URLs are placed in %to_be_retrieved when they are first encountered, then moved to %already_retrieved after their contents are retrieved. $origin, a scalar variable that contains the initial URL, has a default setting if no argument is provided on the command line.
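
A sketch of that bookkeeping, using the variable and hash names just described (the default URL shown here is only a placeholder):

use strict;

# URLs we have already fetched, and URLs still waiting to be fetched.
# Using URLs as hash keys guarantees that each one appears only once.
my %already_retrieved = ();
my %to_be_retrieved   = ();

# Starting point: the first command-line argument, or a default URL.
my $origin = shift @ARGV || "http://www.example.com/";   # placeholder default
$to_be_retrieved{$origin} = 1;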

Retrievals are performed with a while loop. Each iteration of the while loop retrieves another URL from %to_be_retrieved, and uses it to create a new instance of HTTP::Request.
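
The skeleton of that loop might look something like this (a sketch only; the real program also performs the age check, link extraction and saving described below):

use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;

# Keep looping as long as URLs remain to be retrieved.
while (%to_be_retrieved)
{
    # Pull any one URL out of the hash and mark it as retrieved.
    my ($url) = keys %to_be_retrieved;
    delete $to_be_retrieved{$url};
    next if $already_retrieved{$url}++;

    my $request  = HTTP::Request->new(GET => $url);
    my $response = $ua->request($request);
    next unless $response->is_success;

    # ... age check, link extraction and saving to disk go here ...
}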

The method $response->last_modified returns the date and time on which a document was last modified. Subtracting $response->last_modified from the current time, and then comparing this result with the maximum age of documents we wish to see ($maximum_age) allows us to filter out relatively old documents:

my $document_age = time - $response->last_modified;
if ($document_age > $maximum_age)
{
    print STDOUT "  Age of document: $document_age\n";
    next;
}

If the document is too old, we use next to return us to the next iteration of the while loop—and thus the next URL to be retrieved.

Next, we parse the contents of the HTTP response using the HTML::LinkExtor module. When we create an instance of HTML::LinkExtor, we are actually creating a simple parser that can look through a page of HTML and extract its links. The analysis is performed by a subroutine, conventionally named callback. A reference to callback, along with the URL being parsed (which serves as the base for resolving relative links), is passed to create a new instance of HTML::LinkExtor:

my $parser = HTML::LinkExtor->new (\&callback, $url);

The resulting object can then parse our URL's contents by invoking:

$parser->parse($response->content);

When $parser->parse is invoked, &callback is invoked once for each link-containing tag in the document. Our version of &callback grabs the URL from the href attribute of each <a> tag, saving it in %to_be_retrieved unless it already exists in %already_retrieved.
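
HTML::LinkExtor hands the callback each tag name along with that tag's link attributes. A minimal sketch of such a callback, using the hash names from this article and omitting the same-site check described earlier, might look like this:

sub callback
{
    my ($tag, %attributes) = @_;

    # We are only interested in hyperlinks, i.e., <a href="...">.
    return unless ($tag eq "a" and exists $attributes{href});

    my $url = $attributes{href};

    # Queue the URL unless we have already retrieved it.
    $to_be_retrieved{$url} = 1
        unless exists $already_retrieved{$url};
}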

Finally, we save the retrieved document on the local file system. The tricky part of saving the file to disk has to do with the way in which we are retrieving the URLs—we are not traversing a tree of URLs, but are pulling URLs out in their hash order. This means the URL http://foo.com/a/b/c/ might be retrieved before http://foo.com/a/index.html. Thus, we need to ensure that the directory /a/b/c exists on our local system before /a and /a/b are created. How can we do this?

My solution was to use Perl's built-in split operator, which turns a scalar into a list. By assigning this list of partial directories into an array (@output_directory), we can sequentially build up the directory from the root (/) down to the final name. Along the way, we check to see if the directory exists. If it does not, we create the new directory with mkdir. If the directory does not exist and mkdir fails, we exit with a fatal error, indicating what error occurred.
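
A sketch of that loop follows; $path (the directory portion of the URL) is an assumed name, while $output_directory and @output_directory follow the article:

# Split the directory portion of the URL into its components.
my @output_directory = split m|/|, $path;

# Build up the path one component at a time, creating directories
# as needed underneath $output_directory.
my $so_far = $output_directory;
foreach my $component (@output_directory)
{
    next unless length $component;    # skip the empty leading element
    $so_far .= "/$component";
    unless (-d $so_far)
    {
        mkdir $so_far, 0755
            or die "Cannot mkdir $so_far: $!";
    }
}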

Those URLs that lack a file name are given the name “index.html”. Since this is the default file name on many web servers, it is unlikely to collide with the name of an existing document.
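
For example, a check along these lines (where $url and $filename are assumed names) supplies the default:

# Use whatever follows the last slash as the file name; if the URL
# ends in a slash, fall back to "index.html".
my ($filename) = $url =~ m|/([^/]*)$|;
$filename = "index.html"
    unless defined $filename and length $filename;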

The end result of running this program is a mirror of the downloaded site, stored in $output_directory.

Sorting Through the Output

It is handy to be able to download all or part of a web site. However, our initial goal was to be able to sort through the contents of a web site for one or more phrases of interest to us.

Such a program is not very different from download-recursively.pl. Our new version, download-matching.pl, differs in that it stores only messages that contain one or more of the phrases stored in an external file, phrase-file.txt. The code for both of these programs can be found in the file ftp.linuxjournal.com/pub/lj/listings/issue68/3714.tgz.

There are several ways to perform such checking and matching. I chose a relatively simple and straightforward approach, iterating through each phrase in the file and using Perl's built-in string-matching mechanism.

Since the phrases will remain constant during the entire program, we load them from phrase-file.txt before the while loop begins:

my $phrase_file = "phrase-file.txt";
my @phrases;
open PHRASES, $phrase_file
    or die "Cannot read $phrase_file: $! ";
while (<PHRASES>)
{
    chomp;
    push @phrases, $_;
}
close PHRASES;

The above code iterates through each line of the phrase file, removing the trailing newline (with chomp) and then storing the line in @phrases. Each phrase must be on its own line in the phrase file; one possible file could look like this:

Linux
Reuven
mortgage

Once @phrases contains all of the phrases for which we want to search, download-matching.pl proceeds much like its less discriminating predecessor. The difference comes into play after the callback has already been invoked, scanning through the file for any new links. A site's table of contents might not contain any of the strings in @phrases, but the documents to which it points might.

After collecting new links, but before writing the file to disk, download-matching.pl iterates through the phrases in @phrases, comparing each one with the document. If it finds a match, it sets $did_match to 1 and exits from the loop:

foreach my $phrase (@phrases)
{
    if ($content =~ m/>.*[^<]*\b$phrase\b/is)
    {
        # Did we match?
        $did_match = 1;
        print "        Matched $phrase\n";
        # Exit from the foreach if we found a match
        last;
    }
}

Notice how we surround $phrase with \b. This is Perl's word-boundary metacharacter, and it ensures that our phrases do not match in the middle of another word. For instance, if we were to search for “vest”, the \b metacharacters ensure that download-matching.pl will not match the word “investments”.
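
A quick, standalone demonstration of the difference:

my $phrase = "vest";
print "first matches\n"  if "a vest pocket" =~ m/\b$phrase\b/i;   # prints
print "second matches\n" if "investments"   =~ m/\b$phrase\b/i;   # does not print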

If $did_match is set to a non-zero value, at least one of the phrases was found in the document. (We use the /i option to Perl's m// matching operator to indicate that the search should be case-insensitive. If you prefer to make capital letters distinct from lowercase letters, remove the /i.) If $did_match is set to 0, we use next to go to the next iteration of the while loop, and thus to the next URL in %to_be_retrieved.

This all presumes a Boolean “or” match, in which only one of the phrases needs to match. If we want to insist that all of our phrases appear in a file to get a positive result (an “and” match), we must alter our strategy somewhat. Instead of setting $did_match to 1, we increment it each time a match is found. We then compare the value of $did_match with the number of elements in @phrases; if they are equal, we can be sure all of the phrases were contained in the document.
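
A sketch of that variation, using a simplified match against $content:

# An "and" match: count how many phrases appear, and insist that
# every phrase in @phrases was found.
my $did_match = 0;
foreach my $phrase (@phrases)
{
    $did_match++ if $content =~ m/\b$phrase\b/is;
}

# Still inside the while loop: skip this document unless all matched.
next unless $did_match == @phrases;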

If possible, we want to avoid matching text contained within HTML tags. While writing this article, for instance, I was surprised to discover just how many articles on Wired News (a technical news source) matched the word “mortgage”. In the end, I found the program was matching a phrase inside HTML tags, rather than in the text itself. We could avoid this problem by stripping the HTML tags from the file, but that would mean losing the ability to navigate through links when reading the downloaded files.

Instead, download-matching.pl copies the contents of the currently examined file into a variable ($content) and removes the HTML tags from it:

my $content = $response->content;
$content =~ s|<.+?>||gs;

Notice how we use the g and s options to the substitution operator (s///), removing every HTML tag in the document, even one that spans more than one line. (The s option includes the newline character in the definition of ., which is normally not the case.)

We avoid the ramifications of a greedy regular expression, in which Perl tries to match as much as possible, by putting a ? after the +. If we were to replace <.+>, rather than <.+?>, we would remove everything between the first < and the final > in the file—which would probably include the contents, as well as the HTML tags.
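
Here is a small, standalone example of the difference:

my $html = "<b>Linux</b> is <i>fun</i>";

(my $greedy     = $html) =~ s|<.+>||gs;    # strips from the first < to the last >
(my $non_greedy = $html) =~ s|<.+?>||gs;   # strips each tag individually

print "'$greedy'\n";        # prints ''
print "'$non_greedy'\n";    # prints 'Linux is fun'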

One final improvement of download-matching.pl over download-recursively.pl is that it can handle multiple command-line arguments. If @ARGV contains one or more arguments, these are used to initially populate %to_be_retrieved. If @ARGV is empty, we assign a default URL to $ARGV[0]. In both cases, the elements of @ARGV are turned into keys of %to_be_retrieved:

foreach my $url (@ARGV)
{
    print "    Adding $url to the list...\n"
        if $DEBUGGING;
    $to_be_retrieved{$url} = 1;
}
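
Before that loop runs, the default mentioned above can be supplied along these lines (the URL shown is only a placeholder):

# If no URLs were given on the command line, fall back to a default.
$ARGV[0] = "http://www.example.com/" unless @ARGV;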

Using download-matching.pl

Now that we have a program to retrieve documents that fit our criteria, how can we use it? We could run it from the command line, but the point of this program is to do your work for you, downloading documents while you sleep or watch television.

The easiest way is to use cron, the Linux facility that allows us to run programs at regular intervals. Each user has his or her own crontab, a table that indicates when a program should be run. Each command is preceded by five columns that indicate the time and date on which a program should be run: the minute, hour, day of month, month and day of the week. These columns are normally filled with numbers, but an asterisk can be used to indicate a wild card.

The following entry in a crontab indicates the program /bin/foo should be run every Sunday at 4:05 A.M.:

5 4 * * 0 /bin/foo

Be sure to use a complete path name when using cron—here we indicated /bin/foo, rather than just “foo”.

The crontab is edited with the crontab program, using its -e option. This will open the editor defined in the EDITOR environment variable, which is vi by default. (Emacs users should consider setting this to emacsclient, which loads the file in an already running Emacs process.)

To download all of the files matching our phrases from Wired News every day at midnight, we could use the following:

0 0 * * * /usr/bin/download-matching.pl \
    http://www.wired.com/news/

This will start the process of downloading files from http://www.wired.com/news/ at midnight, placing the results in $output_directory. We can specify multiple URLs as well, allowing us to retrieve news from more than one of our favorite news sources. When we wake up in the morning, new documents that interest us will be waiting for us in $output_directory.

Conclusion

Many organizations hire clipping services to find news that is of interest to them. With a bit of cleverness and heavy reliance on LWP, we can create our own personalized clipping service, downloading documents that reflect our personal interests. No longer must you look through a list of headlines in order to find relevant documents—let Perl and the Web do your work for you.

Reuven M. Lerner is an Internet and Web consultant living in Haifa, Israel. His book Core Perl will soon be published by Prentice-Hall. Reuven can be reached at reuven@lerner.co.il. The ATF home page is at http://www.lerner.co.il/atf/.
