Suppose that the links we want to check are in a remote HTML file that's not quite as rigidly formatted as my local bookmark file. Suppose, in fact, that a representative section looks like this:
<p>Dear Diary, <br>I was listening to <a href="http://www.freshair.com">Fresh
Air</a> the other day and they had <a href
="http://www.cs.Helsinki.FI/u/torvalds/">Linus Torvalds</a> on, and he was
going on about how he wrote some kinda <a
href="http://www.linux.org/">program</a> or something. If he's so smart, why
didn't he write something useful, like <a
href="why_I_love_tetris.html">Tetris</a> or <a
href="../minesweeper_hints/" >Minesweeper</a>, huh?
In the case of the bookmarks, we noted that links were each alone on a line, all absolute, and each capturable with m/ HREF="([^"\s]+)" /. But none of those things are true here! Some links (such as href="why_I_love_tetris.html") are relative, some lines have more than one link in them, and one link even has a newline between its href attribute name and its ="..." attribute value.
Regexps are still usable, though—it's just a matter of applying them to a whole document (instead of to individual lines) and also making the regexp a bit more permissive:
while ( $document =~ m/\s+href\s*=\s*"([^"\s]+)"/gi ) {
  my $url = $1;
  ...
}
(The /g modifier ("g" originally for "globally") on the regexp tries to match the pattern as many times as it can, each time picking up where the last match left off.)
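To see the permissive pattern and the /g loop at work without fetching anything, here is a minimal sketch that runs the same regexp over an in-memory string. The extract_hrefs helper and the sample markup are made up for illustration and are not part of the book's examples.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the book): collect every double-quoted
# href value found anywhere in the given string.
sub extract_hrefs {
    my ($document) = @_;
    my @urls;
    # In a while condition, /g resumes after the previous match each time.
    # /i lets "HREF" match too; \s* tolerates spaces or newlines around "=".
    while ( $document =~ m/\s+href\s*=\s*"([^"\s]+)"/gi ) {
        push @urls, $1;
    }
    return @urls;
}

# Made-up sample: two links on one line, one of them split across a newline,
# and a third link with extra spaces around the "=".
my $sample = qq{<a href="a.html">A</a> and <a HREF\n="b.html">B</a>\n}
           . qq{<a href = "../c/" >C</a>\n};

print "$_\n" for extract_hrefs($sample);
```

Run on its own, this prints a.html, b.html, and ../c/, one per line. Matching against the whole string rather than line by line is what lets the second link be found despite the newline inside the tag.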
Example 6-5 shows this basic idea fleshed out to include support for fetching a remote document, matching each link in it, making each absolute, and calling a checker routine (currently a placeholder) on it.
#!/usr/bin/perl -w
# diary-link-checker - check links from diary page
use strict;
use LWP;
my $doc_url = "http://chichi.diaries.int/stuff/diary.html";
my $document;
my $browser;
init_browser( );
{  # Get the page whose links we want to check:
  my $response = $browser->get($doc_url);
  die "Couldn't get $doc_url: ", $response->status_line
    unless $response->is_success;
  $document = $response->content;
  $doc_url = $response->base;
  # In case we need to resolve relative URLs later
}
while ($document =~ m/href\s*=\s*"([^"\s]+)"/gi) {
  my $absolute_url = absolutize($1, $doc_url);
  check_url($absolute_url);
}
sub absolutize {
  my($url, $base) = @_;
  use URI;
  return URI->new_abs($url, $base)->canonical;
}
sub init_browser {
  $browser = LWP::UserAgent->new;
  # ...And any other initialization we might need to do...
  return $browser;
}
sub check_url {
  # A temporary placeholder...
  print "I should check $_[0]\n";
}
When run, this prints:
I should check http://www.freshair.com/
I should check http://www.cs.Helsinki.FI/u/torvalds/
I should check http://www.linux.org/
I should check http://chichi.diaries.int/stuff/why_I_love_tetris.html
I should check http://chichi.diaries.int/minesweeper_hints/
So our while (regexp) loop is indeed successfully matching all five links in the document. (Note that our absolutize routine is correctly making the URLs absolute, turning why_I_love_tetris.html into http://chichi.diaries.int/stuff/why_I_love_tetris.html and ../minesweeper_hints/ into http://chichi.diaries.int/minesweeper_hints/, by using the URI class that we explained in Chapter 4, "URLs".)
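The absolutizing step can also be tried in isolation, with no network access. The following is a small sketch that reuses the absolutize logic from the listing and feeds it the diary page's URL as the base; the list of links passed in is chosen here for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Same logic as the absolutize routine in the listing: resolve $url
# against $base, then normalize the result.
sub absolutize {
    my ($url, $base) = @_;
    return URI->new_abs($url, $base)->canonical;
}

my $base = "http://chichi.diaries.int/stuff/diary.html";

# Two relative links and one already-absolute link (illustrative choice).
for my $link ("why_I_love_tetris.html",
              "../minesweeper_hints/",
              "http://www.linux.org/") {
    print "$link => ", absolutize($link, $base), "\n";
}
```

The two relative links should come out as http://chichi.diaries.int/stuff/why_I_love_tetris.html and http://chichi.diaries.int/minesweeper_hints/, matching the output above, while the already-absolute link passes through unchanged.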
Now that we're satisfied that our program is matching and absolutizing links correctly, we can drop in the check_url routine from Example 6-4, and it will actually check the URLs that our placeholder check_url routine promised we'd check.
Copyright © 2002 O'Reilly & Associates. All rights reserved.