Book HomePerl & LWPSearch this book

12.3. Example: A Link-Checking Spider

So far in the book, we've produced little single-use programs that are for specific tasks. In this section, we will diverge from that approach by walking through the development of a Type Three Requester robot whose internals are modular enough that with only minor modification, it could be used as any sort of Type Three or Type Four Requester.

12.3.1. The Basic Spider Logic

The specific task for our program is checking all the links in a given web site. This means spidering the site, i.e., requesting every page in the site. To do that, we request a page in the site (or a few pages), then consider each link on that page. If it's a link to somewhere offsite, we should just check it. If it's a link to a URL that's in this site, we will not just check that the URL is retrievable, but in fact retrieve it and see what links it has, and so on, until we have gotten every page on the site and checked every link.

So, for example, if I start the spider out at http://www.mybalalaika.com/oggs/, it will request that page, get back HTML, and analyze that HTML for links. Suppose that page contains only three links:

http://bazouki-consortium.int/
http://www.mybalalaika.com/oggs/studio_credits.html
http://www.mybalalaika.com/oggs/plinky.ogg

We can tell that the first URL is not part of this site; in fact, we will define "site" in terms of URLs, so a URL is part of this site if starts with this site's URL. So because http://bazouki-consortium.int doesn't start with http://www.mybalalaika.com/oggs/, it's not part of this site. As such, we can check it (via an HTTP HEAD request), but we won't actually look at its contents for links. However, the second URL, which is http://www.mybalalaika.com/oggs/studio_credits.html, actually does start with http://www.mybalalaika.com/oggs/, so it's part of this site and can be retrieved and scanned for links. Similarly, the third link, http://www.mybalalaika.com/oggs/plinky.ogg, does start with http://www.mybalalaika.com/oggs/, so it's part of this site and can be retrieved, and its HTML checked for links.

But I happen to know that http://www.mybalalaika.com/oggs/plinky.ogg is a 90-megabyte Ogg Vorbis (compressed audio) file of a 50-minute long balalaika solo, and it would be a very bad idea for our user agent to go getting this file, much less to try scanning it as HTML! So the way we'll save our robot from this bother is by having it HEAD any URLs before it GETs them. If HEAD reports that the URL is gettable (i.e., doesn't have an error status, nor a redirect) and that its Content-Type header says it's HTML (text/html), only then will we actually get it and scan its HTML for links.

We could always hardcode a list of strings such as .gif, .jpg, etc., including .ogg, such that any URL ending in any such string will be assumed to not be HTML. However, we could never know that our list is complete, so we must carefully avoid the possibility of ever downloading a massive binary file that our suffix list just didn't happen to catch.

Now, what to do if we check (or try to get) a URL, and we get an error status? We will have to make note of this in some way. Now, at bare minimum we could do something like have a hash called %notable_url_error, and when we see an error, we could do:

$notable_url_error{$url} = $response->status_code;

In fact, we will be a bit more ambitious in our program, by also making note of what links to what, so that in the end, instead of saying "something links to http://somebadurl.int, but it's 404 Not Found," we can list the URLs that link to it, so that those links can be fixed.

Incidentally, when we get http://www.mybalalaika.com/oggs/studio_credits.html and scan its HTML, suppose it contains a link to http://www.mybalalaika.com/oggs/. We shouldn't go and request that URL, because we've already been there. So we'll need to keep track of what we've already seen. This is as simple as having a hash %seen_url_before, and when we see a URL, if we see $seen_url_before{$url} is true, we'll skip it. But if it's false, we know we haven't dealt with this URL before, so we can set $seen_url_before{$url} = 1 and go deal with it, for what we can be sure will be the only time this session.

12.3.2. Overall Design in the Spider

Now that we've settled on the basic logic behind the spider, we can start coding. For example, our idea of how to process a URL is expressed as this simple routine:

sub process_url {
  my $url = $_[0];
  if( near_url($url) )   { process_near_url($url) }
  else                   { process_far_url($url) }
  return;
}

This is the first of the two dozen routines (mostly small) that make up this spider framework, and clearly it requires us to write three more routines, near_url( ), process_near_url( ), and process_far_url( ). But before we go further, we must consider the question of how we would interact with the program. Ideally, we can just write it as a command-line utility that we start up and let run, and in the end it will email us. So, in theory, we could call it like so:

% thatdarnedbot http://mybazouki.com/ | mail $USER &

Then we don't have to think about it again until the program finishes and the report it generates comes to our mailbox. But that is like tightrope-walking without a net, because suppose we get email from someone saying "Hey, wassamatta you? A bot from your host just spent a solid hour hammering my server, checking the same links over and over again! Fix it!" But if all we have is a bad links report, we'll have no idea why the bot visited his site, whether it did indeed request "the same links" over and over, or even what URLs it visited (aside from the ones we see in our bad links report), so we'd have no idea how to fix the problem.

To avoid that situation, we must build logging into the spider right from the beginning. We'll implement this with two basic routines: say( ), used for important messages, and mutter( ), used for less important messages. When we have a part of the program call say( ), like so:

say("HEADing $url\n");

That is a message that we'll save in a log file, as well as write to STDOUT for the edification of the user who's watching the process. We can call mutter( ), like so:

mutter("  That was hit #$hit_count\n");

That message will be saved to the log file (in case we need it), but isn't considered important enough to send to STDOUT, unless of course the user is running this program with a switch that means "say everything to STDOUT, no matter how trivial."

And because it's helpful to know not just what happened but when, we'll make say( ) and mutter( ) emit a timestamp, unless it's the same time as the last thing we said or muttered. Here are the routines:

my $last_time_anything_said;
sub say {
  # Add timestamps as needed:
  unless(time( ) == ($last_time_anything_said || 0)) {
    $last_time_anything_said = time( );
    unshift @_, "[T$last_time_anything_said = " .
      localtime($last_time_anything_said) . "]\n";
  }
  print LOG @_ if $log;
  print @_;
}

my $last_time_anything_muttered;
sub mutter {
  # Add timestamps as needed:
  unless(time( ) == ($last_time_anything_muttered || 0)) {
    $last_time_anything_muttered = time( );
    unshift @_, "[T$last_time_anything_muttered = " .
      localtime($last_time_anything_muttered) . "]\n";
  }
  print LOG @_ if $log;
  print @_ if $verbose;
}

This relies on a flag $log (indicating whether we're logging), a filehandle LOG (open on our log file, if we are logging), and a flag $verbose that signals whether mutter messages should go to STDOUT too. These variables will be set by code that you'll see in the complete listing at the end of this chapter, which simply gets those values from @ARGV using the standard Perl module Getopt::Std.

With those two logging routines in place, we can return to our first substantial routine, here repeated:

sub process_url {
  my $url = $_[0];
  if (near_url($url))   { process_near_url($url) }
  else                  { process_far_url($url) }
  return;
}

Not only does this implicate near_url(), process_near_url(), and process_far_url(), but it also begs the question: what will actually call process_url()? We will implement the basic control of this program in terms of a schedule (or queue) of URLs that need to be processed. Three things need to be done with the schedule: we need a way to see how many entries there are in it (at least so we can know when it's empty); we need to be able to pull a URL from it, to be processed now; and we need a way to feed a URL into the schedule. Call those functions schedule_count( ), next_scheduled_url( ), and schedule($url) (with code that we'll define later on), and we're in business. We can now write the main loop of this spider:

my $QUIT_NOW;
 # a flag we can set to indicate that we stop now!
 
sub main_loop {
  while(
    schedule_count( )
    and $hit_count < $hit_limit
    and time( ) < $expiration
    and ! $QUIT_NOW
  ) {
    process_url( next_scheduled_url( ) );
  }
  return;
}

This assumes we've set $hit_limit (a maximum number of hits that this bot is allowed to perform on the network) and $expiration (a time after which this bot must stop running), and indeed our @ARGV processing will get those from the command line. But once we know that's the program's main loop, we know that the program's main code will just be the processing of switches in @ARGV, followed by this code:

initialize( );
process_starting_urls(@ARGV);
main_loop( );
report( ) if $hit_count;
say("Quitting.\n");
exit;

And from this point on, the design of the program is strictly top-down stepwise refinement, just fleshing out the details of the remaining routines that we have mentioned but not yet defined.

12.3.3. HEAD Response Processing

Consider our basic routine, repeated again:

sub process_url {
  my $url = $_[0];
  if( near_url($url) )   { process_near_url($url) }
  else                   { process_far_url($url) }
  return;
}

The first thing this needs in a function that, given a URL, can tell whether it's "near" or not, i.e., whether it's part of this site. Because we've decided that a URL is part of this site only if it starts with any of the URLs with which we started this program, just as http://www.mybalalaika.com/oggs/studio_credits.html starts with http://www.mybalalaika.com/oggs/, but http://bazouki-consortium.int/ doesn't. This is a simple matter of using substr( ):

my @starting_urls;

sub near_url {   # Is the given URL "near"?
  my $url = $_[0];
  foreach my $starting_url (@starting_urls) {
    if( substr($url, 0, length($starting_url))
     eq $starting_url
     # We assume that all URLs are in canonical form!
    ) {
      mutter("  So $url is near\n");
      return 1;
    }
  }
  mutter("  So $url is far\n");
  return 0;
}

We will have to have fed things into @starting_urls first, and we can do that in the process_starting_urls( ) routine that gets called right before we start off the program's main loop. That routine needn't do anything more than this:

sub process_starting_urls {
  foreach my $url (@_) {
    my $u = URI->new($url)->canonical;
    schedule($u);
    push @starting_urls, $u;
  }
  return;
}

Note that we feed URLs through the canonical method, which converts a URL to its single most "proper" form; i.e., turning any capital letters in the hostname into lowercase, removing a redundant :80 port specification at the end of the hostname, and so on. We'll use the canonical method throughout this program when dealing with URLs. If we had failed to use the canonical method, we would, for example, not know that http://nato.int, http://NATO.int/ and http://nato.int:80/ all certainly denote the same thing, in that they all translate to exactly the same request to exactly the same server.

To get process_url( ) fleshed out fully, we need to define process_near_url($url) and process_far_url($url). We'll start with the first and simplest one. Processing a "far" URL (one that's not part of any site we're spidering, but is instead a URL we're merely checking the validity of), is a simple matter of HEADing the URL.

my $robot;

sub process_far_url {
  my $url = $_[0];
  say("HEADing $url\n");
  ++$hit_count;
  my $response = $robot->head($url, refer($url));
  mutter("  That was hit #$hit_count\n");
  consider_response($response);  # that's all we do!
  return;
}

The minor routine refer($url) should generate a Referer header for this request (or no header at all, if none can be generated). This is so if our request produces a 404 and this shows up in the remote server's hit logs, that server's webmaster won't be left wondering "What on Earth links to that broken URL?" This routine merely checks the hash-of-hashes $points_to{$url}{$any_from_url}, and either returns empty list (for no header) if there's no entry for $url, or Referer => $some_url if there is an entry.

my %points_to;

sub refer {
  # Generate a good Referer header for requesting this URL.
  my $url = $_[0];
  my $links_to_it = $points_to{$url};
   # the set (hash) of all things that link to $url
  return( ) unless $links_to_it and keys %$links_to_it;

  my @urls = keys %$links_to_it; # in no special order!
  mutter "  For $url, Referer => $urls[0]\n";
  return "Referer" => $urls[0];
}

The more important routine consider_response( ) is where we will have to mull over the results of process_far_url( )'s having headed the given URL. This routine should decide what HTTP statuses are errors, and not all errors are created equal. Some are merely "405 Method Not Allowed" errors from servers or CGIs that don't understand HEAD requests; these apparent errors should presumably not be reported to the user as broken links. We could just define this routine like so:

sub consider_response {
  # Return 1 if it's successful, otherwise return 0
  my $response = $_[0];
  mutter("  ", $response->status_line, "\n");
  return 1 if $response->is_success;
  note_error_response($response);
  return 0;
}

We then further break down the task of deciding what errors are worthy of reporting and delegate that to a note_error_response( ) routine:

my %notable_url_error;  # URL => error messageS

sub note_error_response {
  my $response = $_[0];
  return unless $response->is_error;

  my $code = $response->code;
  my $url = URI->new( $response->request->uri )->canonical;

  if(  $code == 404 or $code == 410 or $code == 500  ) {
    mutter(sprintf "Noting {%s} error at %s\n",
           $response->status_line, $url );
    $notable_url_error{$url} = $response->status_line;
  } else {
    mutter(sprintf "Not really noting {%s} error at %s\n",
           $response->status_line, $url );
  }
  return;
}

This note_error_response( ) only really notes (in %notable_url_error) error messages that are 404 "Not Found", 410 "Gone", or 500 (which could be any number of things, from LWP having been unable to DNS the hostname, to the server actually reporting a real 500 error on a CGI). Among the errors that this is meant to avoid reporting is the 403 "Forbidden" error, which is what LWP::RobotUA generates if we try accessing a URL that we are forbidden from accessing by that server's robots.txt file. In practice, if you base a spider on this code, you should routinely consult the logs (as generated by the above calls to mutter) to see what errors are being noted, versus what kinds of errors are being "not really noted." This is an example of how each will show up in the log:

[T1017138941 = Tue Mar 26 03:35:41 2002]
  For http://www.altculture.com/aentries/a/absolutely.html, Referer \
  => http://www.speech.cs.cmu.edu/~sburke/
[T1017139042 = Tue Mar 26 03:37:22 2002]
  That was hit #10
  500 Can't connect to www.altculture.com:80 (Timeout)
Noting {500 Can't connect to www.altculture.com:80 (Timeout)} error \
  at http://www.altculture.com/aentries/a/absolutely.html
[T1017139392 = Tue Mar 26 03:43:12 2002]
HEADing http://www.amazon.com/exec/obidos/ASIN/1565922840
  For http://www.amazon.com/exec/obidos/ASIN/1565922840, Referer \
  => http://www.speech.cs.cmu.edu/~sburke/pub/perl.html
[T1017139404 = Tue Mar 26 03:43:24 2002]
That was hit #51
405 Method Not Allowed
Not really noting {405 Method Not Allowed} error at \
  http://www.amazon.com/exec/obidos/ASIN/1565922840

12.3.4. Redirects

Implicit in our consider_request( ) function, above, is the idea that something either succeeded or was an error. However, there is an important and frequent middle-ground in HTTP status codes: redirection status codes.

Normally, these are handled internally by the LWP::UserAgent/LWP::RobotUA object, assuming that we have left that object with its default setting of following redirects wherever possible. But do we want it following redirects at all? There's a big problem with such automatic redirect processing: if we request a URL with options appropriate for a "far" URL, and it redirects to a URL that's part of our site, we've done the wrong thing. Or, going the other way, if we GET a URL that's part of our site, and it redirects to a "far" URL, we'll have broken our policy of never GETting "far" URLs.

The solution is to turn off automatic redirect following for the $robot that we use for HEADing and GETting (by calling $robot->requests_redirectable([]) when we initialize it), and to deal with redirects ourselves, in an expanded consider_response( ) routine, like so:

sub consider_response {
  # Return 1 if it's successful, otherwise return 0
  my $response = $_[0];
  mutter("  ", $response->status_line, "\n");
  return 1 if $response->is_success;

  if($response->is_redirect) {
    my $to_url = $response->header('Location');
    if(defined $to_url and length $to_url and 
      $to_url !~ m/\s/
    ) {
      my $from_url = $response->request->uri;
      $to_url = URI->new_abs($to_url, $from_url);
      mutter("Noting redirection\n  from $from_url\n",
        "    to $to_url\n");
      note_link_to( $from_url => $to_url );
    }
  } else {
    note_error_response($response);
  }

  return 0;
}

By now we have completely fleshed out process_url( ) and everything it calls, except for process_near_url( ) and the less-important note_link_to( ) routine. Processing "near" (in-site) URLs is just an elaboration of what we do to "far" URLs. As discussed earlier, we will HEAD this URL, and if it's a successful URL (as shown by the return value of consider_response( ), remember!), and if it will contain HTML, we GET it and scan its content for links. The fully defined function seems long, but only because of our many calls to say( ) and mutter( ), and all our sanity checking, such as not bothering to GET the URL if the HEAD actually returned content, as happens now and then.

sub process_near_url {
  my $url = $_[0];
  mutter("HEADing $url\n");
  ++$hit_count;
  my $response = $robot->head($url, refer($url));
  mutter("  That was hit #$hit_count\n");
  return unless consider_response($response);

  if($response->content_type ne 'text/html') {
    mutter("  HEAD-response says it's not HTML!  Skipping ",
        $response->content_type, "\n");
    return;
  }
  if(length ${ $response->content_ref }) {
    mutter("  Hm, that had content!  Using it...\n" );
    say("Using head-gotten $url\n");
  } else {
    mutter("It's HTML!\n");
    say("Getting $url\n");
    ++$hit_count;
    $response = $robot->get($url, refer($url));
    mutter("  That was hit #$hit_count\n");
    return unless consider_response($response);
  }
  if($response->content_type eq 'text/html') {
    mutter("  Scanning the gotten HTML...\n");
    extract_links_from_response($response);
  } else {
    mutter("  Skipping the gotten non-HTML (",
      $response->content_type, ") content.\n");
  }
  return;
}

All the routines this uses are already familiar, except extract_links_from_response( ).

12.3.5. Link Extraction

Our extract_links_from_response( ) routine has to take a successful HTTP::Response object containing HTML and extract the URLs from the links in it. But in practice, "link" can be an imprecise term. Clearly, this constitutes a link:

<a href="pie.html">I like pie!</a>

But what about the area element here?

<map>
 ...
 <area shape="rect" href="pie.html" coords="0,0,80,21">
 ...
</map>

Or what about the frame element here?

<frameset rows="*,76">
 ...
 <frame src="pie.html" name="eat_it">
 ...
</frameset>

And what about the background attribute value here?

<body bgcolor="#000066" background="images/bg.gif" ... >

You will have to decide for each kind of spider task what sort of links it should be interested in and implement a different extract_links_from_response( ) accordingly. For purposes of simplicity, we'll consider only <a href="..."> tags to be links. This is easy to implement using the HTML::TokeParser approach we covered in Chapter 7, "HTML Processing with Tokens" and using the URI class we covered in Chapter 4, "URLs".

use HTML::TokeParser;
use URI;

sub extract_links_from_response {
  my $response = $_[0];

  my $base = URI->new( $response->base )->canonical;
    # "canonical" returns it in the one "official" tidy form

  my $stream = HTML::TokeParser->new( $response->content_ref );
  my $page_url = URI->new( $response->request->uri );

  mutter( "Extracting links from $page_url\n" );

  my($tag, $link_url);
  while( $tag = $stream->get_tag('a') ) {
    next unless defined($link_url = $tag->[1]{'href'});
    next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
    next unless length $link_url; # sanity check!
  
    $link_url = URI->new_abs($link_url, $base)->canonical;
    next unless $link_url->scheme eq 'http'; # sanity
  
    $link_url->fragment(undef); # chop off any "#foo" part
    note_link_to($page_url => $link_url)
      unless $link_url->eq($page_url); # Don't note links to itself!
  }
  return;
}

This does lots of sanity checking on the href attribute value but ends up feeding to note_link_to( ) new (absolute) URI objects for URLs such as http://bazouki-consortium.int/ or http://www.mybalalaika.com/oggs/studio_credits.html, while skipping non-HTTP URLs such as mailto:info@mybalalaika.com, as well as invalid URLs that might arise from parsing bad HTML.

This is about as complex as our spider code gets, and it's easy from here on.

12.3.6. Fleshing Out the URL Scheduling

So far we've used a note_link_to( ) routine twice. That routine need only do a bit of accounting to update the %points_to hash we mentioned earlier and schedule this URL to be visited.

sub note_link_to {
  my($from_url => $to_url) = @_;
  $points_to{ $to_url }{ $from_url } = 1;
  mutter("Noting link\n  from $from_url\n    to $to_url\n");
  schedule($to_url);
  return;
}

That leaves routines such as schedule( ) left to write. As a reminder, three things need to be done with the schedule (as we're calling the big set of URLs that need to be visited). We need a way to see how many entries there are in it with schedule_count( ) (at least so main_loop( ) can know when it's empty). We'll need to pull a URL from the schedule with next_scheduled_url(), so main_loop() can feed it to process_url(). And we need a way to feed a URL into the schedule, with schedule($url), as called from note_link_to( ) and process_starting_urls( ).

A simple Perl array is a perfectly sufficient data structure for our schedule, so we can write schedule_count( ) like so:

my @schedule;
sub schedule_count     { return scalar @schedule }

The implementation of next_scheduled_url( ) depends on exactly what we mean by "next." If our @schedule is a proper stack, scheduling a URL means we push @schedule, $url, and next_scheduled_url( ) is just a matter of $url = pop @schedule. If our @schedule is a proper queue, then scheduling a URL means we push @schedule, $url, and next_scheduled_url( ) is just a matter of $url = shift @schedule.

Both of these approaches make our spider quite predictable, in the sense that when run on the same site, it will always do the same things in the same order. This could theoretically be an advantage for debugging, and would be a necessary feature if we were trying to debug without the benefit of the logging we've written into the spider.

However, that predictability is also a problem: if the spider happens on a page with dozens of slow-responding URLs, it could spend the rest of its life trying to check those links; i.e., until main_loop( ) quits because $hit_count reaches $hit_limit or because time( ) reaches $expiration. In practice, this problem is greatly alleviated (although not completely eliminated) by pulling URLs not from the beginning or end of @schedule, but instead from a random point in it:

sub next_scheduled_url {
  my $url = splice @schedule, rand(@schedule), 1;

  mutter("\nPulling from schedule: ", $url || "[nil]",
    "\n  with ", scalar(@schedule),
    " items left in schedule.\n");
  return $url;
}

This leaves us with the schedule($url) routine to flesh out. It would be as simple as:

sub schedule {
  my $url = $_[0];
  push @schedule, URI->new($url);
  return;
}

However, we don't do much sanity checking on URLs everywhere else, so we need to do lots of it all here. First off, we need to make sure we don't schedule a URL that we've scheduled before. Not only does this keep there from being duplicates in @schedule at any one time, it means we never process the same URL twice in any given session.

Second off, we want to skip non-HTTP URLs, because other schemes (well, except HTTPS) aren't HEADable and don't have MIME types, two things our whole spider logic depends on. Moreover, we probably want to skip URLs that have queries (http://foo.bar/thing?baz) because those are usually CGIs, which typically don't understand HEAD requests. Moreover, we probably want to skip HTTP URLs that inexplicably have userinfo components (http://joeschmo@foo.bar/thing), which are typically typos for FTP URLs, besides just being bizarre.

We also want to regularize the hostname, so we won't think http://www.Perl.com/, http://www.perl.com/, and http://www.perl.com./ are all different hosts, to be visited separately. We also want to skip URLs that are too "deep," such as http://www.foo.int/docs/docs/docs/docs/docs/docs/about.html, which are typically a sign of a wild symlink or some other similar problem. We also want to skip unqualified hostnames, such as http://www/ or http://mailhost/, and URLs with path weirdness, such as http://thing.com/./././//foo.html. Then we chop off any #foo fragment at the end of the URL, and finally add the URL to @schedule if it's new.

All that sort of sanity checking adds up to this:

my %seen_url_before;

sub schedule {
  # Add these URLs to the schedule
  foreach my $url (@_) {
    my $u = ref($url) ? $url : URI->new($url);
    $u = $u->canonical;  # force canonical form
 
    next unless 'http' eq ($u->scheme || '');
    next if defined $u->query;
    next if defined $u->userinfo;

    $u->host( regularize_hostname( $u->host( ) ) );
    return unless $u->host( ) =~ m/\./;

    next if url_path_count($u) > 6;
    next if $u->path =~ m<//> or $u->path =~ m</\.+(/|$)>;

    $u->fragment(undef);

    if( $seen_url_before{ $u->as_string }++ ) {
      mutter("  Skipping the already-seen $u\n");
    } else {
      mutter("  Scheduling $u\n");
      push @schedule, $u;
    }
  }
  return;
}

All we need is the routine that regularizes a given hostname:

sub regularize_hostname {
  my $host = lc $_[0];
  $host =~ s/\.+/\./g; # foo..com => foo.com
  $host =~ s/^\.//;    # .foo.com => foo.com
  $host =~ s/\.$//;    # foo.com. => foo.com
  return 'localhost' if $host =~ m/^0*127\.0+\.0+\.0*1$/;
  return $host;
}

then a routine that counts the number of /-separated parts in the URL path:

sub url_path_count {
  # Return 4 for "http://foo.int/fee/fie/foe/fum"
  #                               1   2   3   4
  my $url = $_[0];
  my @parts = $url->path_segments;
  shift @parts if @parts and $parts[ 0] eq '';
  pop   @parts if @parts and $parts[-1] eq '';
  return scalar @parts;
}

12.3.7. The Rest of the Code

That's a fully functioning checker-spider—at least once you add in the boring switch processing, initialize( ), and the report( ) that dumps the contents of %notable_url_error, which are as follows:

use strict;
use warnings;
use URI;
use LWP;
 
# Switch processing:
my %option;
use Getopt::Std;
getopts('m:n:t:l:e:u:t:d:hv', \%option) || usage_quit(1);
usage_quit(0) if $option{'h'} or not @ARGV;
 
sub usage_quit {
  # Emit usage message, then exit with given error code.
  print <<"END_OF_MESSAGE"; exit($_[0] || 0);
Usage:
$0  [switches]  [urls]
  This will spider for bad links, starting at the given URLs.
   
Switches:
 -h        display this help message
 -v        be verbose in messages to STDOUT  (default off)
 -m 123    run for at most 123 minutes.  (default 20)
 -n 456    cause at most 456 network hits.  (default 500)
 -d 7      delay for 7 seconds between hits.  (default 10)
 -l x.log  log to text file x.log. (default is to not log)
 -e y\@a.b  set bot admin address to y\@a.b  (no default!)
 -u Xyz    set bot name to Xyz.  (default: Verifactrola)
 -t 34     set request timeout to 34 seconds.  (default 15)
 
END_OF_MESSAGE
}
 
my $expiration = ($option{'m'} ||  20) * 60 + time( );
my $hit_limit  =  $option{'h'} || 500;
my $log        =  $option{'l'};
my $verbose    =  $option{'v'};
my $bot_name   =  $option{'u'} || 'Verifactrola/1.0';
my $bot_email  =  $option{'e'} || '';
my $timeout    =  $option{'t'} || 15;
my $delay      =  $option{'d'} || 10;
die "Specify your email address with -e\n"
  unless $bot_email and $bot_email =~ m/\@/;
 
my $hit_count = 0;
my $robot;  # the user-agent itself
 
# Then the top-level code we've already seen:
initialize( );
process_starting_urls(@ARGV);
main_loop( );
report( ) if $hit_count;
say("Quitting.\n");
exit;
 
sub initialize {
  init_logging( );
  init_robot( );
  init_signals( );
  return;
}
 
sub init_logging {
  my $selected = select(STDERR);
  $| = 1; # Make STDERR unbuffered.
  if($log) {
    open LOG, ">>$log" or die "Can't append-open $log: $!";
    select(LOG);
    $| = 1; # Make LOG unbuffered
  }
  select($selected);
  print "Logging to $log\n" if $log;
  return;
}
 
sub init_robot {
  use LWP::RobotUA;
  $robot = LWP::RobotUA->new($bot_name, $bot_email);
  $robot->delay($delay/60); # "/60" to do seconds->minutes
  $robot->timeout($timeout);
  $robot->requests_redirectable([]);
    # don't follow any sort of redirects
  $robot->protocols_allowed(['http']);  # disabling all others
  say("$bot_name ($bot_email) starting at ", scalar(localtime), "\n");
  return;
}
 
sub init_signals {  # catch control-C's
  $SIG{'INT'} = sub { $QUIT_NOW = 1; return;};
   # That might not be emulated right under MSWin.
  return;
}
 
 
sub report {  # This that gets run at the end.
  say(
    "\n\nEnding at ", scalar(localtime),
    " after ", time( ) - $^T,
    "s of runtime and $hit_count hits.\n\n",
  );
  unless(keys %notable_url_error) {
    say( "\nNo bad links seen!\n" );
    return;
  }
 
  say( "BAD LINKS SEEN:\n" );
  foreach my $url (sort keys %notable_url_error) {
    say( "\n$url\n  Error: $notable_url_error{$url}\n" );
    foreach my $linker (sort keys %{ $points_to{$url} } ) {
      say( "  < $linker\n" );
    }
  }
  return;
}

And that's all of it!



Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.