Marco shows you how to read or download only the parts of a web page that interest you.
There are many web browsers and FTP clients for Linux, all rich in features and able to satisfy all users, from command-line fanatics to 3-D multiscreen desktop addicts. They all share one common defect, however: you have to be at the keyboard to drive them. Of course, fine tools like wget can mirror a whole site while you sleep, but you still have to find the right URL first, and when it's finished you must read through every bit that was downloaded anyway.
With small, static sites, it's no big deal, but what if every day you want to download a page that is given a random URL? Or what if you don't want to read 100K of stuff just to scroll a few headlines?
Enter client-side web scripting, i.e., all the techniques that allow you to spend time only looking at web pages (or parts of them) that interest you, and only after your computer has found them for you. With such scripts you could read only the traffic or weather information related to your area, download only certain pictures from a web page or automatically find the single link you need.
Besides saving time, client-side web scripting lets you learn about some important issues and teaches you some self-discipline. For one thing, doing indiscriminately what is explained here may be considered copyright infringement in some cases, or may consume so much bandwidth that your internet account gets shut down, or worse. On the other hand, this freedom to surf is possible only as long as web pages remain written in nonproprietary languages (HTML/XML), stored as nonproprietary plain ASCII.
Finally, many fine sites can survive and remain available at no cost only if they send out enough banners, so all this really should be applied with moderation.
As usual, before doing something from scratch, one should check what has already been done and reuse it, right? A quick search on Freshmeat.net for “news ticker” returns 18 projects, from Kticker to K.R.S.S to GKrellM Newsticker.
These are all very valid tools, but they only fetch news, so they won't work unchanged for other tasks. Furthermore, they are almost all graphical tools, not something you can run as a cron entry, perhaps piping the output to some other program.
In this field, in order to scratch only your very own itch, it is almost mandatory to write something for yourself. This is also the reason why we don't present any complete solution here, but rather discuss the general methodology.
The only prerequisites to take advantage of this article are to know enough Perl to put together some regular expressions and to have the following Perl modules installed: LWP::UserAgent, LWP::Simple, HTML::Parse, HTML::Element, URI::URL and Image::Grab. You can fetch these from CPAN (www.cpan.org). Remember that, even if you do not have the root password of your system (typically on your office computer), you still can install them in a directory of your choice, as explained in the Perl documentation and the relevant README files.
Everything in this article has been tested under Red Hat Linux 7.2, but, after changing the absolute paths present in the code, it should work on every UNIX system supporting Perl and the several external applications used.
All the tasks described below, and web-client scripting in general, require that you download, and store internally for further analysis, the whole content of some initial web page, its last-modification date, a list of all the URLs it contains or any combination of the above. All this information can be collected with a few lines of code at the beginning of each web-client script, as shown in Listing 1.
Listing 1. Collecting the Basic Information
The code starts with the almost mandatory “use strict” directive and then loads all the required Perl modules. Once that is done, we proceed to save the whole content of the web page in the $HTML_FILE variable via the get() method. With the instruction that follows, we save each field of the HTTP header in one element of the @HEADER array. Finally, we define an array (@ALL_URLS), and with a for() loop, we extract and save inside it all the links contained in the original web page, making them absolute if necessary (with the abs() method). At the end of the loop, the @ALL_URLS array will contain all the URLs found in the initial document.
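A minimal sketch of the kind of code Listing 1 describes, put together from the description above, could look like this (the variable names $HTML_FILE, @HEADER and @ALL_URLS are those used in the rest of the article; every other detail is an educated guess, not the original listing):

#!/usr/bin/perl
use strict;

use LWP::Simple;
use HTML::Parse;
use HTML::Element;
use URI::URL;

my $URL = $ARGV[0];           # the initial web page, given on the command line

my $HTML_FILE = get($URL);    # whole content of the page
my @HEADER    = head($URL);   # HTTP header fields; $HEADER[2] is the
                              # last-modification time in epoch seconds

my @ALL_URLS;
my $PARSED = parse_html($HTML_FILE);
foreach my $LINK (@{ $PARSED->extract_links() }) {
    # make every link absolute with the abs() method and save it
    push @ALL_URLS, url($LINK->[0])->abs($URL);
}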
A complete description of the Perl methods used in this code, and much more, can be found in the book Web Client Programming (see Resources).
After having collected all this material, we can start to use it. If you simply want to save the content of a web page on your disk for later reading, you have to add a print instruction to the original script:
print $HTML_FILE;
And then run it from your shell prompt:
./webscript.pl http://www.fsf.org > fsf.html

This will allow you to save the whole page in the local file fsf.html. Keep in mind, however, that if this is all you want, wget is a better choice (see Resources, “Downloading without a Browser”).
If all the absolute URLs are already inside the @ALL_URLS array, we can download all the images with the following foreach loop:
foreach my $GRAPHIC_URL (grep /(gif|jpg|png)$/, @ALL_URLS) {
    $GRAPHIC_URL =~ m/([^\/]+)$/;
    my $BASENAME = $1;
    print STDERR "SAVING $GRAPHIC_URL in $BASENAME....\n";
    my $IMG = get ($GRAPHIC_URL);
    open (IMG_FILE, "> $BASENAME") || die "Failed opening $BASENAME\n";
    print IMG_FILE $IMG;
    close IMG_FILE;
}
The loop operates on all the URLs contained in the document ending with the .gif, .jpg or .png extension (extracted from the original array with the grep instruction). First, the regular expression finds the actual filename, defined as everything in the URL from the rightmost slash sign to the end; this should be generalized to deal with URLs hosted on those systems so twisted that even the directory separator is backward.
The result of the match is loaded in the $BASENAME variable, and the image itself is saved with the already known get() method inside $IMG. After that, we open a file with the proper name and print the whole thing inside it.
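The generalization hinted at above can be as simple as accepting either kind of slash as a directory separator; just a sketch, with the rest of the loop unchanged:

# accept both "/" and "\" as separators when extracting
# the filename from the URL
$GRAPHIC_URL =~ m/([^\/\\]+)$/;
my $BASENAME = $1;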
Of course, many times you will not be interested in all the images (especially because many of them usually will be advertising banners, the site logo or other uninteresting stuff). In situations like this, a simple look at the HTML source will help you figure out what sets the image you need apart from the rest. For example, you may find out that the interesting picture has a random name but is always the third one in the list. If this is the case, modify the previous loop as follows:
my $IMG_COUNT = 0;
my $WANTED_IMG = 3;
foreach my $GRAPHIC_URL (grep /(gif|jpg|png)$/, @ALL_URLS) {
    $IMG_COUNT++;
    next unless ($IMG_COUNT == $WANTED_IMG);
    # rest of loop as before.....
    last if ($IMG_COUNT == $WANTED_IMG);
}
print "FILE NOT FOUND TODAY\n" if ($IMG_COUNT != $WANTED_IMG);
The first instruction in the loop increments the image counter; the second jumps to the next iteration until we reach the third picture. The “last” instruction avoids unnecessary iterations once the picture has been saved, and the print after the loop reports that the script could not perform the copy because it found fewer than $WANTED_IMG pictures in the source code.
If the image name is not completely random, it's even easier because you can filter directly on it in the grep instruction at the beginning:
foreach my $GRAPHIC_URL (grep /\/daily\d+\.jpg$/, @ALL_URLS) {
This will loop only on files whose names start with the string “daily”, followed by one or more digits (\d+) and a .jpg extension.
The two techniques can be combined at will, and much more sophisticated things are possible. If you know that the picture name is equal to the page title plus the current date expressed in the YYYYMMDD format, first extract the title:
$HTML_FILE =~ m/<TITLE>([^<]+)<\/TITLE>/;
my $TITLE = $1;
Then calculate the date:
my ($sec, $min, $hour, $day, $month, $year, @dummy) = localtime(time);
$month++;        # months start at 0
$year += 1900;   # Y2K-compliant, of course ;-)))
my $TODAY = sprintf("%04d%02d%02d", $year, $month, $day);

And finally, filter on this:
foreach my $GRAPHIC_URL (grep /$TITLE$TODAY\.jpg$/, @ALL_URLS) {
Now it starts to get really interesting. Customizing your script to fetch only a certain section of the web page's text usually requires more time and effort than any other operation described here, because it must be done almost from scratch for each page and repeated whenever the page structure changes. If you have a slow internet connection, or even a fast one on which you cannot afford to slow down your MP3 downloads or net games, you will rapidly recover the time spent preparing the script. You also will save quite a bit of money, if you (like me) still pay per minute.
You have to open and study the HTML source of the original web page to figure out which Perl regular expression filters out all and only the text you need. The Perl LWP library already provides methods to extract all the text out of the HTML code. If you only want a plain ASCII version of the whole content, go for them.
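If that is the case, one way to do it (assuming the HTML::FormatText module, which is not among the prerequisites listed above) is a plain-ASCII rendering like this:

# plain-ASCII rendering of the whole page; HTML::FormatText
# is an extra assumption, not one of the modules listed above
use HTML::Parse;
use HTML::FormatText;

my $ASCII = HTML::FormatText->new(leftmargin => 0, rightmargin => 72)
                            ->format(parse_html($HTML_FILE));
print $ASCII;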
You may be tempted to let the LWP library extract the whole text from the source, and then work on it, even when you only want to extract some lines from the web page. I have found this method to be much more difficult to manage in real cases, however. Of course, the ASCII formatting makes the text immediately readable to a human, but it also throws out all the HTML markup that is so useful to tell the script which parts you want to save. The easiest example of this false start is if you want to save or display all and only the news titles, and they are marked in the source with the <H1></H1> tags. Those markers are trivial to use in a Perl regular expression, but once they are gone, it becomes much harder to make the script recognize headlines.
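To make the point concrete, as long as the markup is still there, a hypothetical headline extractor can be as short as this:

# grab everything between <H1> and </H1> tags, one headline per match
my @HEADLINES = ($HTML_FILE =~ m/<H1[^>]*>(.*?)<\/H1>/gsi);
print "$_\n" foreach @HEADLINES;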
To demonstrate the method on a real web page, let's try to print inside our terminal all the press-release titles from the FSF page at www.fsf.org/press/press.html. Pointing our script at this URL will save all its content inside the $HTML_FILE variable. Now, let's apply to it the following sequence of regular expressions (I suggest that you also look at that page and at its source code with your browser to understand everything going on):
$HTML_FILE =~ s/.*>Press Releases<//gsmi;
$HTML_FILE =~ s/.*<DL>//gsmi;
$HTML_FILE =~ s/<\/DL>.*$//gsmi;
$HTML_FILE =~ s/<dt>([^<]*)<\/dt>/-> $1: /gi;
$HTML_FILE =~ s/<dd><a href=[^>]*>([^<]*)<\/a>/ $1 /gsmi;
$HTML_FILE =~ s/\.\s+\([^\)]*\.\)<\/dd>/<DD>/gsmi;
$HTML_FILE =~ s/\s+/ /gsmi;
$HTML_FILE =~ s/<DD>/\n/gsmi;
The first three lines cut off everything before and after the actual press-release list. The fourth one finds the date and strips the HTML tags out of it. Regexes number five and six do the same thing to the press-release subject. The last two eliminate redundant white spaces and put new lines where needed. As of December 14, 2001, the output at the shell prompt looks like this (titles have been manually cut by me for better formatting):
-> 3 December 2001: Stallman Receives Prestigious...
-> 22 October 2001: FSF Announces Version 21 of the...
-> 12 October 2001: Free Software Foundation Announces...
-> 24 September 2001: Richard Stallman and Eben Moglen...
-> 18 September 2001: FSF and FSMLabs come to agreement...

The set of regular expressions above is not complete; for one thing, it doesn't manage news with update sections. One also should make it as independent as possible from extra spaces inside HTML tags or changes in the color or size of some fonts. This regular expression strips out all the font markup:
$HTML_FILES =~ s/<font face="Verdana" size="3"> ([^<]+)<\/font>/$1/g;This performs the same task but works on any font type and (positive) font size:
$HTML_FILES =~ s/<font face="[^"]+" size="\d+"> ([^<]+)<\/font>/$1/g;The example shown here, however, still is detailed enough to show the principle, and again the one-time effort to write a custom set for any given page really can save a lot of time.
Once you have managed to extract the text you want and to format it to your taste, there is no reason to limit yourself to a manual use of the script, or to use it only at the console for that matter. If you want to do something else and be informed by the computer only when a new headline about Stallman appears, only three more steps are needed.
First, put the script among your cron entries (man cron will tell you everything about this). After that, add the following check to your Perl script:
if ($HTML_FILE =~ m/Stallman/) {
    # INFORM ME!!!
}
This will do what you want only if the remaining text does contain the Stallman string (or whatever else you want to know about, of course).
Next, fill the block with something like this:
open (XMSG, "|/usr/bin/X11/xmessage -title \"NEWS!\" -file -") || die; print XMSG $HTML_FILE; close XMSG;
This will open a UNIX pipe to the xmessage program, which pops up a window with the title given by the corresponding switch and containing the text of the file following the -file option. In our case, “-” tells xmessage to get the text from its standard input. As it is, the Perl script will not exit until you close the xmessage window. This may or may not be what you want. In the case of a cron script, it's much better to let it start xmessage in the background on a temporary file and exit, like this:
open (XMSG, "> /tmp/gee") || die; print XMSG $HTML_FILE; close XMSG; exec "/usr/bin/X11/xmessage -title \"NEWS!\" -file /tmp/gee&";
If you want to process the page only if its content changed since your last visit, or in the last two hours, you need the Last-Modified HTTP header. It is already available, expressed in seconds since January 1, 1970, in the third element of our @HEADER array. Hence, if you want to do something only on pages modified in the last two hours, start by calculating what the time was two hours ago (in the same “seconds since...” unit):
my $NOW = time;
my $TWO_HOURS_AGO = $NOW - (3600 * 2);
Then compare that time with the modification date of the web page:
if ($HEADER[2] > $TWO_HOURS_AGO) {
    # do whatever is needed
}
This is one of the rare exceptions to the do-it-yourself rule stated at the beginning: download WMHeadlines (see Resources), install it, and then configure and modify it to suit your taste. Out of the box, it can fetch headlines from more than 120 sites and place them in the root menu of Blackbox, WindowMaker, Enlightenment and GNOME, in such a way that clicking on a dynamic menu entry starts your browser on the corresponding headline.
Netscape can be given several commands from the prompt or any script. Such commands will cause Netscape to start if it wasn't already running or will load the requested URL in the current window, or even in a new one. However, the commands to run change depending on whether Netscape is already running. Look at the nslaunch.pl script in the WMHeadlines distribution to figure out how to check if Netscape is already running.
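A rough alternative, without checking explicitly, is to try the -remote command first and start a new Netscape only if it fails; this is just a sketch of the idea, not the actual nslaunch.pl code:

# try to talk to an already running Netscape; if none answers,
# the -remote command exits with a nonzero status and we start one
my $NETSCAPE = '/usr/bin/netscape';
my $STATUS = system($NETSCAPE, '-remote', "openURL($URL,new-window)");
if ($STATUS != 0) {
    system("$NETSCAPE '$URL' &");   # no Netscape running: launch a new one
}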
You also can drive Netscape to perform other actions from a script: to print a page just as Netscape would do if driven manually, make it load the page first:
exec($NETSCAPE, '-noraise', '-remote', "openURL($URL,new-window)");
Then save it as PostScript:
exec($NETSCAPE, '-noraise', '-remote', "saveAs(/tmp/netscape.ps, PostScript)");And finally, print it:
exec("mpage -PYOURPRINTER -1 /tmp/netscape.ps");Or, even add it to the bookmarks:
exec($NETSCAPE, '-noraise', '-remote', "addBookmark($SOME_URL, $ITS_TITLE)");Konqueror, the KDE web browser, can be started simply by invoking it in this way:
system("/usr/bin/konqueror $URL");Konqueror can be driven by scripts for many nonweb-related tasks, such as copying files, starting applications and mounting devices. Type kfmclient --commands for more details.
Galeon can be started in an almost equal way:
system("/usr/bin/galeon $URL");
As explained in A User's Guide to Galeon (see Resources), you also can decide whether Galeon (if already running) should open the URL in a new tab:
system("/usr/bin/galeon -n $URL");in a new window:
system("/usr/bin/galeon -w $URL");or temporarily bookmark the $URL:
system("/usr/bin/galeon -t $URL");
The opposite approach, i.e., starting a generic mirroring or image-fetching script from your browser, is possible in Konqueror (or even KMail) during normal browsing. If you right click on a link and select the “Open with..” option, it will let you enter the path of the script to be used and add it to the choices next time. This means you can prepare a mirror or fetch_images script following the instructions given here and start it in the background on any URL you wish with a couple of clicks.
The URL list contained in the @ALL_URLS array also can be used to start mirroring or (parallel) FTP sessions. This can be done entirely in Perl, using the many FTP and mirroring modules available, or simply by collecting the URLs to be mirrored or fetched by FTP, and leaving the actual work to wget or curl, as explained in A. J. Chung's article, “Downloading without a Browser” (see Resources).
If your favorite web portal chooses a different cool site every day, and you want your PC to mirror it for you, just fetch the URL as you would do for images, and then say in your script:
exec "wget -m -L -t 5 $COMPLETE_URL";
All the commands for parallel FTP and mirroring explained in Chung's article can be started in this way from a Perl script, having as arguments the URLs found by this one.
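As a minimal sketch of this approach, assuming the @ALL_URLS array filled at the beginning of the article, one could start a separate background download for every FTP link found in the page:

# one background wget per FTP URL found in the original page
foreach my $FTP_URL (grep /^ftp:/, @ALL_URLS) {
    system("wget -t 5 '$FTP_URL' &");
}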
Many of us have more than one favorite site and would like to have them all in the same window. A general solution for this is to extract the complete HTML body of each page in this way:
$HTML_FILE =~ s/^.*<body[^>]*>//is;    # strips everything before the body
$HTML_FILE =~ s/<\/body[^>]*>.*$//is;  # strips everything after it
and then print out an HTML table with each original page in each box:
print <<END_TABLE;
....All HTML <HEAD> and <BODY> stuff here
<TABLE>
<TR><TD>$HTML_FILE_1</TD></TR>
<TR><TD>$HTML_FILE_2</TD></TR>
.........
</TABLE></BODY></HTML>
END_TABLE

Save the script output in $HOME/.myportal.html, set that file as your starting page in your browser and enjoy! The complete script will probably require quite some tweaking to clean up different CSSes, fonts and so on, but you know how to do it by now, right?
We have barely scratched the surface of client-side web scripting. Much more sophisticated tasks are possible, such as dealing with cookies and password-protected sites, automatic form submission, web searches with all the criteria you can think about, scanning a whole web site and displaying the ten most-pointed-to URLs in a histogram, and web-mail checking.
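Just to give the flavor of one of those tasks, automatic form submission with LWP boils down to something like this (the URL and the field name are made up for illustration):

# hypothetical form submission: the URL and field name are examples only
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $ua       = LWP::UserAgent->new;
my $response = $ua->request(POST 'http://www.example.com/search',
                            [ query => 'free software' ]);
print $response->content if $response->is_success;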
You only need some patience, Perl practice and a good knowledge of the relevant modules to succeed. Good browsing!