How to let the site visitor know which documents he hasn't seen.
Like many people, I spend a great deal of time on the Web. Some of that time is spent working—writing and debugging programs for various clients. I also spend quite a bit of time reading on the Web, keeping track of the latest news from the real world and the computer industry, and even exploring new sites that friends and colleagues have suggested I visit.
A common feature on web sites, one which never fails to annoy me, is the proliferation of graphics indicating which items are new. I don't mind the fact that the site's author is letting me know the most recently changed or added items. What bothers me is that these tags indicate whether the document is new in absolute terms, not whether the document is new for me.
When I visit a site for the first time, all of the documents should have a “new” indication, since all are new to me. When I return to the site, only those added since my previous visit should have the “new” graphic, perhaps including those modified since my last visit. In other words, the site should keep track of my usage patterns, rather than force me to remember whether I have read a particular file.
This month, we will take a look at this problem. Not only will we see how to create a web site that fails to annoy me in this particular way, but we will also look at some of the trade-offs that often occur when trying to handle site maintenance, service to the end user and program maintainability.
Now that I have disparaged the practice of putting “new” labels on a web site's links, let me demonstrate it, so we can have a clear starting point. Here is a simple page of HTML with two links on it, one with a “new” graphic next to it:
<HTML>
<Head><Title>Welcome to My Site</Title></Head>
<Body>
<H1>Welcome to My Site</H1>
<P>Read <a href="resume.html">my resume.</a></P>
<P>Read <a href="deathvalley.html"><img src="new.gif"> about my recent trip to Death Valley!</a></P>
</Body>
</HTML>
When the page's author decides enough time has passed, the “new” logo will go down. These labels are updated by modifying the HTML file, inserting or erasing the graphics as necessary.
This technique has a number of advantages, the main one being that the site requires less horsepower to run. Serving static text and graphics takes far less of the server's processor than running a CGI program, which requires additional memory as well as processing time.
However, this technique also has many disadvantages. First of all, the labels change only when the webmaster decides to modify the HTML file, rather than on an automatic basis. Secondly, the labels fail to take users' individual histories into account, meaning first-time users will see the same “new” labels as daily visitors.
How can we approach this problem? Let's begin with a simple solution that does not use personalization, but does provide more accurate labels than the above approach. We can auto-expire the labels, printing “new” during the first week a file is made available and “modified” the second week. Files more than two weeks old will not have a label.
The easiest way to do this is via server-side includes. SSIs execute as if they were CGI programs, but their output is inserted inside an HTML file. SSIs are useful when you want dynamic or otherwise programmable text inside an HTML file, but don't have enough dynamic output to justify burying the HTML inside a CGI program.
In this particular case, we can take advantage of Apache's advanced server-side include functionality, which allows us to execute a CGI program and insert its output into an HTML file. For example, we can slightly modify our file like this:
<HTML>
<Head><Title>Welcome to My Site</Title></Head>
<Body>
<H1>Welcome to My Site</H1>
<P>Read <a href="resume.html">my resume.</a></P>
<P>Read <a href="deathvalley.html">
<!--#include virtual="/cgi-bin/print-label.pl?deathvalley.html" -->
about my recent trip to Death Valley!</a></P>
</Body>
</HTML>
As you can see, the second link includes an SSI. One nice thing about SSIs is they look like HTML comments, so if you accidentally install an SSI-enabled file on a server that does not know how to parse them, the entire SSI will be ignored.
SSIs work thanks to a bit of magic: before the document is returned to the user's browser, it is interpreted by the server (hence the term “server-side includes”). Apache replaces all of the SSI commands with the result of their execution. This could mean printing something as simple as a file's modification date, but might be as complicated as inserting the results of a large database-access client invoked via CGI.
In the above example, we run the CGI program print-label.pl, the code for which is in Listing 1. While this program is run via SSI rather than a pure CGI call, it works just like a CGI program. We use CGI.pm, the standard Perl module for writing CGI programs, to retrieve the keywords parameter, which is another way of describing a parameter passed via the GET method following the question mark.
Once we have checked to make sure the file exists, we use the -M operator to ask Perl for the number of days that have passed since the file was last modified. If $ctime is less than 7, the file was modified within the last seven days, meaning the file should be considered “new” for our purposes. We use a font tag to tell the user that the file is new.
If we use SSI with each link on our site, the “New!” message will appear for all links less than one week old.
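Listing 1 is not reproduced in this text, so what follows is only a minimal sketch of how such a labeling program might look, not the listing itself. The names label_for_age and label_for_file, the default document root and the “Modified” wording are assumptions of mine. The sketch also reads QUERY_STRING directly, rather than going through CGI.pm's keywords parameter, so that it runs with nothing but core Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical document root; adjust for your own server.
my $DOCROOT = $ENV{DOCUMENT_ROOT} || '/usr/local/apache/htdocs';

# Map a file's age in days (as returned by -M) to a label.
sub label_for_age {
    my ($age_in_days) = @_;
    return '' unless defined $age_in_days;
    return qq{<font color="red">(New!)</font>}     if $age_in_days < 7;
    return qq{<font color="red">(Modified)</font>} if $age_in_days < 14;
    return '';                                  # older than two weeks
}

# Return the label for a file name given relative to the document root.
sub label_for_file {
    my ($docroot, $filename) = @_;
    return '' unless defined $filename && length $filename;
    return '' if $filename =~ m{\.\.};          # refuse path traversal
    my $path = "$docroot/$filename";
    return '' unless -e $path;                  # stay silent if missing
    return label_for_age(-M $path);
}

# When invoked via SSI, the file name follows the question mark.
if (defined $ENV{QUERY_STRING}) {
    print "Content-type: text/html\n\n";
    print label_for_file($DOCROOT, $ENV{QUERY_STRING}), "\n";
}
```

Keeping the age-to-label decision in its own subroutine makes it easy to change the one-week and two-week thresholds without touching the file handling.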
I considered several ways of handling errors within print-label.pl, including using Perl's die function to exit prematurely and print an error message on the user's screen. In the end, I decided the program should exit silently if the file does not exist, or if no file name is specified at all. You may wish to send a message to the error log, which can be accomplished from within a CGI program by printing to STDERR as follows:
print STDERR "No such file \"$filename\"\n";
A major problem with this arrangement is that CGI programs are inherently resource hogs. If we have ten links on a page, using this technique involves running ten CGI programs—which means launching ten new Perl processes each time we view this page. For now, we will ignore the performance implications and focus on how to get things working. I will discuss performance toward the end of this article and in greater depth next month.
The above technique is a good start, but it still ignores the user's perspective. That is, the labels expire on an absolute time scale. A user who visits the site less than once per week will see too few “new” labels, while someone who visits it more often than once per week will see too many.
How can we take care of that situation? One way is to keep track of when the user last visited our site, and make the comparison to that time stamp rather than to the file's creation or modification date. How can we know when the user last visited our site? Since HTTP, the protocol used to transport most Web documents, is “stateless”, each transaction takes place in a vacuum. When a web browser makes a request to a server, the request is not connected to any previous or following request. No information about previous requests is passed along, and nothing we do in our request is saved for later ones.
The best and easiest way is to use HTTP cookies, which nearly every browser supports. Cookies are variables set by the server and stored on the client's computer. Cookies allow us to track state across transactions by storing information on the user's computer. When the server next encounters the user, it can compare the time stamp on the cookie with the time stamp on the file.
Thus, we can rewrite the above program so that it auto-expires the labels based on when the user last visited the site. Each time the user visits our site, we set a cookie. The cookie's expiration date is set to be one week in the future, meaning that if the cookie exists, this user visited our site within the last week. Our labeling program (Listing 2, print-label.pl) then has a simple way to determine whether it should print “new” next to a link—the label should be printed only if the cookie does not exist.
Listing 2. print-label.pl with Cookie Check
Because we are using CGI.pm, which includes all necessary functionality for writing CGI programs, we can check whether the cookie exists in this way:
my $visited_recently = $query->cookie('RecentVisitor');
We can then print the label with the following code:
if (!$visited_recently) {
    print "<font color=\"red\">(New!)</font>\n";
}
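For readers who want to see what the cookie check amounts to underneath CGI.pm, here is a rough sketch that parses the raw Cookie header the server hands to a CGI program in the HTTP_COOKIE environment variable. The subroutine names parse_cookies and label_for_visitor are hypothetical, and the parsing is deliberately simple (it does no URL decoding), so treat this as an illustration rather than a replacement for CGI.pm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a raw Cookie header ("a=1; b=2") into a hash of name => value.
sub parse_cookies {
    my ($header) = @_;
    my %cookies;
    return %cookies unless defined $header;
    for my $pair (split /;\s*/, $header) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $cookies{$name} = defined $value ? $value : '';
    }
    return %cookies;
}

# Print the label only when the RecentVisitor cookie is absent.
sub label_for_visitor {
    my ($cookie_header) = @_;
    my %cookies = parse_cookies($cookie_header);
    return exists $cookies{RecentVisitor}
        ? ''
        : qq{<font color="red">(New!)</font>\n};
}

# Only produce output when actually running under a web server.
if ($ENV{GATEWAY_INTERFACE}) {
    print "Content-type: text/html\n\n";
    print label_for_visitor($ENV{HTTP_COOKIE});
}
```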
That about does it for reading the cookie. But how do we write the cookie? This is a stickier problem, one which has a number of potential solutions. The cookie specification requires that an expiration date be written with a full UNIX-style time and date stamp, as in
Thu Apr 8 02:25:30 IDT 1999
We cannot simply create and send a cookie with an expiration of “one week in the future”. We also have to figure out a way to set the cookie from within our HTML file—unless we want to use a CGI program to send the text, which would defeat the purpose of using SSIs to begin with.
One solution, although admittedly not the most elegant or efficient one, is to take advantage of the META tag supported by standard HTML. META tags have a number of uses, among them the ability to send data that would otherwise be sent in an HTTP header.
Since HTTP cookies are normally set by a Set-Cookie header in the server's HTTP response, it's possible to set the “RecentVisitor” cookie with the following HTML at the top of our page, within the <Head> section:
<META HTTP-EQUIV="Set-Cookie" CONTENT="RecentVisitor=1;expires=Thu Apr 15 02:19:17 1999; path=/">
This tells the browser it should pretend a Set-Cookie HTTP header was sent from the server, and the content attribute should be handled as if it were the header's value. That is, the above META tag sets the RecentVisitor cookie to 1 and, because of path=/, has the browser return the cookie with requests for any document in my domain. The cookie is set to expire on April 15, 1999.
Creating this META tag is a bit difficult, since the date depends on when the user loads the page. If the user loads the page on April 8, the cookie should be set to expire on April 15. If the user loads the page on April 10, the cookie should expire on April 17. We need to modify the output according to when the user visits.
The fact that the cookie's expiration date must change with time means we need to insert a program somewhere. The easiest way to do this is with another program invoked via SSI, which will create the META tag for us. Such a program, send-cookie.pl, is shown in Listing 3. With that installed and in place, we can say
<!--#include virtual="/cgi-bin/send-cookie.pl" -->
Our program, send-cookie.pl, sets the cookie's value by creating a META tag based on when the user accesses it. With this in place, each visit to our site will produce a cookie that disappears (or “crumbles”, if you prefer) within one week. Our SSI checks to see whether that cookie was sent, and if it was not, prints an appropriate “new” label.
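Listing 3 is not included in this text, so the following is a minimal sketch of what a program along the lines of send-cookie.pl might do, not the original listing. The subroutine name cookie_meta_tag and the $ONE_WEEK constant are my own; the date is formatted with Perl's scalar localtime, which matches the style of the example tag shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ONE_WEEK = 7 * 24 * 60 * 60;    # 604,800 seconds

# Build a META tag that sets RecentVisitor, expiring one week after $now.
sub cookie_meta_tag {
    my ($now) = @_;
    my $expires = scalar localtime($now + $ONE_WEEK);
    return qq{<META HTTP-EQUIV="Set-Cookie" }
         . qq{CONTENT="RecentVisitor=1; expires=$expires; path=/">};
}

# Only produce output when actually running under a web server.
if ($ENV{GATEWAY_INTERFACE}) {
    print "Content-type: text/html\n\n";
    print cookie_meta_tag(time), "\n";
}
```

Passing the current time in as an argument, rather than calling time inside the subroutine, makes the date arithmetic easy to check with a fixed value.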
The above approach has two major problems, one having to do with the user interface and the second with performance.
Let's address the user interface issue first. In short, what happens if the user reloads the page? The first time he viewed the page, the cookie was set with the META tag, regardless of whether the cookie had been set before. The next time the user loads the page, even if it is just a few seconds later, the “new” labels no longer appear, because the cookie has been set, indicating the user visited the site within the last week. We need a finer-grained method for keeping track of these labels.
The second is a more serious problem—the performance hit. In order to implement this solution, we need to invoke at least two CGI programs for each document on our system. Given how resource-hungry even the most innocent CGI programs can be, particularly when written in Perl, this adds a tremendous load to the web server. Add to this the time it takes to start up a Perl process and execute such an external program, and our users will suffer as well, unless we make a significant investment in hardware.
We can solve the user interface problem with the Text::Template module, written by Mark-Jason Dominus and recently re-released as version 1.20. This module, as is the case for most modules, is available from CPAN (see Resources) or by using the CPAN module that comes with modern installations of Perl.
Text::Template allows us to mix Perl and HTML within a file. Everything within curly braces, {}, is considered to be a Perl program. The results of the block's evaluation are inserted into the document in place of its code block. Thus, if we say
<P>This is a first paragraph.</P>
<P>{ 2 + 5; }</P>
<P>This is a second paragraph.</P>
the end user will see
This is a first paragraph.
7
This is a second paragraph.
on his or her screen.
Remember, the result of evaluating a block is not the output from that block, but rather the return result from the final line in the block. So if we say:
<P>This is a first paragraph.</P>
<P>{ print 2 + 5;}</P>
<P>This is a second paragraph.</P>
we will see
7
This is a first paragraph.
1
This is a second paragraph.
The “7” comes from evaluating “print”, while the “1” is the returned value from the final line of the embedded Perl block.
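The difference between a block's return value and its printed output can be tried out without installing anything. The following toy stand-in is not Text::Template and does not reproduce its implementation; it is a simplified sketch, with the hypothetical names eval_block and fill_in_toy, that replaces each brace-delimited block (nested braces are not handled) with the value of Perl's string eval.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Evaluate one block of Perl and return the value of its final statement.
sub eval_block {
    my ($code) = @_;
    my $result = eval $code;
    return defined $result ? $result : '';
}

# Replace each {...} block (no nested braces) with its evaluated value.
sub fill_in_toy {
    my ($template) = @_;
    $template =~ s/\{([^{}]*)\}/eval_block($1)/ge;
    return $template;
}

# The block's value, 7, lands inside the paragraph.
print fill_in_toy("<P>{ 2 + 5; }</P>\n");

# Here print sends 7 to standard output while the template is still being
# filled in; the value that lands inside the paragraph is print's return
# value, 1.
print fill_in_toy("<P>{ print 2 + 5; }</P>\n");
```

Because the blocks are run through eval, a scheme like this should only ever be pointed at templates you wrote yourself.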
In order to use Text::Template, we will need to write a small CGI program that invokes the module and parses the indicated file. The program template.pl, shown in Listing 4, does the trick simply and easily. If we install it in our CGI directory, we can then go to /cgi-bin/template.pl?file.tmpl, and the template file.tmpl will be interpreted by template.pl, then returned to the user's browser.
In order to deal with potential security problems from people specifying unusual file names, we remove any occurrences of the string “../” and ensure all file names start in the directory /usr/local/apache/share/templates/. You may want to define a different templates directory on your system.
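Since Listing 4 is not shown here, the following is a hedged sketch of a program in the spirit of template.pl rather than the listing itself. The subroutine names template_path and render are assumptions, as is the decision to strip “../” repeatedly: a single pass would turn a name such as “....//secret” back into “../secret”. The sketch loads Text::Template only when it actually renders a file, using the module's documented new(TYPE => 'FILE', SOURCE => ...) constructor and fill_in method.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Templates directory named in the text; change it to suit your system.
my $TEMPLATE_DIR = '/usr/local/apache/share/templates/';

# Turn a requested name into a path that stays inside $TEMPLATE_DIR.
sub template_path {
    my ($name) = @_;
    return unless defined $name && length $name;
    1 while $name =~ s{\.\./}{}g;   # repeat so "....//" cannot rebuild "../"
    $name =~ s{^/+}{};              # drop leading slashes
    return unless length $name;
    return $TEMPLATE_DIR . $name;
}

# Fill in the template file, returning its text (or nothing on failure).
sub render {
    my ($path) = @_;
    require Text::Template;         # loaded only when we really render
    my $template = Text::Template->new(TYPE => 'FILE', SOURCE => $path)
        or return;
    return $template->fill_in();
}

# Only produce output when actually running under a web server.
if ($ENV{GATEWAY_INTERFACE}) {
    my $path = template_path($ENV{QUERY_STRING});
    my $html = defined $path ? render($path) : undef;
    print "Content-type: text/html\n\n";
    print defined $html ? $html : "<P>Template not available.</P>\n";
}
```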
Now that we have our templating system in place, we can rewrite our page as a template, in which the cookie is set and “new” labels are printed only when necessary. The final result is shown in Listing 5.
We create the dynamic META tag with the following code:
<META HTTP-EQUIV="Set-Cookie" CONTENT="RecentVisitor=1; expires={scalar localtime(time + 604800)}; path=/">
As you can see, this META tag contains a small Perl block that returns an appropriate expiration date. The date is set to be 604,800 seconds in the future, better known as “one week from today”.
We retrieve the cookie later in the template, just before deciding whether to print a “new” tag:
use CGI;
my $query = new CGI;
my $visited_recently = $query->cookie('RecentVisitor');
$outputstring .= "<font color=\"red\">(New!)</font>\n" unless $visited_recently;
$outputstring;
Notice how we can import the CGI module within a block of the template. We can then create an instance of CGI and use it to retrieve one or more cookies. We don't use CGI.pm to print output to the user's browser, since that will be done by the templating system.
It would seem that my obsession with “new” labels has led us in all sorts of new and interesting directions. This month, we looked at cookies, server-side includes, CGI programs and HTML/Perl templates. While templates did reduce the load on the server somewhat, they still require the invocation of a CGI program, which is inherently more costly than serving a straight HTML file.
One solution to this problem is to make the labeling an inherent part of our web server. If the server could keep track of the cookies and the labeling, things would work fine. Most people don't want to mess around with their web server software to that degree; Apache might be free software that allows you to mess around with the source, but few of us are that daring.
However, as we have seen in previous installments of ATF, we can easily modify parts of the server with Perl, rather than with C, by installing the mod_perl module. While such a system still requires some code for each retrieved document, the overhead for running a Perl subroutine inside Apache via mod_perl is much lower than that required for an external CGI program.
Next month, we will examine a mod_perl module that goes through the links on a page and adds a “new” label for each item new to the user accessing the site. When we're done, we will have made the web a bit better and easier for pedants like me and for users who should not have to remember when they last visited a site.
All listings referred to in this article are available by anonymous download in the file ftp.linuxjournal.com/pub/lj/listings/issue63/3473.tgz.