CGI Programming with Perl

Chapter 11. Maintaining State

Contents:

Query Strings and Extra Path Information
Hidden Fields
Client-Side Cookies

HTTP is a stateless protocol. As we discussed in Chapter 2, "The Hypertext Transport Protocol", HTTP defines how web clients and servers communicate with each other to provide documents and resources to the user. Unfortunately, as we noted in that discussion (see Section 2.5.1, "Identifying Clients"), HTTP does not provide a direct way of identifying clients in order to keep track of them across multiple page requests. There are ways to track users through indirect methods, however, and we'll explore these methods in this chapter. Web developers refer to the practice of tracking users as maintaining state. The series of interactions that a particular user has with our site is a session. The information that we collect for a user is session information.

Why would we want to maintain state? If you value privacy, the idea of tracking users may raise concerns. It is true that tracking users can be used for questionable purposes. However, there are legitimate instances when you must track users. Take an online store: in order to allow a customer to browse products, add some to a shopping cart, and then check out by purchasing the selected items, the server must maintain a separate shopping cart for each user. In this case, collecting selected items in a user's session information is not only acceptable, but expected.

Before we discuss methods for maintaining state, let's briefly review what we learned earlier about the HTTP transaction model. This will provide a context to understand the options we present later. Each and every HTTP transaction follows the same general format: a request from a client followed by a response from the server. Each of these is divided into a request/response line, header lines, and possibly some message content. For example, if you open your favorite browser and type in the URL:

http://www.oreilly.com/catalog/cgi2/index.html

your browser then connects to www.oreilly.com on port 80 (the default port for HTTP) and issues a request for /catalog/cgi2/index.html. On the server side, because the web server is bound to port 80, it answers any requests that are issued through that port. Here is how the request would look from a browser supporting HTTP 1.0:

GET /catalog/cgi2/index.html HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/png, */*
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
User-Agent: Mozilla/4.5 (Macintosh; I; PPC)

The browser uses the GET request method to ask for the document, specifies the HTTP protocol to use, and supplies a number of headers to pass information about itself and the format of the content it will accept. Because the request is sent via GET and not POST, the browser is not passing any content to the server.

Here is how the server would respond to the request:

HTTP/1.0 200 OK
Date: Sat, 18 Mar 2000 20:35:35 GMT
Server: Apache/1.3.9 (Unix)
Last-Modified: Wed, 20 May 1998 14:59:42 GMT
Content-Length: 141
Content-Type: text/html

(content)
...

In Version 1.0 of HTTP, the server returns the requested document and then closes the connection. Yes, that's right: the server doesn't keep the connection open between itself and the browser. So, if you click on a link on the returned page, the browser issues another request to the server, and so on. As a result, the server has no way of knowing that it is you who is requesting each successive document. This is what we mean by stateless, or nonpersistent; the server doesn't maintain or store any request-related information from one transaction to the next. You do know the network address of the client who is connecting to you, but as you'll recall from our earlier discussion of proxies (see Section 2.5, "Proxies"), multiple users may be making connections via the same proxy.

You may be waiting to hear what's changed in Version 1.1 of HTTP. In fact, a connection may remain open across multiple requests, although the request and response cycle is the same as above. However, you cannot rely on the network connection remaining open since the connection can be closed or lost for any number of reasons, and in any event CGI has not been modified to allow you to access any information that would associate requests made across the same connection. So in HTTP 1.1 as in HTTP 1.0, the job of maintaining state falls to us.

Consider our shopping cart example: it should allow consumers to navigate through many pages and selectively place items in their carts. A consumer typically places an item in a cart by selecting a product, entering the desired quantity, and submitting the form. This action sends the data to the web server, which, in turn, invokes the requested CGI application. To the server, it's simply another request. So, it's up to the application to not only keep track of the data between multiple invocations, but also to identify the data as belonging to a particular consumer.

In order to maintain state, we must get the client to pass us some unique identifier with each request. As you can see from the HTTP request example earlier, there are only three different ways the client can pass information to us: via the request line, via a header line, or via the content (in the case of a POST request). Thus, in order to maintain state, we can have the client pass a unique identifier to us via any of these methods. In fact, the techniques we'll explore will cover all three of these ways:

Query strings and extra path information

It's possible to embed an identifier in the query string or as extra path information within a document's URL. As users traverse through a site, a CGI application generates documents on the fly, passing the identifier from document to document. This allows us to keep track of all the documents requested by each user, and in the order in which they were requested. The browser sends this information to us via the request line.

Hidden fields

Hidden form fields allow us to embed "invisible" name-value information within forms that the user cannot see without viewing the source of the HTML page. Like typical form fields and values, this information is sent to the CGI application when the user presses the submit button. We generally use this technique to maintain the user's selections and preferences when multiple forms are involved. We'll also look at how CGI.pm can do much of this work for us. The browser sends this information to us via the request line or via the message content depending on whether the request was GET or POST, respectively.

Client-side cookies

All modern browsers support client-side cookies, which allow us to store information on the client machine and have the browser pass it back to us with each request. We can use cookies to store semi-permanent data on the client side, which will be available to us whenever the user requests future resources from the server. Cookies are sent back to us by the client in the Cookie HTTP header line.
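As a rough sketch of where each technique's identifier arrives, the routine below looks in all three places using only the standard CGI environment variables. The field and cookie name session_id, the function name find_session_id, and the "/.identifier/" path format are assumptions made for this illustration. It also skips URL decoding, which a real script would normally leave to CGI.pm.

```perl
use strict;
use warnings;

# Return ( $id, $source ) for the first identifier found, or an empty list.
# $env is a reference to a hash shaped like %ENV; $body is the raw content
# of a POST request (empty or undefined for GET).
sub find_session_id {
    my( $env, $body ) = @_;
    $body = "" unless defined $body;

    # 1. Request line: extra path information such as /.abc123/faq.html
    if ( defined $env->{PATH_INFO} and
         $env->{PATH_INFO} =~ m|^/\.([\w.-]+)/| ) {
        return ( $1, "path" );
    }

    # 2. Request line (GET query string) or message content (POST body);
    #    hidden form fields arrive this way
    my $query = defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : "";
    foreach my $pair ( split /[&;]/, "$query&$body" ) {
        my( $name, $value ) = split /=/, $pair, 2;
        next unless defined $name and defined $value;
        return ( $value, "field" ) if $name eq "session_id";
    }

    # 3. Header line: the Cookie header reaches CGI scripts as HTTP_COOKIE
    my $cookies = defined $env->{HTTP_COOKIE} ? $env->{HTTP_COOKIE} : "";
    foreach my $pair ( split /;\s*/, $cookies ) {
        my( $name, $value ) = split /=/, $pair, 2;
        next unless defined $name and defined $value;
        return ( $value, "cookie" ) if $name eq "session_id";
    }

    return;
}
```

A CGI script would call this as find_session_id( \%ENV, $posted_content ) and generate a new identifier when it returns nothing.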

The advantages and disadvantages of each technique are summarized in Table 11-1. We will review each technique separately, so if some of the points in the table are unclear you may want to refer back to this table after reading the sections below. In general, though, you should note that client-side cookies are the most powerful option for maintaining state, but they require something from the client. The other options work regardless of the client, but both are limited in the number of pages across which we can track the user.

Table 11-1. Summary of the Techniques for Maintaining State

Technique | Scope | Reliability and Performance | Client Requirements
--- | --- | --- | ---
Query strings and extra path information | Can be configured to apply to a particular group of pages or an entire web site, but state information is lost if the user leaves the web site and later returns | Difficult to reliably parse all links in a document; significant performance cost to pass static content through CGI scripts | Does not require any special behavior from the client
Hidden fields | Only works across a series of form submissions | Easy to implement; does not affect performance | Does not require any special behavior from the client
Cookies | Works everywhere, even if the user visits another site and later returns | Easy to implement; does not affect performance | Requires that the client supports (and accepts) cookies

11.1. Query Strings and Extra Path Information

We've passed query information to CGI applications many times throughout this book. In this section, we'll use queries in a slightly less obvious manner, namely to track a user's browsing trail while traversing from one document to the next on the server.

In order to do this, we'll have a CGI script handle every request for a static HTML page. The CGI script will check whether the request URL contains an identifier matching our format. If it doesn't, the script assumes that this is a new user and generates a new identifier. The script then parses the requested HTML document by looking for links to other URLs within our web site and appending a unique identifier to each URL. Thus, the identifier will be passed on with future requests and propagated from document to document. Of course, if we want to track users across CGI applications then we'll also need to parse the output of these CGI scripts. The simplest way to accomplish both goals is to create a general module that handles reading the identifier and parsing the output. This way, we need to write our code only once, and both the script that serves our HTML pages and all our other CGI scripts can share it.

As you may have guessed, this is not a very efficient process, since a request for each and every HTML document triggers a CGI application to be executed. Tools such as mod_perl and FastCGI, discussed in Chapter 17, "Efficiency and Optimization", help because they avoid starting a new Perl interpreter for every request: mod_perl embeds the interpreter in the web server, and FastCGI keeps the script running as a persistent process.

Another strategy to help improve performance is to perform some processing in advance. If you are willing to preprocess your documents, you can reduce the amount of work that happens when the customer accesses the document. The majority of the work involved in parsing a document and replacing links is identifying the links. HTML::Parser is a good module, but the work it does is rather complex. If, during preprocessing, you parse the links and insert a special placeholder keyword rather than an identifier for a particular user, then at request time you need only look for this keyword and do not have to worry about recognizing links. For example, you could parse URLs and add #USERID# as the identifier for each document. The resulting code becomes much simpler. You can effectively handle documents this way:

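sub parse {
    my( $filename, $id ) = @_;
    local *FH;
    
    # The explicit "<" mode keeps special characters in $filename
    # from being interpreted as an open mode
    open FH, "<", $filename or die "Cannot open file: $!";
    
    # Replace the placeholder with this user's identifier, line by line
    while (<FH>) {
        s/#USERID#/$id/g;
        print;
    }
    close FH;
}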
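The preprocessing pass itself might look something like the sketch below, which would be run once over each document ahead of time. It uses a deliberately simple regular expression rather than HTML::Parser, so it only recognizes double-quoted href, src, and action attributes. The function name mark_links, the default /store base path, and the "/.#USERID#/" path layout are assumptions chosen to match the URLs used in this section.

```perl
use strict;
use warnings;

# Insert the #USERID# placeholder after the base path in every
# double-quoted href, src, or action attribute that points under $base.
# Links to other locations are left exactly as they were.
sub mark_links {
    my( $html, $base ) = @_;
    $base = "/store" unless defined $base;

    $html =~ s{\b(href|src|action)(\s*=\s*")\Q$base\E/}
              {$1$2$base/.#USERID#/}gi;

    return $html;
}
```

Because the placeholder is a fixed string, the request-time script no longer needs to understand HTML at all; it only substitutes the identifier.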

However, when a user traverses through a set of static HTML documents, CGI applications are typically not involved. If that's the case, how do we pass session information from one HTML document to the next, and be able to keep track of it on the server?

The answer to our problem is to configure the server such that when the user requests an HTML document, the server executes a CGI application. The application would then be responsible for transparently embedding special identifying information (such as a query string) into all the hyperlinks within the requested HTML document and returning the newly created content to the browser.

Let's look at how we're actually going to implement the application. It's only a two-step process. To reiterate, the problem we're trying to solve is to determine what documents a particular user requests and how much time he or she spends viewing them. First, we need to identify the set of documents for which we want to obtain the users' browsing history. Once we do that, we simply move these documents to a specific directory under the web server's document root directory.

Next, we need to configure the web server to execute a CGI application each and every time a user requests a document from this directory. We'll use the Apache web server for this example, but the configuration details are very similar for other web servers, as well.

We simply need to insert the following directives into Apache's access configuration file, access.conf:

<Directory /usr/local/apache/htdocs/store>
    AddType    text/html  .html
    AddHandler Tracker    .html
    Action     Tracker    /cgi/query_track.cgi
</Directory>

When a user requests a document from the /usr/local/apache/htdocs/store directory, Apache executes the query_track.cgi application, passing to it the relative URL of the requested document as extra path information. Here's an example. When the user requests a document from the directory for the first time:

http://localhost/store/index.html

the web server will execute query_track.cgi, like so:

http://localhost/cgi/query_track.cgi/store/index.html

The application uses the PATH_TRANSLATED environment variable to get the full path of index.html. Then, it opens the file, creates a new identifier for the user, embeds it into each link within the document that points back into the /store tree, and returns the modified HTML stream to the browser. The application can also log each transaction to a special log file, which you can use to analyze users' browsing habits at a later time.
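The logging step is not part of the listings in this section, so here is one minimal way it could be done. The function name log_request and the one-line-per-request format (epoch seconds, identifier, requested path) are assumptions for illustration. The file is locked while writing because several CGI processes may be appending at the same time.

```perl
use strict;
use warnings;
use Fcntl qw( :flock );

# Append one line per request: epoch seconds, identifier, requested path.
sub log_request {
    my( $log_file, $id, $path ) = @_;

    open my $log, ">>", $log_file  or die "Cannot open $log_file: $!";
    flock( $log, LOCK_EX )         or die "Cannot lock $log_file: $!";
    print {$log} join( " ", time, $id, $path ), "\n";
    close $log                     or die "Cannot close $log_file: $!";
}
```

Note that under taint mode the log file name must come from the script itself and not from the request.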

If you're curious as to what a modified URL looks like, here's an example:

http://localhost/store/.CC7e2BMb_H6UdK9KfPtR1g/faq.html

The identifier is a modified Base64 MD5 message digest, computed using various pieces of information from the request. The code to generate it looks like this:

use Digest::MD5 qw( md5_base64 );

# md5_base64 is a plain function that digests the concatenation
# of its arguments
my $remote = $ENV{REMOTE_ADDR} . $ENV{REMOTE_PORT};
my $id = md5_base64( time, $$, $remote );
$id =~ tr|+/=|-_.|;  # Make non-word chars URL-friendly

This does a good job of generating a unique key for each request. However, it is not intended to create keys that cannot be guessed, because the time, the process ID, and the client's address are all predictable or easy to discover. If you are generating session identifiers that provide access to sensitive data, then you should use a more sophisticated method to generate an identifier.
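One possible approach for harder-to-guess identifiers, sketched here under the assumption of a Unix-like system that provides /dev/urandom, is to build the identifier from random bytes supplied by the operating system rather than from request data. The function name random_id is hypothetical, and this is an illustration of the idea rather than a complete session-security scheme.

```perl
use strict;
use warnings;
use MIME::Base64 qw( encode_base64 );

# Build a 22-character identifier from 16 bytes (128 bits) of
# operating-system randomness instead of guessable request data.
sub random_id {
    open my $rand, "<", "/dev/urandom" or die "Cannot open /dev/urandom: $!";
    binmode $rand;
    my $bytes;
    my $got = read( $rand, $bytes, 16 );
    close $rand;
    die "Short read from /dev/urandom\n" unless defined $got and $got == 16;

    my $id = encode_base64( $bytes, "" );  # "" suppresses the trailing newline
    $id =~ tr|+/|-_|;                      # Make non-word chars URL-friendly
    $id =~ s/=+$//;                        # Drop the Base64 padding
    return $id;
}
```

The result has the same length and character set as the MD5-based identifier, so it fits the same URL format.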

If you use Apache, you do not have to generate a unique identifier yourself if you build Apache with the mod_unique_id module. It creates a unique identifier for each request, which is available to your CGI script as $ENV{UNIQUE_ID}. mod_unique_id is included in the Apache distribution but not compiled by default.

Let's look at how we could construct code to parse HTML documents and insert identifiers. Example 11-1 shows a Perl module that we use to parse the request URL and HTML output.

Example 11-1. CGIBook::UserTracker.pm

#!/usr/bin/perl -wT

#/----------------------------------------------------------------
# UserTracker Module
# 
# Inherits from HTML::Parser
# 
# 

package CGIBook::UserTracker;

push @ISA, "HTML::Parser";

use strict;
use URI;
use HTML::Parser;

1;


#/----------------------------------------------------------------
# Public methods
# 

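sub new {
    my( $class, $path ) = @_;
    my $base = defined( $path ) ? $path : "";
    my $id;
    
    # Look for an identifier segment such as "/.CC7e2BMb_H6UdK9KfPtR1g/"
    # directly after the base path; if found, strip it from PATH_INFO.
    # "@" is allowed because mod_unique_id identifiers may contain it.
    if ( $ENV{PATH_INFO} and
         $ENV{PATH_INFO} =~ s|^(\Q$base\E)/\.([a-z0-9_.\@-]+)/|$1/|i ) {
        $id = $2;
    }
    else {
        $id = unique_id(  );
    }
    
    my $self = $class->SUPER::new(  );
    $self->{user_id}    = $id;
    $self->{base_path}  = $base;
        
    return $self;
}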

sub base_path {
    my( $self, $path ) = @_;
    $self->{base_path} = $path if defined $path;
    return $self->{base_path};
}

sub user_id {
    my $self = shift;
    return $self->{user_id};
}


#/----------------------------------------------------------------
# Internal (private) subs
# 

sub unique_id {
    # Use Apache's mod_unique_id if available
    return $ENV{UNIQUE_ID} if exists $ENV{UNIQUE_ID};
    
    require Digest::MD5;
    
    my $remote = $ENV{REMOTE_ADDR} . $ENV{REMOTE_PORT};
    
    # Note this is intended to be unique, and not unguessable
    # It should not be used for generating keys to sensitive data
    # (md5_base64 is a plain function, so it is called directly)
    my $id = Digest::MD5::md5_base64( time, $$, $remote );
    $id =~ tr|+/=|-_.|;  # Make non-word chars URL-friendly
    return $id;
}

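sub encode {
    my( $self, $url ) = @_;
    my $uri  = new URI( $url, "http" );
    my $id   = $self->user_id(  );
    my $base = $self->base_path;
    
    # Only rewrite links whose path lies under the base path. Anything
    # else (mailto: links, document-relative links such as "faq.html",
    # paths elsewhere on the site) is returned unchanged rather than
    # treated as a fatal error.
    my $path = $uri->path;
    $path =~ s|^\Q$base\E(?=/)|$base/.$id| or return $url;
    $uri->path( $path );
    
    return $uri->as_string;
}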


#/----------------------------------------------------------------
# Subs to implement HTML::Parser callbacks
# 

sub start {
    my( $self, $tag, $attr, $attrseq, $origtext ) = @_;
    my $new_text = $origtext;
    
    my %relevant_pairs = (
        frame       => "src",
        a           => "href",
        area        => "href",
        form        => "action",
# Uncomment these lines if you want to track images too
#        img         => "src",
#        body        => "background",
    );
    
    while ( my( $rel_tag, $rel_attr ) = each %relevant_pairs ) {
        if ( $tag eq $rel_tag and $attr->{$rel_attr} ) {
            $attr->{$rel_attr} = $self->encode( $attr->{$rel_attr} );
            my @attribs = map { "$_=\"$attr->{$_}\"" } @$attrseq;
            $new_text = "<$tag @attribs>";
        }
    }
    
    # Meta refresh tags have a different format, handled separately
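    # (the http-equiv value is compared case-insensitively, and a refresh
    # with no URL part is left alone)
    if ( $tag eq "meta" and
         lc( $attr->{"http-equiv"} || "" ) eq "refresh" ) {
        my( $delay, $url ) = split /;\s*URL=/i, $attr->{content} || "", 2;
        if ( defined $url ) {
            $attr->{content} = "$delay;URL=" . $self->encode( $url );
            my @attribs = map { "$_=\"$attr->{$_}\"" } @$attrseq;
            $new_text = "<$tag @attribs>";
        }
    }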
    
    print $new_text;
}

sub declaration {
    my( $self, $decl ) = @_;
    # HTML::Parser passes the declaration without its "<!" and ">"
    print "<!$decl>";
}

sub text {
    my( $self, $text ) = @_;
    print $text;
}

sub end {
    my( $self, $tag ) = @_;
    print "</$tag>";
}

sub comment {
    my( $self, $comment ) = @_;
    print "<!--$comment-->";
}

Example 11-2 shows the CGI application that we use to process static HTML pages.

Example 11-2. query_track.cgi

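#!/usr/bin/perl -wT

use strict;
use CGIBook::UserTracker;

local *FILE;

# Pass the base path to the constructor so that it can recognize an
# identifier already present in the request URL
my $track = new CGIBook::UserTracker( "/store" );

# PATH_TRANSLATED still includes the identifier segment, if there was
# one; remove it to get the real location of the document
my $id = $track->user_id;
my $requested_doc = $ENV{PATH_TRANSLATED};
$requested_doc =~ s|/\.\Q$id\E/|/|;

unless ( -e $requested_doc ) {
    print "Location: /errors/not_found.html\n\n";
    exit;  # Stop here; otherwise the open below would fail
}

open FILE, $requested_doc or die "Failed to open $requested_doc: $!";

my $doc = do {
    local $/ = undef;
    <FILE>;
};

close FILE;

# This assumes we're only tracking HTML files:
print "Content-type: text/html\n\n";
$track->parse( $doc );
$track->eof;  # Signal end of input so the parser flushes any buffered text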

Once we have inserted the identifier into all the URLs, we simply send the modified content to the standard output stream, along with the content header.
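With identifiers flowing through every request, the browsing history mentioned earlier (which documents each user requested, and for how long) can be reconstructed from a request log. The sketch below assumes a hypothetical log with one "epoch-seconds identifier path" line per request, already in chronological order. The function name browsing_trails and the log format are illustrations, not part of the listings above.

```perl
use strict;
use warnings;

# Given log lines of the form "epoch-seconds identifier path", return a hash
# mapping each identifier to its trail: a list of [ path, seconds ] pairs in
# request order. The time spent on the last page of a trail is unknown (undef).
sub browsing_trails {
    my @lines = @_;
    my( %trail, %last_time );

    foreach my $line ( @lines ) {
        my( $time, $id, $path ) = split " ", $line;
        next unless defined $path;

        # The gap since this user's previous request is taken as the
        # time spent on the previous page
        if ( $trail{$id} ) {
            $trail{$id}[-1][1] = $time - $last_time{$id};
        }
        push @{ $trail{$id} }, [ $path, undef ];
        $last_time{$id} = $time;
    }
    return %trail;
}
```

The time between two requests is only an upper bound on viewing time, since the server cannot tell whether the user was actually reading the page.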

Now that we've looked at how to maintain state between views of multiple HTML documents, our next step is to discuss persistence when using multiple forms. An online store, for example, is typically broken into multiple pages. We need to be able to identify users as they fill out each page. We'll look at techniques for solving such problems in the next section.




Copyright © 2001 O'Reilly & Associates. All rights reserved.