
5.2. Handling Input with CGI.pm

CGI.pm primarily handles two separate tasks: it reads and parses input from the user, and it provides a convenient way to return HTML output. Let's first look at how it collects input.

5.2.1. Environment Information

CGI.pm provides many methods to get information about your environment. Of course, when you use CGI.pm, all of your standard CGI environment variables are still available in Perl's %ENV hash, but CGI.pm also makes most of these available via method calls. It also provides some unique methods. Table 5-1 shows how CGI.pm's functions correspond to the standard CGI environment variables.

Table 5-1. CGI.pm Environment Methods and CGI Environment Variables

CGI.pm Method              CGI Environment Variable
-------------------------  ------------------------
auth_type                  AUTH_TYPE
Not available              CONTENT_LENGTH
content_type               CONTENT_TYPE
Not available              DOCUMENT_ROOT
Not available              GATEWAY_INTERFACE
path_info                  PATH_INFO
path_translated            PATH_TRANSLATED
query_string               QUERY_STRING
remote_addr                REMOTE_ADDR
remote_host                REMOTE_HOST
remote_ident               REMOTE_IDENT
remote_user                REMOTE_USER
request_method             REQUEST_METHOD
script_name                SCRIPT_NAME
self_url                   Not available
server_name                SERVER_NAME
server_port                SERVER_PORT
server_protocol            SERVER_PROTOCOL
server_software            SERVER_SOFTWARE
url                        Not available
Accept                     HTTP_ACCEPT
http("Accept-charset")     HTTP_ACCEPT_CHARSET
http("Accept-encoding")    HTTP_ACCEPT_ENCODING
http("Accept-language")    HTTP_ACCEPT_LANGUAGE
raw_cookie                 HTTP_COOKIE
http("From")               HTTP_FROM
virtual_host               HTTP_HOST
referer                    HTTP_REFERER
user_agent                 HTTP_USER_AGENT
https                      HTTPS
https("Cipher")            HTTPS_CIPHER
https("Keysize")           HTTPS_KEYSIZE
https("SecretKeySize")     HTTPS_SECRETKEYSIZE

Most of these CGI.pm methods take no arguments and return the same value as the corresponding environment variable. For example, to get the additional path information passed to your CGI script, you can use the following method:

my $path = $q->path_info;

This is the same information that you could also get this way:

my $path = $ENV{PATH_INFO};

However, a few methods differ or have features worth noting. Let's take a look at these.

5.2.1.1. Accept

As a general rule, if a CGI.pm method has the same name as a built-in Perl function or keyword (e.g., accept or tr), then the CGI.pm method is capitalized. Although there would be no collision if CGI.pm were available only via an object-oriented syntax, the collision creates problems for people who use it via the standard syntax. accept was originally lowercase, but it was renamed to Accept in version 2.44 of CGI.pm, and the new name affects both syntaxes.

Unlike the other methods that take no arguments and simply return a value, Accept can also be given a content type and it will evaluate to true or false depending on whether that content type is acceptable according to the HTTP Accept header:

if ( $q->Accept( "image/png" ) ) {
    .
    .
    .

Keep in mind that most browsers today send */* in their Accept header. This matches anything, so using the Accept method in this manner is not especially useful. For new file formats like image/png, it is best to get the values for the HTTP header and perform the check yourself, ignoring wildcard matches (this is unfortunate, since it defeats the purpose of wildcards):

my @accept = $q->Accept;
if ( grep $_ eq "image/png", @accept ) {
    .
    .
    .
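The same explicit check can be made against the raw header value. The following is a minimal sketch, not part of CGI.pm: the helper name explicitly_accepts is hypothetical, and it assumes you pass it the header as a string (for example, $ENV{HTTP_ACCEPT}). It ignores quality values, so a type listed with q=0 would still count as accepted.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if $type is listed by name in the
# Accept header value, ignoring wildcard entries such as */* or image/*
sub explicitly_accepts {
    my( $header, $type ) = @_;
    return 0 unless defined $header;

    foreach my $range ( split /\s*,\s*/, $header ) {
        # Drop any parameters, e.g. ";q=0.8"
        ( my $media = $range ) =~ s/\s*;.*$//;
        $media =~ s/^\s+|\s+$//g;
        return 1 if lc( $media ) eq lc( $type );
    }
    return 0;
}

my $header = "text/html, image/png;q=0.9, */*;q=0.1";
print explicitly_accepts( $header, "image/png" ) ? "yes\n" : "no\n";
print explicitly_accepts( $header, "image/gif" ) ? "yes\n" : "no\n";
```

Here image/png is reported as accepted because it is named in the header, while image/gif is not, even though */* would match it.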

5.2.1.2. http

If the http method is called without arguments, it returns the names of the available environment variables that have an HTTP_ prefix. If you call http with an argument, then it will return the value of the corresponding HTTP_ environment variable. When passing an argument to http, the HTTP_ prefix is optional, capitalization does not matter, and hyphens and underscores are interpreted the same. In other words, you can pass the actual HTTP header field name or the environment variable name or even some hybrid of the two, and http will generally figure it out. Here is how you can display all the HTTP_ environment variables your CGI script receives:

#!/usr/bin/perl -wT

use strict;
use CGI;

my $q = new CGI;
print $q->header( "text/plain" );

print "These are the HTTP environment variables I received:\n\n";

foreach ( $q->http ) {
    print "$_:\n";
    print "  ", $q->http( $_ ), "\n";
}
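To make the name matching concrete, here is a rough sketch of how a header field name can be mapped to its environment variable name. This is an illustration only, not CGI.pm's actual code, and the helper name http_env_name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map a header field name, an environment variable
# name, or a hybrid of the two to the HTTP_ environment variable name
sub http_env_name {
    my $name = shift;
    $name =~ tr/-a-z/_A-Z/;            # hyphens to underscores, uppercase
    $name = "HTTP_$name" unless $name =~ /^HTTP_/;
    return $name;
}

foreach my $name ( "Accept-Language", "accept_language", "HTTP_ACCEPT-language" ) {
    print "$name => ", http_env_name( $name ), "\n";
}
```

All three spellings map to HTTP_ACCEPT_LANGUAGE, which is why they are interchangeable as arguments to http.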

5.2.1.3. https

The https method functions similarly to the http method when it is passed a parameter. It returns the corresponding HTTPS_ environment variable. These variables are set by your web server only if you are receiving a secure request via SSL. When https is called without arguments, it returns the value of the HTTPS environment variable, which indicates whether the connection is secure (its values are server-dependent).

5.2.1.4. query_string

The query_string method does not do what you might think since it does not correspond one-to-one with $ENV{QUERY_STRING}. $ENV{QUERY_STRING} holds the query portion of the URL that called your CGI script. query_string, on the other hand, is dynamic, so if you modify any of the query parameters in your script (see Section 5.2.2.1, "Modifying parameters" later in this chapter), then the value returned by query_string will include these new values. If you want to know what the original query string was, then you should refer to $ENV{QUERY_STRING} instead.

Also, if the request method is POST, then query_string returns the POST parameters that were submitted in the content of the request, and ignores any parameters passed to the CGI script via the query string. This means that if you create a form that submits its values via POST to a URL that also contains a query string, you will not be able to access the parameters on the query string via CGI.pm unless you make a slight modification to CGI.pm to tell it to include parameters from the original query string with POST requests. We'll see how to do this in Section 5.2.2.2, "POST and the query string" later in this chapter.

5.2.1.5. self_url

This method does not correspond to a standard CGI environment variable, although you could manually construct it from other environment variables. It provides you with a URL that can call your CGI with the same parameters. The path information is maintained and the query string is set to the value of the query_string method.

Note that this URL is not necessarily the same URL that was used to call your CGI script. Your CGI script may have been called because of an internal redirection by the web server. Also, because all of the parameters are moved to the query string, this new URL is built to be used with a GET request, even if the current request was a POST request.
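As an illustration of constructing such a URL manually from environment variables, here is a minimal sketch. The helper name build_self_url is hypothetical, and it assumes the server sets HTTPS to "on" for secure requests, which is server-dependent. Unlike self_url, it uses the original query string rather than the value of the query_string method, so it does not reflect parameters modified by the script or submitted via POST.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild a URL for the current request from a hash
# of CGI environment variables. Assumes HTTPS is "on" for secure requests.
sub build_self_url {
    my $env    = shift;
    my $secure = defined $env->{HTTPS} && lc( $env->{HTTPS} ) eq "on";
    my $scheme = $secure ? "https" : "http";
    my $host   = $env->{HTTP_HOST} || $env->{SERVER_NAME};
    my $port   = $env->{SERVER_PORT} || ( $secure ? 443 : 80 );

    my $url = "$scheme://$host";
    # Add the port only if it is non-standard and not already in the host
    $url .= ":$port"
        unless $host =~ /:\d+$/ or $port == ( $secure ? 443 : 80 );
    $url .= $env->{SCRIPT_NAME};
    $url .= $env->{PATH_INFO} if defined $env->{PATH_INFO};
    $url .= "?$env->{QUERY_STRING}"
        if defined $env->{QUERY_STRING} and length $env->{QUERY_STRING};
    return $url;
}

my %env = (
    SERVER_NAME  => "localhost",
    SERVER_PORT  => 8080,
    SCRIPT_NAME  => "/cgi/param_list.cgi",
    PATH_INFO    => "/extra",
    QUERY_STRING => "color=red&shade=dark",
);
print build_self_url( \%env ), "\n";
```

In a real script you would pass \%ENV instead of the sample hash.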

5.2.1.6. url

The url method functions similarly to the self_url method, except that it returns a URL to the current CGI script without any parameters, i.e., no path information and an empty query string.

5.2.1.7. virtual_host

The virtual_host method is handy because it returns the value of the HTTP_HOST environment variable, if set, and SERVER_NAME otherwise. Remember that HTTP_HOST is the name of the web server as the browser referred to it, which may differ if multiple domains share the same IP address. HTTP_HOST is available only if the browser supplied the Host HTTP header, added for HTTP 1.1.

5.2.2. Accessing Parameters

param is probably the most useful method CGI.pm provides. It allows you to access the parameters submitted to your CGI script, whether these parameters come to you via a GET request or a POST request. If you call param without arguments, it will return a list of all of the parameter names your script received. If you provide a single argument to it, it will return the value for the parameter with that name. If no parameter with that name was submitted to your script, it returns undef.

It is possible for your CGI script to receive multiple values for a parameter with the same name. This happens when you create two form elements with the same name or you have a select box that allows multiple selections. In this case, param returns a list of all of the values if it is called in a list context and just the first value if it is called in a scalar context. This may sound a little complicated, but in practice it works such that you should end up with what you expect. If you ask param for one value, you will get one value (even if other values were also submitted), and if you ask it for a list, you will always get a list (even if the list contains only one element).
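If the context-dependent behavior is unfamiliar, the following sketch shows the general Perl technique using wantarray. It is not CGI.pm's code: get_values and the %values hash are hypothetical stand-ins for param and the parameters your script received.

```perl
use strict;
use warnings;

# Sample data standing in for submitted parameters
my %values = (
    color => [ "red", "blue" ],
    shade => [ "dark" ],
);

# Hypothetical accessor that mimics the behavior described for param:
# all values in list context, only the first value in scalar context
sub get_values {
    my $name = shift;
    my $list = $values{$name} or return;
    return wantarray ? @$list : $list->[0];
}

my $first = get_values( "color" );     # scalar context
my @all   = get_values( "color" );     # list context
my @one   = get_values( "shade" );     # list context, single element
print "first: $first\n", "all: @all\n", "one: @one\n";
```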

Example 5-1 is a simple example that displays all the parameters your script receives.

Example 5-1. param_list.cgi

#!/usr/bin/perl -wT

use strict;
use CGI;

my $q = new CGI;
print $q->header( "text/plain" );

print "These are the parameters I received:\n\n";

my( $name, $value );

foreach $name ( $q->param ) {
    print "$name:\n";
    foreach $value ( $q->param( $name ) ) {
        print "  $value\n";
    }
}

If you call this CGI script with multiple parameters, like this:

http://localhost/cgi/param_list.cgi?color=red&color=blue&shade=dark

you will get the following output:

These are the parameters I received:

color:
  red
  blue
shade:
  dark

5.2.2.1. Modifying parameters

CGI.pm also lets you add, modify, or delete the value of parameters within your script. To add or modify a parameter, just pass param more than one argument. Using Perl's => operator instead of a comma makes the code easier to read and allows you to omit the quotes around the parameter name, so long as it's a word (i.e., it contains only letters, numbers, and underscores) that does not conflict with a built-in function or keyword:

$q->param( title => "Web Developer" );

You can create a parameter with multiple values by passing additional arguments:

$q->param( hobbies => "Biking", "Windsurfing", "Music" );

To delete a parameter, use the delete method and provide the name of the parameter:

$q->delete( "age" );

You can clear all of the parameters with delete_all :

$q->delete_all;

It may seem odd that you would ever want to modify parameters yourself, since these will typically be coming from the user. Setting parameters is useful for many reasons, but especially when assigning default values to fields in forms. We will see how to do this later in this chapter.

5.2.2.2. POST and the query string

param automatically determines if the request method is POST or GET. If it is POST, it reads any parameters submitted to it from STDIN. If it is GET, it reads them from the query string. It is possible to POST information to a URL that already has a query string. In this case, you have two sources of input data, and because CGI.pm determines what to do by checking the request method, it will ignore the data in the query string.

You can change this behavior if you are willing to edit CGI.pm. In fact, CGI.pm includes comments to help you do this. You can find this block of code in the init subroutine (the line number will vary depending on the version of CGI.pm you have):

if ($meth eq 'POST') {
    $self->read_from_client(\*STDIN,\$query_string,$content_length,0)
        if $content_length > 0;
    # Some people want to have their cake and eat it too!
    # Uncomment this line to have the contents of the query string
    # APPENDED to the POST data.
    # $query_string .= (length($query_string) ? '&' : '') . $ENV{'QUERY_STRING'}
             if defined $ENV{'QUERY_STRING'};
    last METHOD;
}

By removing the pound sign from the beginning of the line indicated, you will be able to use POST and query string data together. Note that the line you would need to uncomment is too long to display on one line in this text, so it has been wrapped to the next line, but it is just one line in CGI.pm.
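If you would rather not edit CGI.pm, one alternative is to parse $ENV{QUERY_STRING} yourself and keep those values separate from the POST parameters. The following is a minimal sketch under that assumption; the helper name parse_query is hypothetical, and it handles only the common URL encoding rules (+ for spaces and %XX escapes).

```perl
use strict;
use warnings;

# Hypothetical helper: parse a query string into a hash of array references
sub parse_query {
    my $query = shift;
    my %params;
    return \%params unless defined $query;

    foreach my $pair ( split /[&;]/, $query ) {
        next unless length $pair;
        my( $name, $value ) = split /=/, $pair, 2;
        $value = "" unless defined $value;
        foreach ( $name, $value ) {
            tr/+/ /;                                # "+" encodes a space
            s/%([0-9a-fA-F]{2})/chr hex $1/ge;      # decode %XX escapes
        }
        push @{ $params{$name} }, $value;
    }
    return \%params;
}

my $url_params = parse_query( "color=red&color=blue&note=hello+world%21" );
foreach my $name ( sort keys %$url_params ) {
    print "$name: @{ $url_params->{$name} }\n";
}
```

In a CGI script handling a POST request, you would call parse_query( $ENV{QUERY_STRING} ) and read the form fields from param as usual.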

5.2.2.3. Index queries

You may receive a query string that contains words that do not comprise name-value pairs. The <ISINDEX> HTML tag, which is not used much anymore, creates a single text field along with a prompt to enter search keywords. When a user enters words into this field and presses Enter, it makes a new request for the same URL, adding the text the user entered as the query string with keywords separated by a plus sign (+), such as this:

http://www.localhost.com/cgi/lookup.cgi?cgi+perl

You can retrieve the list of keywords that the user entered by calling param with "keywords" as the name of the parameter or by calling the separate keywords method:

my @words = $q->keywords;            # these lines do the same thing
my @words = $q->param( "keywords" );

These methods return index keywords only if CGI.pm finds no name-value pair parameters, so you don't have to worry about using "keywords" as the name of an element in your HTML forms; it will work correctly. On the other hand, if you want to POST form data to a URL with a keyword, CGI.pm cannot return that keyword to you. You must use $ENV{QUERY_STRING} to get it.
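For that POST case, a short sketch of extracting the keywords from $ENV{QUERY_STRING} by hand might look like the following. The helper name index_keywords is hypothetical; it treats any query string containing an equals sign as name-value pairs and returns nothing for it.

```perl
use strict;
use warnings;

# Hypothetical helper: extract <ISINDEX>-style keywords from a query string.
# Returns an empty list if the string contains name-value pairs.
sub index_keywords {
    my $query = shift;
    return () if !defined $query or $query =~ /=/;
    my @words = split /\+/, $query;
    s/%([0-9a-fA-F]{2})/chr hex $1/ge foreach @words;   # decode %XX escapes
    return grep { length } @words;
}

my @words = index_keywords( "cgi+perl" );
print "keywords: @words\n";
```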

5.2.2.4. Supporting image buttons as submit buttons

Whether you use <INPUT TYPE="IMAGE" > or <INPUT TYPE="SUBMIT">, the form is still sent to the CGI script. However, with the image button, the name is not transmitted by itself. Instead, the web browser splits an image button name into two separate variables: name.x and name.y.

If you want your program to support image and regular submit buttons interchangeably, it is useful to translate the image button names to normal submit button names. Thus, the main program code can use logic based upon which submit button was clicked even if image buttons later replace them.

To accomplish this, we can use the following code that will set a form variable without the coordinates in the name for each variable that ends in ".x":

foreach ( $q->param ) {
    $q->param( $1, 1 ) if /(.*)\.x$/;
}
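To see the name translation in isolation, here is a small sketch that works on a plain list of parameter names rather than a CGI.pm object. The helper name clicked_buttons is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: given the parameter names a form submitted, return
# the image button names with the trailing ".x" coordinate suffix removed
sub clicked_buttons {
    my @names = @_;
    my %seen;
    return grep { !$seen{$_}++ } map { /^(.+)\.x$/ ? $1 : () } @names;
}

my @buttons = clicked_buttons( "save.x", "save.y", "title" );
print "buttons: @buttons\n";
```

An image button named "save" arrives as save.x and save.y, so the helper reports a single button named save and ignores ordinary fields such as title.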

5.2.3. Exporting Parameters to a Namespace

One of the problems with using a method to retrieve the value of a parameter is that it is more work to embed the value in a string. If you wish to print the value of someone's input, you can use an intermediate variable:

my $user = $q->param( 'user' );
print "Hi, $user!";

Another way to do this is via an odd Perl construct that forces the subroutine to be evaluated as part of an anonymous list:

print "Hi, @{[ $q->param( 'user' ) ]}!";

The first solution is more work and the second can be hard to read. Fortunately, there is a better way. If you know that you are going to need to refer to many output values in a string, you can import all the parameters as variables to a specified namespace:

$q->import_names( "Q" );
print "Hi, $Q::user!";

Parameters with multiple values become arrays in the new namespace, and any characters in a parameter name other than a letter or number become underscores. You must provide a namespace and cannot pass "main", the default namespace, because that might create security risks.

The price you pay for this convenience is increased memory usage because Perl must create an alias for each parameter.

5.2.4. File Uploads with CGI.pm

As we mentioned in the last chapter, it is possible to create a form with a multipart/form-data media type that permits users to upload files via HTTP. We avoided discussing how to handle this type of input then because handling file uploads properly can be quite complex. Fortunately, there's no need for us to do this ourselves: CGI.pm provides a very simple interface for handling file uploads, just as it does for other form input.

You can access the name of an uploaded file with the param method, just like the value of any other form element. For example, if your CGI script were receiving input from the following HTML form:

<FORM ACTION="/cgi/upload.cgi" METHOD="POST" ENCTYPE="multipart/form-data">
  <P>Please choose a file to upload:
  <INPUT TYPE="FILE" NAME="file">
  <INPUT TYPE="SUBMIT">
</FORM>

then you could get the name of the uploaded file this way, by referring to the name of the file input element, in this case "file":

my $file = $q->param( "file" );

The name you receive from this parameter is the name of the file as it appeared on the user's machine when they uploaded it. CGI.pm stores the file as a temporary file on your system, but the name of this temporary file does not correspond to the name you get from this parameter. We will see how to access the temporary file in a moment.

The name supplied by this parameter varies according to platform and browser. Some systems supply just the name of the uploaded file; others supply the entire path of the file on the user's machine. Because path delimiters also vary between systems, it can be a challenge determining the name of the file. The following command appears to work for Windows, Macintosh, and Unix-compatible systems:

my( $file ) = $q->param( "file" ) =~ m|([^/:\\]+)$|;

However, it may strip parts of filenames, since "report 11/3/99" is a valid filename on Macintosh systems and the above command would in this case set $file to "99". Another solution is to replace any characters other than letters, digits, underscores, dashes, and periods with underscores and prevent any files from beginning with periods or dashes:

my $file = $q->param( "file" );
$file =~ s/([^\w.-])/_/g;
$file =~ s/^[-.]+//;

The problem with this is that Netscape's browsers on Windows send the full path to the file as the filename. Thus, $file may be set to something long and ugly like "C_ _ _Windows_Favorites_report.doc".

You could try to sort out the behaviors of the different operating systems and browsers, check for the user's browser and operating system, and then treat the filename appropriately, but that would be a very poor solution. You are bound to miss some combinations, you would constantly need to update it, and one of the greatest advantages of the Web is that it works across platforms; you should not build any limitations into your solutions.

So the simple, obvious solution is actually nontechnical. If you do need to know the name of the uploaded file, just add another text field to the form allowing the user to enter the name of the file they are uploading. This has the added advantage of allowing a user to provide a different name than the file has, if appropriate. The HTML form looks like this:

<FORM ACTION="/cgi/upload.cgi" METHOD="POST" ENCTYPE="multipart/form-data">
  <P>Please choose a file to upload:
  <INPUT TYPE="FILE" NAME="file">
  <P>Please enter the name of this file:
  <INPUT TYPE="TEXT" NAME="filename">
  <INPUT TYPE="SUBMIT">
</FORM>

You can then get the name from the text field, remembering to strip out any odd characters:

my $filename = $q->param( "filename" );
$filename =~ s/([^\w.-])/_/g;
$filename =~ s/^[-.]+//;
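The two substitutions can be wrapped in a small subroutine so their effect on awkward names is easy to check. This is a sketch for experimentation; the name sanitize_filename is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two substitutions
sub sanitize_filename {
    my $name = shift;
    return "" unless defined $name;
    $name =~ s/[^\w.-]/_/g;     # anything unusual becomes an underscore
    $name =~ s/^[-.]+//;        # no leading periods or dashes
    return $name;
}

foreach my $raw ( "report 11/3/99", "..hidden", "C:\\Windows\\Favorites\\report.doc" ) {
    print "$raw => ", sanitize_filename( $raw ), "\n";
}
```

Note that a name consisting only of periods and dashes becomes an empty string, so the result should still be checked before it is used.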

So now that we know how to get the name of the file uploaded, let's look at how we get at the content. CGI.pm creates a temporary file to store the contents of the upload; you can get a file handle for this file by passing the name of the file input element, in this case "file", to the upload method as follows:

my $file = $q->param( "file" );
my $fh   = $q->upload( "file" );

The upload method was added to CGI.pm in Version 2.47. Prior to this you could use the value returned by param (in this case $file) as a file handle in order to read from the file; if you use it as a string it returns the name of the file. This actually still works, but there are conflicts with strict mode and other problems, so upload is the preferred way to get a file handle now. Be sure that you pass upload the name of the form field, and not a filename (e.g., the name returned by param, the name the user supplied, the name with nonalphanumeric characters replaced with underscores, etc.).

Note that transfer errors are much more common with file uploads than with other forms of input. If the user presses the Stop button in the browser as the file is uploading, for example, CGI.pm will receive only a portion of the uploaded file. Because of the format of multipart/form-data requests, CGI.pm will recognize that the transfer is incomplete. You can check for errors such as this by using the cgi_error method after creating a CGI.pm object. It returns the HTTP status code and message corresponding to the error, if applicable, or an empty string if no error has occurred. For instance, if the Content-length of a POST request exceeds $CGI::POST_MAX, then cgi_error will return "413 Request entity too large". As a general rule, you should always check for an error when you are recording input on the server. This includes file uploads and other POST requests. It doesn't hurt to check for an error with GET requests either.

Example 5-2 provides the complete code, with error checking, to receive a file upload via our previous HTML form.

Example 5-2. upload.cgi

#!/usr/bin/perl -wT

use strict;
use CGI;
use Fcntl qw( :DEFAULT :flock );

use constant UPLOAD_DIR     => "/usr/local/apache/data/uploads";
use constant BUFFER_SIZE    => 16_384;
use constant MAX_FILE_SIZE  => 1_048_576;       # Limit each upload to 1 MB
use constant MAX_DIR_SIZE   => 100 * 1_048_576; # Limit total uploads to 100 MB
use constant MAX_OPEN_TRIES => 100;

$CGI::DISABLE_UPLOADS   = 0;
$CGI::POST_MAX          = MAX_FILE_SIZE;

my $q = new CGI;
$q->cgi_error and error( $q, "Error transferring file: " . $q->cgi_error );

my $file      = $q->param( "file" )     || error( $q, "No file received." );
my $filename  = $q->param( "filename" ) || error( $q, "No filename entered." );
my $fh        = $q->upload( "file" );
my $buffer    = "";

if ( dir_size( UPLOAD_DIR ) + $ENV{CONTENT_LENGTH} > MAX_DIR_SIZE ) {
    error( $q, "Upload directory is full." );
}

# Allow letters, digits, periods, underscores, dashes
# Convert anything else to an underscore
$filename =~ s/[^\w.-]/_/g;
if ( $filename =~ /^(\w[\w.-]*)/ ) {
    $filename = $1;
}
else {
    error( $q, "Invalid file name; files must start with a letter or number." );
}

# Open output file, making sure the name is unique
my $tries = 0;
until ( sysopen OUTPUT, UPLOAD_DIR . "/$filename", O_WRONLY | O_CREAT | O_EXCL ) {
    ++$tries >= MAX_OPEN_TRIES and error( $q, "Unable to save your file." );
    $filename =~ s/(\d*)(\.\w+)?$/ ( $1 || 0 ) + 1 . ( defined $2 ? $2 : "" ) /e;
}

# This is necessary for non-Unix systems; does nothing on Unix
binmode $fh;
binmode OUTPUT;

# Write contents to output file
while ( read( $fh, $buffer, BUFFER_SIZE ) ) {
    print OUTPUT $buffer;
}

close OUTPUT;


sub dir_size {
    my $dir = shift;
    my $dir_size = 0;
    
    # Loop through files and sum the sizes; doesn't descend down subdirs
    opendir DIR, $dir or die "Unable to open $dir: $!";
    while ( defined( my $file = readdir DIR ) ) {
        $dir_size += ( -s "$dir/$file" ) || 0;
    }
    closedir DIR;
    return $dir_size;
}


sub error {
    my( $q, $reason ) = @_;
    
    print $q->header( "text/html" ),
          $q->start_html( "Error" ),
          $q->h1( "Error" ),
          $q->p( "Your upload was not processed because the following error ",
                 "occurred: " ),
          $q->p( $q->i( $reason ) ),
          $q->end_html;
    exit;
}

We start by creating several constants to configure this script. UPLOAD_DIR is the path to the directory where we will store uploaded files. BUFFER_SIZE is the amount of data to read into memory while transferring from the temporary file to the output file. MAX_FILE_SIZE is the maximum file size we will accept; this is important because we want to limit users from uploading gigabyte-sized files and filling up all of the server's disk space. MAX_DIR_SIZE is the maximum size that we will allow our upload directory to grow to. This restriction is as important as the last because users can fill up our disks by posting lots of small files just as easily as posting large files. Finally, MAX_OPEN_TRIES is the number of times we try to generate a unique filename and open that file before we give up; we'll see why this step is necessary in a moment.

First, we enable file uploads, then we set $CGI::POST_MAX to MAX_FILE_SIZE. Note that $CGI::POST_MAX limits the size of the entire content of the request, which includes the data for other form fields as well as the overhead of the multipart/form-data encoding, so the largest file the script will accept is slightly smaller than this value. For this form, the difference is minor, but if you add a file upload field to a complex form with multiple text fields, then you should keep this distinction in mind.

We then create a CGI object and check for errors. As we said earlier, errors with file uploads are much more common than with other forms of CGI input. Next we get the file's upload name and the filename the user provided, reporting errors if either of these is missing. Note that a user may be rather upset to get a message saying that the filename is missing after uploading a large file via a modem. There is no way to interrupt that transfer, but in a production application, it might be more user-friendly to save the unnamed file temporarily, prompt the user for a filename, and then rename the file. Of course, you would then need to periodically clean up temporary files that were abandoned.

We get a file handle, $fh, to the temporary file where CGI.pm has stored the input. We check whether our upload directory is full and report an error if this is the case. Again, this message is likely to create some unhappy users. In a production application you should add code to notify an administrator who can see why the upload directory is full and resolve the problem. See Chapter 9, "Sending Email".

Next, we replace any characters in the filename the user supplied that may cause problems with an underscore and make sure the name doesn't start with a period or a dash. The odd construct that reassigns the result of the regular expression to $filename untaints that variable. We'll discuss tainting and why this is important in Chapter 8, "Security". We confirm again that $filename is not empty (which would happen if it had consisted of nothing but periods and/or dashes) and generate an error if this is the case.

We try to open a file with this name in our upload directory. If we fail, then we add a digit to $filename and try again. The regular expression allows us to keep the file extension the same: if there is already a report.txt file, then the next upload with that name will be named report1.txt, the next one report2.txt, etc. This continues until we exceed MAX_OPEN_TRIES . It is important that we create a limit to this loop because there may be a reason other than a non-unique name that prevents us from saving the file. If the disk is full or the system has too many open files, for example, we do not want to start looping endlessly. This error should also notify an administrator that something is wrong.
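The renaming step can be tried on its own. The following sketch isolates it in a subroutine; the name next_filename is hypothetical, and as an assumption beyond the behavior described above, it also handles names without an extension by appending the number at the end.

```perl
use strict;
use warnings;

# Hypothetical helper: bump the number before the extension, adding one
# if none is present; names without an extension get the number at the end
sub next_filename {
    my $name = shift;
    $name =~ s/(\d*)(\.\w+)?$/ ( $1 || 0 ) + 1 . ( defined $2 ? $2 : "" ) /e;
    return $name;
}

my $name = "report.txt";
for ( 1 .. 3 ) {
    $name = next_filename( $name );
    print "$name\n";
}
```

Starting from report.txt, the loop prints report1.txt, report2.txt, and report3.txt.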

This script is written to handle any type of file upload, including binary files such as images or audio. By default, whenever Perl accesses a file handle on non-Unix systems (more specifically, systems that do not use \n as their end of line character), Perl translates the native operating system's end of line characters, such as \r\n for Windows or \r for MacOS, to \n on input and back to the native characters on output. This works great for text files, but it can corrupt binary files. Thus, we enable binary mode with the binmode function in order to disable this translation. On systems, like Unix, where no end of line translation occurs, binmode has no effect.

Finally, we read from our temporary file handle and write to our output file and exit. We use the read function to read and write a chunk of data at a time. The size of this chunk is defined by our BUFFER_SIZE constant. In case you are wondering, CGI.pm will remove its temporary file automatically when our script exits (technically, when $q goes out of scope).
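The binary-mode, chunked copy can be exercised outside of a CGI script. The sketch below is one way to do this, using temporary files from the standard File::Temp module as stand-ins for the upload and the output file. The helper name copy_handle is hypothetical, and unlike the example script it also checks each write for errors.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

use constant BUFFER_SIZE => 16_384;

# Hypothetical helper: copy everything from one file handle to another in
# binary mode, BUFFER_SIZE bytes at a time; returns the bytes copied
sub copy_handle {
    my( $in, $out ) = @_;
    binmode $in;
    binmode $out;

    my( $buffer, $total ) = ( "", 0 );
    while ( my $bytes = read( $in, $buffer, BUFFER_SIZE ) ) {
        print $out $buffer or die "Write failed: $!";
        $total += $bytes;
    }
    return $total;
}

# Stand-in for the upload: a temporary file holding some binary data
my( $src, $src_name ) = tempfile( UNLINK => 1 );
binmode $src;
print $src "\x00\r\n\xff" x 10_000;
close $src or die "Cannot close $src_name: $!";

open my $in, "<", $src_name or die "Cannot read $src_name: $!";
my( $out, $out_name ) = tempfile( UNLINK => 1 );
my $copied = copy_handle( $in, $out );
close $out or die "Cannot close $out_name: $!";
print "copied $copied bytes\n";
```

Because both handles are in binary mode, the \r\n sequences in the sample data are copied unchanged on every platform.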

There is another way we could have moved the file to our uploads directory. We could have used CGI.pm's undocumented tmpFileName method to get the name of the temporary file containing the upload and then used Perl's rename function to move the file. However, relying on undocumented code is dangerous, because it may not be compatible with future versions of CGI.pm. Thus, in our example we stick to the public API instead.

The dir_size subroutine calculates the size of a directory by summing the size of each of its files. The error subroutine prints a message telling the user why the transfer failed. In a production application, you probably want to provide links for the user to get help or to notify someone about problems.




Copyright © 2001 O'Reilly & Associates. All rights reserved.