Attaching Files to Forms

Reuven M. Lerner

Issue #46, February 1998

Mr. Lerner shows us a way to use file elements to allow web site visitors to upload information or program files to the site.

Here's a relatively easy question for anyone who has been working with the Web for a while: How would you allow visitors to your site to send you their name and address? The easiest solution would be to use an HTML form containing one or more text fields. The contents of the form would be sent to a CGI program on the site's server, which would retrieve the fields' contents.

What if we were interested in sending more than just a few words or phrases? What if we were interested in allowing visitors to our site to enter large volumes of text? We could use a <textarea> tag, which gives the user much more room in which to write. But <textarea> elements, like all HTML form elements, are still a bit clunky. Wouldn't it be nice if there were a way to simply attach a file to an HTML form, much as we can attach files to e-mail messages?

Using the relatively unknown file element, we can do just that. File elements are in many ways similar to text and hidden elements, in that they contain strings of text, rather than simple on-off indicators (as with check boxes) or one of a number of possible strings (as with radio buttons and selection lists). In addition to sending the name of the file, file elements send the contents of the file along with the HTML form.

Before we get to some practical uses for the file element, let's look at a simple example of what is possible. Our initial form shown in Listing 1 contains a single file element and a submit button. It is similar to forms that we have seen in previous installments of At the Forge and is probably quite similar to forms that you have seen on other web sites. The form begins with a <form> tag indicating the method that should be used when sending the data (POST) and the CGI program to which the data is to be sent (/cgi-bin/upload-file.pl). We then have a single form element, as well as a “submit” button for sending the data.

There are two differences between this form and the forms that we have seen before. For starters, the form element has a third attribute (ENCTYPE) that we can generally ignore, because the default value (application/x-www-form-urlencoded) is sufficient for most purposes. However, URL-encoding (in which characters are replaced by a percent sign followed by their ASCII code in hexadecimal, e.g., the space character becomes %20) is inefficient when used on large files, particularly when those files contain a large number of characters that require the coding. In addition, we want to separate the form elements from the file (or files) being uploaded, and we want to tag the uploaded file with a MIME-style content-type header indicating the type of data that is being sent.

For all of these reasons, using the form element requires the use of a new ENCTYPE setting (one defined in RFC 1867 available on the Web at http://www.internic.net/): multipart/form-data. With this encoding type, the contents of each uploaded file will be sent separately, without URL-encoding and with a “Content-type” header describing the type of data contained within. Aside from having to remember to set the encoding type explicitly at the top of any forms containing file elements, we do not have to worry too much about the way in which files are submitted to our CGI programs.

The other new element in the above form is the file element itself. When presented in HTML source code, a file element appears to be quite similar to a “text” element. We assign it a name, and the value will presumably come from the user.

The file element is different from other elements in two ways. First of all, it tells the user's browser to send not just the file name specified in the file element, but also the contents of the file associated with that name.

More obvious to the user, however, is the fact that a file element appears in the user's browser as the combination of a text field and a button. The user can enter a filename by typing into the text field, or—and this is the unusual part—she can browse through the file system using a dialog box brought up by the “browse” button. When the user selects a file to upload by using the “browse” button, the name of the file is entered into the text field, as if the user had typed it.

Receiving the File

Now that we know how to set up the form for uploading files, let's look at a small CGI program that will accept the file that was uploaded. For starters in Listing 2 we will simply have our program print the uploaded file on the screen.

Let's run through this program, in case you aren't completely familiar with CGI programs written in Perl. First of all, we start up Perl with the -w flag to warn us if we are doing something particularly stupid. We also turn on diagnostics so that Perl will give us a verbose error message if and when it detects an error.

Normally, any program I write includes the line use strict to catch potentially dangerous or foolish constructs that I might have built. However, as you will see, we will be playing some games with references later on, and we must turn off the strict package when dealing with references so that our program does not crash. Immediately after importing the “strict” module, we thus turn off strict checking on references by using the “no” pragma (a construct for telling Perl how to handle your program).

Then, we load CGI.pm, the package that takes care of most of the dirty work for CGI programs. We create an instance of CGI and use the “header” method to send an initial MIME-style header to the user's browser, indicating that we will be sending HTML-encoded text in our response.

Next, we retrieve the value of the file name entered by the user from the form element named userfile and put it into a variable named $userfile. Until now, $userfile could have come from a text or hidden element as easily as from a form element.

Now comes the wild part. We use $username as a file handle, and iterate over it using the <> operator to retrieve the contents of the file. I must admit that when I first wrote programs that took advantage of uploading files, I was floored—could I really use the variable that I had assigned as a file handle? The answer is that it does indeed work rather well.

Checking the Uploaded File's Type

There is at least one problem with our program. What happens if the user uploads a GIF or JPEG image? We will end up displaying a good deal of garbage on the user's screen, since the image will be sent and then displayed as if it were HTML.

One solution is to use the accept attribute that can be used with the file element. In theory, accept should be set to one or more file types that the user should be allowed to send via the form. Thus, if we were only interested in receiving HTML files, we could say:

<P>File to upload: <INPUT NAME="userfile"
        TYPE="file" value=""
        accept="*.html"></P>

This statement would restrict the user to uploading files with the .html extension, which would presumably be HTML files. In practice, I have found that while the accept setting changes the filter in the window brought up by the browse button, the accept setting is not enforced, and users can enter whatever they might like in the text field.

If we are truly interested in ensuring that only HTML files are uploaded to our program, we need to modify our CGI program such that it checks the Content-type header of a file before displaying it. If the file has a Content-type of text/html, it is considered acceptable and printed; if not, a short error message is displayed.

We can check the headers associated with a file by calling uploadInfo on the file name ($userfile), which returns a reference to a hash, which in turn contains all of the headers associated with a particular file. Listing 3 is a slightly modified version of our previous program which prints the headers before the file.

Once we have retrieved the headers as shown in Listing 3, it becomes relatively easy to receive only certain kinds of files. We could, for instance, add the following lines to our program to ensure that uploaded files have a Content-type of text/html:

if ($headers{"Content-Type"} ne "text/html")
        {
        print "<P>Sorry, only HTML files.</P>\n";
        exit;
        }

Doing Something with the File

Uploading files is fine and dandy, but the point of uploading files is to use them, not simply display them on the screen for users to see. Now that we have seen how to upload a program from the user's browser to a CGI program running on our Web server, let's try to use this program for something practical.

Let's take a simple example, one which comes from a program that I wrote for a site. Our site sat on a server rented from a web space provider, meaning that while we had control over our individual HTML files and CGI programs, system administration (including user names) was controlled by the company from which we rented the space.

The problem began when those of us working on the site decided that we wished to allow members of various affiliated organizations and subgroups to add HTML files within particular directories. That is, we created a directory for each group affiliated with our site, and expected that each group would be able to add and modify HTML files as necessary.

However, we only had a single login for our site, and we certainly didn't wish to jeopardize the site's security by giving out that user name and password to each of the 40 or 50 affiliated organizations. At the same time, given that each organization would change no more than ten HTML files in a given month, it seemed an extreme and costly measure to order user names for each group.

We finally decided to allow users to upload files into their organization's directory—and only into that directory—using an HTML form similar to the one we saw above. Whereas the above form only requested that the user enter a file name (either by typing it in directly or by clicking on the browse button), we now asked for three additional pieces of information: the directory into which the file should be deposited, the name that the file should be given once it reaches that directory and the password for that directory.

We ask for a directory name and password to ensure that users only deposit files into the directory for which they have been given permission by the system administrators. Passwords are not a perfect security system, but they work relatively well, are portable and are easy to understand. In this way, members of group A can upload into the A directory and members of group B can upload into the B directory, and both groups can be sure that no one else is modifying the files in their directory.

One possible version of the HTML form that could be used to upload files to a CGI program is in Listing 4. Notice that in that file, I have used HTML tables to separate elements. That choice was made for purely aesthetic reasons, so that each of the form elements would line up with one another.

When the user enters a file name, directory name, password and destination name in the form and clicks on the “Upload file” button, the four form elements are sent to the CGI program (/cgi-bin/upload-file.pl) along with the contents of the file named in the file element, named userfile. We want to write a program that takes the contents of userfile and saves it in the appropriate directory (the section element) with the appropriate name (the file name element). Of course, all of this happens only if the user enters the correct password for that section.

Writing a program that does this sounds pretty straightforward, right? Well, it is; take a look at Listing 5 to see what a basic version might look like.

The password system in this program is a simple hash whose keys are the different sections and whose values are the passwords for those sections. If you have a small number of sections on your site, you can set the passwords within the program , as shown in Listing 5.

Even though it is easy to add new sections and passwords in this way, it is not a good idea for you to modify the source code for something this simple. It is a good idea to put password information in a text file, a DBM-style file (which is basically a hash saved to disk), or even a small SQL database (as we saw in a series of At the Forge installments earlier this year). Then again, if the number of sections is small and doesn't change very often, you might simply stick with the example system displayed here.

In order to ensure that users do not abuse the uploading system, we remove everything in the uploaded file name up to and including the first slash. This makes it rather difficult for someone to try to deposit one of their files in someone else's directory by taking advantage of the “..” directory name, an option that means “use this directory's parent”.

After checking to make sure that we have received all of the required information, we compare the correct password with what the user entered:

&log_and_die("Incorrect password")
        unless ($PASSWORD{$section} eq $password);

Now that we have established that we have all of the needed information and that the user is authorized, we save the file to disk using a simple “while” loop that reads from $userfile (which can be treated as a file handle) and printing to FILE, the handle that we created for saving information to disk in Listing 5:

open (FILE, ">$saved_filename") || &log_and_die(
           "Cannot write to $saved_filename: $! ");
        while (<$userfile>)
        {
        print FILE;
        }
close (FILE);

Finally, let's look at log_and_die shown at the end of Listing 5, a subroutine I include in many CGI programs, which allows us to die relatively gracefully with a reasonable message sent to the user and the error log. This is a far better way for the program to crash than producing the unfriendly “500 Server error” message that is all too common on web sites these days.

When we execute statements such as:

open (FILE, ">$saved_filename") || &log_and_die(
        "Cannot write to $saved_filename: $! ");

we are saying, “Open the file $saved_filename for writing, and allow me to write to $saved_filename using the FILE file handle. But if you cannot open the file, send a message to the user indicating that you cannot open it, along with $!, Perl's special variable containing the most recent error message.” Not only is this message better for visitors to your site, it can be more useful to you when debugging your program.

There are a few drawbacks to this program that deserve mention. The lack of an intelligent backup system (which would be relatively easy to add), the inability to deal with subdirectories (useful when you want to separate images from text) and a fairly primitive user interface are all strikes against this system. But over time, this has proved to be a good compromise between insecurity (i.e., giving everyone affiliated with the site access via the same user name and password) and expense (i.e., buying a new account for each affiliated group, even though the accounts will rarely be used).

The file tag is not something that you will use every day, but when you need to allow users to upload information to your server, it can be quite handy.