Data Persistence (CGI Programming with Perl)

10.1. Text Files

One of Perl's greatest strengths is its ability to parse text, and this makes it especially easy to get a web application online quickly using text files as the means of storing data. Although it does not scale to complex queries, this works well for small amounts of data and is very common for Perl CGI applications. We're not going to discuss how to use text files with Perl, since most Perl programmers are already proficient at that task. We're also not going to look at strategies like creating random access files to improve performance, since that warrants a lengthy discussion, and a DBM file is generally a better substitute. We'll simply look at the issues that are particular to using text files with CGI scripts.

10.1.1. Locking

If you write to any files from a CGI script, then you must use some form of file locking. Web servers support numerous concurrent connections, and if two users try to write to the same file at the same time, the result is generally corrupted or truncated data.

10.1.1.1. flock

If your system supports it, using Perl's built-in flock function is the easiest way to do this. How do you know if your system supports flock? Try it: flock will die with a fatal error if your system does not support it. However, flock works reliably only on local files; flock does not work across most NFS systems, even if your system otherwise supports it.[19] flock offers two different modes of locking: exclusive and shared. Many processes can read from a file simultaneously without problems, but only one process should write to the file at a time (and no other process should read from the file while it is being written). Thus, you should obtain an exclusive lock on a file when writing to it and a shared lock when reading from it. The shared lock verifies that no one else has an exclusive lock on the file and delays any exclusive locks until the shared locks have been released.

[19]If you need to lock a file across NFS, refer to the File::LockDir module in Perl Cookbook (O'Reilly & Associates, Inc.).

To use flock, call it with a filehandle to an open file and a number indicating the type of lock you want. These numbers are system-dependent, so the easiest way to get them is to use the Fcntl module. If you supply the :flock argument to Fcntl, it will export LOCK_EX, LOCK_SH, LOCK_UN, and LOCK_NB for you. You can use them as follows:

use Fcntl ":flock";

open FILE, "some_file.txt" or die $!;
flock FILE, LOCK_EX;    # Exclusive lock
flock FILE, LOCK_SH;    # Shared lock
flock FILE, LOCK_UN;    # Unlock

Closing a filehandle releases any locks, so there is generally no need to specifically unlock a file. In fact, it can be dangerous to do so if you are locking a filehandle that uses Perl's tie mechanism. See file locking in the DBM section of this chapter for more information.
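As a minimal sketch of the two lock modes in practice, the following script appends to a file under an exclusive lock and then reads it back under a shared lock. The filename flock_demo.txt and the use of lexical filehandles are assumptions made for this illustration; the return value of each flock call is checked rather than ignored.

```perl
use strict;
use warnings;
use Fcntl ":flock";

# Hypothetical filename, used only for this illustration
my $file = "flock_demo.txt";

# Writer: take an exclusive lock before appending
open my $out, ">>", $file or die "Cannot open $file: $!";
flock $out, LOCK_EX       or die "Cannot lock $file: $!";
print $out "one line of data\n";
close $out                or die "Cannot close $file: $!";  # releases the lock

# Reader: a shared lock is enough, and other readers may hold one too
open my $in, "<", $file   or die "Cannot open $file: $!";
flock $in, LOCK_SH        or die "Cannot lock $file: $!";
my @lines = <$in>;
close $in;                                                  # releases the lock
```

Neither handle is unlocked explicitly; closing it is enough.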

Some systems do not support shared file locks and use exclusive locks for them instead. You can use the script in Example 10-1 to test what flock supports on your system.

Example 10-1. flock_test.pl

#!/usr/bin/perl -wT

use IO::File;
use Fcntl ":flock";

*FH1 = new_tmpfile IO::File or die "Cannot open temporary file: $!\n";

eval { flock FH1, LOCK_SH };
$@ and die "It does not look like your system supports flock: $@\n";

open FH2, ">>&FH1" or die "Cannot dup filehandle: $!\n";

if ( flock FH2, LOCK_SH | LOCK_NB ) {
    print "Your system supports shared file locks\n";
}
else {
    print "Your system only supports exclusive file locks\n";
}

If you need to both read and write to a file, then you have two options: you can open the file exclusively for read/write access, or if you only have to do limited writing and what you're writing does not depend on the contents of the file, you can open and close the file twice: once shared for reading and once exclusive for writing. This is generally less efficient than opening the file once, but if you have lots of processes needing to access the file that are doing lots of reading and little writing, it may be more efficient to reduce the time that one process is tying up the file while holding an exclusive lock on it.
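The first option, a single read/write open under one exclusive lock, might look like the following sketch. It assumes a file that holds a single number, and the sub name increment_counter is hypothetical. Because the new value depends on the old one, the lock has to be held across both the read and the write.

```perl
use strict;
use warnings;
use Fcntl qw(:DEFAULT :flock :seek);

# Increment a number stored in $file while holding one exclusive lock.
# The sub name and the one-number file format are assumptions.
sub increment_counter {
    my $file = shift;

    # O_RDWR | O_CREAT opens for reading and writing without truncating
    sysopen my $fh, $file, O_RDWR | O_CREAT, 0644
        or die "Cannot open $file: $!";
    flock $fh, LOCK_EX or die "Cannot lock $file: $!";

    my $count = <$fh>;
    $count = 0 unless defined $count;   # empty or brand new file
    chomp $count;
    $count++;

    # Rewind and empty the file before writing the new value
    seek $fh, 0, SEEK_SET or die "Cannot seek in $file: $!";
    truncate $fh, 0       or die "Cannot truncate $file: $!";
    print $fh "$count\n";
    close $fh             or die "Cannot close $file: $!";

    return $count;
}
```

Opening the file with ">" instead would truncate it before the lock is obtained, which is why the sketch uses sysopen and truncates only after the lock is held.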

Typically when you use flock to lock a file, it halts the execution of your script until it can obtain a lock on your file. The LOCK_NB option tells flock that you do not want it to block execution, but allow your script to continue if it cannot obtain a lock. Here is one way to time out if you cannot obtain a lock on a file:

my $count = 0;
my $delay = 1;
my $max   = 15;

open FILE, ">> $filename" or
    error( $q, "Cannot open file: your data was not saved" );

until ( flock FILE, LOCK_EX | LOCK_NB ) {
    error( $q, "Timed out waiting to write to file: " .
                     "your data was not saved" ) if $count >= $max;
    sleep $delay;
    $count += $delay;
}

In this example, the code tries to get a lock. If it fails, it waits a second and tries again. After fifteen seconds, it gives up and reports an error.
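The same polling loop can be wrapped in a reusable function. This is only a sketch: the name lock_with_timeout and its argument order are assumptions, and the caller decides how to report a failure.

```perl
use strict;
use warnings;
use Fcntl ":flock";

# Try to lock $fh in mode $mode (LOCK_EX or LOCK_SH), retrying every
# $delay seconds. Returns 1 on success, or 0 once $max seconds have
# passed without obtaining the lock. The sub name is hypothetical.
sub lock_with_timeout {
    my ( $fh, $mode, $max, $delay ) = @_;
    $delay ||= 1;
    my $waited = 0;

    until ( flock $fh, $mode | LOCK_NB ) {
        return 0 if $waited >= $max;
        sleep $delay;
        $waited += $delay;
    }
    return 1;
}
```

A CGI script could then call lock_with_timeout( $fh, LOCK_EX, 15 ) and report an error to the user if it returns false.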

10.1.1.2. Manual lock files

If your system does not support flock, you will need to manually create your own lock files. As the Perl FAQ points out (see perlfaq5 ), this is not as simple as you might think. The problem is that you must check for the existence of a file and create the file as one operation. If you first check whether a lock file exists, and then try to create one if it does not, another process may have created its own lock file after you checked, and you just overwrote it.

To create your own lock file, use the following command:

use Fcntl;
.
.
.
sysopen LOCK_FILE, "$filename.lock", O_WRONLY | O_EXCL | O_CREAT, 0644
    or error( $q, "Unable to lock file: your data was not saved" );

The O_EXCL flag provided by Fcntl, when combined with O_CREAT, tells the system to create and open the file only if it does not already exist; otherwise sysopen fails. Note that this will not reliably work on an NFS filesystem.
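A lock file is only useful if it is retried while held by someone else and removed when you are done. Here is a minimal sketch of that cycle; the sub names acquire_lock and release_lock, the retry count, and the one-second delay are all assumptions.

```perl
use strict;
use warnings;
use Fcntl;   # exports O_WRONLY, O_EXCL, and O_CREAT by default

# Returns 1 if this process created "$filename.lock", or 0 if it gave
# up after $tries attempts. Both sub names are hypothetical.
sub acquire_lock {
    my ( $filename, $tries ) = @_;
    $tries ||= 10;

    for my $attempt ( 1 .. $tries ) {
        if ( sysopen my $lock, "$filename.lock",
                     O_WRONLY | O_EXCL | O_CREAT, 0644 ) {
            print $lock "$$\n";    # record our process ID for debugging
            close $lock;
            return 1;
        }
        sleep 1 if $attempt < $tries;
    }
    return 0;
}

# Remove the lock file; returns true if a file was deleted
sub release_lock {
    my $filename = shift;
    return unlink "$filename.lock";
}
```

If a script exits without calling release_lock, the stale lock file blocks every later request until someone removes it, so the call belongs on every exit path.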

10.1.2. Write Permissions

In order to create or update a text file, you must have the appropriate permissions. This may sound basic, but it is a common source of errors in CGI scripts, especially on Unix filesystems. Let's review how Unix file permissions work.

Files have both an owner and a group. By default, these match the user and group of the user or process who creates the file. There are three different levels of permissions for a file: the owner's permissions, the group's permissions, and everyone else's permissions. Each of these may have read access, write access, and/or execute access for a file.

Your CGI scripts can only modify a file if nobody (or the user your web server runs as) has write access to the file. This occurs if the file is writable by everyone, if it is writable by members of the file's group and nobody is a member of that group, or if nobody owns the file and the file is writable by its owner.

In order to create or remove a file, nobody must have write permission to the directory containing the file. The same rules about owner, group, and other users apply to directories as they do for files. In addition, the execute bit must be set for the directory. For directories, the execute bit determines scan access, which is the ability to change to the directory.

Even though your CGI script may not be able to modify a file, it may be able to replace it. If nobody has permission to write to a directory, then it can remove files in the directory in addition to creating new files, even with the same name. Write permissions on the file do not typically affect the ability to remove or replace the file as a whole.
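Perl's file test operators offer a quick way to check these rules from inside a script, since they report what the current process is allowed to do. The following sketch uses a hypothetical sub name, describe_access, and distinguishes the ability to modify a file from the ability to replace it.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Report what the current process may do with $path.
# The sub name and the hash keys are hypothetical.
sub describe_access {
    my $path = shift;
    my $dir  = dirname($path);

    return {
        # Write permission on the file itself
        can_modify  => ( ( -w $path ) ? 1 : 0 ),
        # Write and execute permission on the containing directory
        can_replace => ( ( -w $dir && -x $dir ) ? 1 : 0 ),
    };
}
```

Running this from a CGI script, rather than from your own shell account, shows the permissions as the web server's user sees them.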

10.1.3. Temporary Files

Your CGI scripts may need to create temporary files for a number of reasons. You can reduce memory consumption by creating files to hold data as you process it; you gain memory efficiency at the cost of speed. You may also use external commands that perform their actions on text files.

10.1.3.1. Anonymous temporary files

Typically, temporary files are anonymous; they are created by opening a handle to a new file and then immediately deleting the file. Your CGI script will continue to have a filehandle to access the file, but the data cannot be accessed by other processes, and the data will be reclaimed by the filesystem once your CGI script closes the filehandle. (Not all systems support this feature.)

As with most common tasks, there is a Perl module that makes managing temporary files much simpler. IO::File will create anonymous temporary files for you with the new_tmpfile class method; it takes no arguments. You can use it like this:[20]

[20]Actually, if the filesystem does not support anonymous temporary files, then IO::File will not create it anonymously, but it's still anonymous to you since you cannot get at the name of the file. IO::File will take care of managing and deleting the file for you when its filehandle goes out of scope or your script completes.

use IO::File;
.
.
.
my $tmp_fh = new_tmpfile IO::File;

You can then read and write to $tmp_fh just as you would any other filehandle:

print $tmp_fh "</html>\n";

seek $tmp_fh, 0, 0;
while (<$tmp_fh>) {
    print;
}

10.1.3.2. Named temporary files

Another option is to create a file and delete it when you are finished with it. One advantage is that you have a filename that can be passed to other processes and functions. Also, using the IO::File module is considerably slower than managing the file yourself. However, using named temporary files has two drawbacks. First, greater care must be taken choosing a unique filename so that two scripts will not attempt to use the same temporary file at the same time. Second, the CGI script must delete the file when it is finished, even if it encounters an error and exits prematurely.

The Perl FAQ suggests using the POSIX module to generate a temporary filename and an END block to ensure it will be cleaned up:

use Fcntl;
use POSIX qw(tmpnam);
.
.
.
my $tmp_filename;

# try new temporary filenames until we get one that doesn't already
# exist; the check should be unnecessary, but you can't be too careful
do { $tmp_filename = tmpnam(  ) }
    until sysopen( FH, $tmp_filename, O_RDWR|O_CREAT|O_EXCL );

# install atexit-style handler so that when we exit or die,
# we automatically delete this temporary file
END { unlink( $tmp_filename ) or die "Couldn't unlink $tmp_filename: $!" }

If your system doesn't support POSIX, then you will have to create the file in a system-dependent fashion instead.
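On newer versions of Perl, the tmpnam function is no longer available from the POSIX module (it was removed in Perl 5.26), and the standard File::Temp module is the usual replacement. The following is a sketch of that alternative rather than the FAQ's approach; the cgi_XXXXXX template is an assumption.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# tempfile() picks a unique name and creates the file in one step.
# TMPDIR => 1 places it in the system's temporary directory, and
# UNLINK => 1 asks File::Temp to delete it when the program exits.
my ( $fh, $tmp_filename ) =
    tempfile( "cgi_XXXXXX", TMPDIR => 1, UNLINK => 1 );

print $fh "scratch data\n";
close $fh or die "Cannot close $tmp_filename: $!";

# $tmp_filename can now be passed to an external command or function
```

File::Temp replaces the trailing X characters in the template with random characters, so no retry loop or END block is needed here.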

10.1.4. Delimiters

If you need to include multiple fields of data in each line of your text file, you will likely use delimiters to separate them. Another option is to create fixed-length records, but we won't get into these files here. Common characters to use for delimiting files are commas, tabs, and pipes (|).

Commas are primarily used in CSV files, which we will discuss presently. CSV files can be difficult to parse accurately because they can include non-delimiting commas as part of a value. When working with CSV files, you may want to consider the DBD::CSV module; this gives you a number of additional benefits, which we will discuss shortly.
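To see why non-delimiting commas cause trouble, compare a plain split with a quote-aware split. This sketch uses the standard Text::ParseWords module and a made-up sample line. Text::ParseWords is not a full CSV parser (it does not follow every CSV quoting convention), so treat it as an illustration of the problem rather than a recommended solution.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Made-up sample record: two of the three fields contain commas
my $line = '"Smith, John",42,"New York, NY"';

# A plain split breaks the quoted fields apart
my @naive = split /,/, $line;

# quotewords honors the quotes and strips them (second argument 0)
my @quoted = quotewords( ',', 0, $line );

print scalar(@naive),  " pieces from split\n";
print scalar(@quoted), " fields from quotewords\n";
```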

Tabs are not generally included within data, so they make convenient delimiters. Even so, you should always check your data and encode or remove any tabs or end-of-line characters before writing to your file. This ensures that your data does not become corrupted if someone happens to pass a newline character in the middle of a field. Remember, even if you are reading data from an HTML form element that would not normally accept a newline character as part of it, you should never trust the user or that user's browser.

Here is an example of functions you can use to encode and decode data:

sub encode_data {
    my @fields = map {
        s/\\/\\\\/g;
        s/\t/\\t/g;
        s/\n/\\n/g;
        s/\r/\\r/g;
        $_;
    } @_;
    
    my $line = join "\t", @fields;
    return "$line\n";
}

sub decode_data {
    my $line = shift;
    
    chomp $line;
    my @fields = split /\t/, $line;
    
    return map {
        s/\\(.)/$1 eq 't' and "\t" or
                $1 eq 'n' and "\n" or
                $1 eq 'r' and "\r" or
                "$1"/eg;
        $_;
    } @fields;
}

These functions encode tabs and end-of-line characters with the common escape sequences that Perl and other languages use (\t, \r, and \n). Because they introduce the backslash as an escape character, they must also escape any backslashes already in the data.

The encode_data sub takes a list of fields and returns a single encoded scalar that can be written to the file; decode_data takes a line read from the file and returns a list of decoded fields. You can use them as shown in Example 10-2.
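One detail worth knowing is that the map block in encode_data applies its substitutions to $_, which is an alias for each argument, so the caller's variables are modified as well. The following sketch shows a variant that works on copies and then round-trips a sample record. The names encode_fields and decode_fields and the sample data are assumptions; the escaping scheme is the same one described above.

```perl
use strict;
use warnings;

# Escape each field and join the fields with tabs
sub encode_fields {
    my @fields = @_;    # copy, so the caller's data is left untouched
    for (@fields) {
        s/\\/\\\\/g;    # backslashes first, before adding new ones
        s/\t/\\t/g;
        s/\n/\\n/g;
        s/\r/\\r/g;
    }
    return join( "\t", @fields ) . "\n";
}

# Split a line on tabs and undo the escaping in each field
sub decode_fields {
    my $line = shift;
    chomp $line;
    my %unescape = ( t => "\t", n => "\n", r => "\r" );
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    s/\\(.)/exists $unescape{$1} ? $unescape{$1} : $1/eg for @fields;
    return @fields;
}

# Sample record with a tab, a newline, and a literal backslash
my @record  = ( "Ann\tLee", "line one\nline two", 'C:\temp' );
my $line    = encode_fields(@record);
my @decoded = decode_fields($line);
```

The encoded line contains only the two delimiting tabs and the final newline, and @decoded holds the same three values as @record.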

Example 10-2. sign_petition.cgi

#!/usr/bin/perl -wT

use strict;
use Fcntl ":flock";
use CGI;
use CGIBook::Error;

my $DATA_FILE = "/usr/local/apache/data/tab_delimited_records.txt";

my $q       = new CGI;
my $name    = $q->param( "name" );
my $comment = substr( $q->param( "comment" ), 0, 80 );

unless ( $name ) {
    error( $q, "Please enter your name." );
}

open DATA_FILE, ">> $DATA_FILE" or die "Cannot append to $DATA_FILE: $!";
flock DATA_FILE, LOCK_EX;
seek DATA_FILE, 0, 2;

print DATA_FILE encode_data( $name, $comment );
close DATA_FILE;

print $q->header( "text/html" ),
      $q->start_html( "Our Petition" ),
      $q->h2( "Thank You!" ),
      $q->p( "Thank you for signing our petition. ",
             "Your name has been added below:" ),
      $q->hr,
      $q->start_table,
      $q->tr( $q->th( "Name", "Comment" ) );
      
open DATA_FILE, $DATA_FILE or die "Cannot read $DATA_FILE: $!";
flock DATA_FILE, LOCK_SH;

while (<DATA_FILE>) {
    my @data = decode_data( $_ );
    print $q->tr( $q->td( @data ) );
}
close DATA_FILE;

print $q->end_table,
      $q->end_html;


sub encode_data {
    my @fields = map {
        s/\\/\\\\/g;
        s/\t/\\t/g;
        s/\n/\\n/g;
        s/\r/\\r/g;
        $_;
    } @_;
    
    my $line = join "\t", @fields;
    return $line . "\n";
}

sub decode_data {
    my $line = shift;
    
    chomp $line;
    my @fields = split /\t/, $line;
    
    return map {
        s/\\(.)/$1 eq 't' and "\t" or
                $1 eq 'n' and "\n" or
                $1 eq 'r' and "\r" or
                "$1"/eg;
        $_;
    } @fields;
}

Note that organizing your code this way gives you another benefit. If you later decide you want to change the format of your data, you do not need to change your entire CGI script, just the encode_data and decode_data functions.

10.1.5. DBD::CSV

As we mentioned at the beginning of this chapter, it's great to modularize your code so that changing the data format affects only a small chunk of your application. However, it's even better if you don't have to change that chunk either. If you are creating a simple application that you expect to grow, you may want to consider developing your application using CSV files. CSV (comma separated values) files are text files formatted such that each line is a record, and fields are delimited by commas. The advantage to using CSV files is that you can use Perl's DBI and DBD::CSV modules, which allow you to access the data via basic SQL queries just as you would for an RDBMS. Another benefit of CSV format is that it is quite common, so you can easily import and export it from other applications, including spreadsheets like Microsoft Excel.

There are drawbacks to developing with CSV files. DBI adds a layer of complexity to your application that you would not otherwise need if you accessed the data directly. DBI and DBD::CSV also allow you to create only simple SQL queries, and they are certainly not as fast as a true relational database system, especially for large amounts of data.

However, if you need to get a project going, knowing that you will move to an RDBMS, and if DBD::CSV meets your immediate requirements, then this strategy is certainly a good choice. We will look at an example that uses DBD::CSV later in this chapter.