Book HomeCGI Programming with PerlSearch this book

Chapter 10. Data Persistence

Contents:

Text Files
DBM Files
Introduction to SQL
DBI

Many basic web applications can be created that output only email and web documents. However, if you begin building larger web applications, you will eventually need to store data and retrieve it later. This chapter will discuss various ways to do this with different levels of complexity. Text files are the simplest way to maintain data, but they quickly become inefficient when the data becomes complex or grows too large. A DBM file provides much faster access, even for large amounts of data, and DBM files are very easy to use with Perl. However, this solution is also limited when the data grows too complex. Finally, we will investigate relational databases. A relational database management system (RDBMS) provides high performance even with complex queries. However, an RDBMS is more complicated to set up and use than the other solutions.

Applications evolve and grow larger. What may start out as a short, simple CGI script may gain feature upon feature until it has grown to a large, complex application. Thus, when you design web applications, it is a good idea to develop them so that they are easily expandable.

One solution is to make your solutions modular. You should try to abstract the code that reads and writes data so the rest of the code does not know how the data is stored. By reducing the dependency on the data format to a small chunk of code, it becomes easier to change your data format as you need to grow.

10.1. Text Files

One of Perl's greatest strengths is its ability to parse text, and this makes it especially easy to get a web application online quickly using text files as the means of storing data. Although it does not scale to complex queries, this works well for small amounts of data and is very common for Perl CGI applications. We're not going to discuss how to use text files with Perl, since most Perl programmers are already proficient at that task. We're also not going to look at strategies like creating random access files to improve performance, since that warrants a lengthy discussion, and a DBM file is generally a better substitute. We'll simply look at the issues that are particular to using text files with CGI scripts.

10.1.1. Locking

If you write to any files from a CGI script, then you must use some form of file locking. Web servers support numerous concurrent connections, and if two users try to write to the same file at the same time, the result is generally corrupted or truncated data.

10.1.1.1. flock

If your system supports it, using the flock command is the easiest way to do this. How do you know if your system supports flock? Try it: flock will die with a fatal error if your system does not support it. However, flock works reliably only on local files; flock does not work across most NFS systems, even if your system otherwise supports it.[19] flock offers two different modes of locking: exclusive and shared. Many processes can read from a file simultaneously without problems, but only one process should write to the file at a time (and no other process should read from the file while it is being written). Thus, you should obtain an exclusive lock on a file when writing to it and a shared lock when reading from it. The shared lock verifies that no one else has an exclusive lock on the file and delays any exclusive locks until the shared locks have been released.

[19]If you need to lock a file across NFS, refer to the File::LockDir module in Perl Cookbook (O'Reilly & Associates, Inc.).

To use flock, call it with a filehandle to an open file and a number indicating the type of lock you want. These numbers are system-dependent, so the easiest way to get them is to use the Fcntl module. If you supply the :flock argument to Fcntl, it will export LOCK_EX, LOCK_SH, LOCK_UN, and LOCK_NB for you. You can use them as follows:

use Fcntl ":flock";

open FILE, "some_file.txt" or die $!;
flock FILE, LOCK_EX;    # Exclusive lock
flock FILE, LOCK_SH;    # Shared lock
flock FILE, LOCK_UN;    # Unlock

Closing a filehandle releases any locks, so there is generally no need to specifically unlock a file. In fact, it can be dangerous to do so if you are locking a filehandle that uses Perl's tie mechanism. See file locking in the DBM section of this chapter for more information.

Some systems do not support shared file locks and use exclusive locks for them instead. You can use the script in Example 10-1 to test what flock supports on your system.

Example 10-1. flock_test.pl

#!/usr/bin/perl -wT

use IO::File;
use Fcntl ":flock";

*FH1 = new_tmpfile IO::File or die "Cannot open temporary file: $!\n";

eval { flock FH1, LOCK_SH };
$@ and die "It does not look like your system supports flock: $@\n";

open FH2, ">> &FH1" or die "Cannot dup filehandle: $!\n";

if ( flock FH2, LOCK_SH | LOCK_NB ) {
    print "Your system supports shared file locks\n";
}
else {
    print "Your system only supports exclusive file locks\n";
}

If you need to both read and write to a file, then you have two options: you can open the file exclusively for read/write access, or if you only have to do limited writing and what you're writing does not depend on the contents of the file, you can open and close the file twice: once shared for reading and once exclusive for writing. This is generally less efficient than opening the file once, but if you have lots of processes needing to access the file that are doing lots of reading and little writing, it may be more efficient to reduce the time that one process is tying up the file while holding an exclusive lock on it.

Typically when you use flock to lock a file, it halts the execution of your script until it can obtain a lock on your file. The LOCK_NB option tells flock that you do not want it to block execution, but allow your script to continue if it cannot obtain a lock. Here is one way to time out if you cannot obtain a lock on a file:

my $count = 0;
my $delay = 1;
my $max   = 15;

open FILE, ">> $filename" or
    error( $q, "Cannot open file: your data was not saved" );

until ( flock FILE, LOCK_SH | LOCK_NB ) {
    error( $q, "Timed out waiting to write to file: " .
                     "your data was not saved" ) if $count >= $max;
    sleep $delay;
    $count += $delay;
}

In this example, the code tries to get a lock. If it fails, it waits a second and tries again. After fifteen seconds, it gives up and reports an error.

10.1.2. Write Permissions

In order to create or update a text file, you must have the appropriate permissions. This may sound basic, but it is a common source of errors in CGI scripts, especially on Unix filesystems. Let's review how Unix file permissions work.

Files have both an owner and a group. By default, these match the user and group of the user or process who creates the file. There are three different levels of permissions for a file: the owner's permissions, the group's permissions, and everyone else's permissions. Each of these may have read access, write access, and/or execute access for a file.

Your CGI scripts can only modify a file if nobody (or the user your web server runs as) has write access to the file. This occurs if the file is writable by everyone, if it is writable by members of the file's group and nobody is a member of that group, or if nobody owns the file and the file is writable by its owner.

In order to create or remove a file, nobody must have write permission to the directory containing the file. The same rules about owner, group, and other users apply to directories as they do for files. In addition, the execute bit must be set for the directory. For directories, the execute bit determines scan access, which is the ability to change to the directory.

Even though your CGI script may not modify a file, it may be able to replace it. If nobody has permission to write to a directory, then it can remove files in the directory in addition to creating new files, even with the same name. Write permissions on the file do not typically affect the ability to remove or replace the file as a whole.

10.1.3. Temporary Files

Your CGI scripts may need to create temporary files for a number of reasons. You can reduce memory consumption by creating files to hold data as you process it; you gain efficiency by sacrificing performance. You may also use external commands that perform their actions on text files.

10.1.4. Delimiters

If you need to include multiple fields of data in each line of your text file, you will likely use delimiters to separate them. Another option is to create fixed-length records, but we won't get into these files here. Common characters to use for delimiting files are commas, tabs, and pipes (|).

Commas are primarily used in CSV files, which we will discuss presently. CSV files can be difficult to parse accurately because they can include non-delimiting commas as part of a value. When working with CSV files, you may want to consider the DBD::CSV module; this gives you a number of additional benefits, which we will discuss shortly.

Tabs are not generally included within data, so they make convenient delimiters. Even so, you should always check your data and encode or remove any tabs or end-of-line characters before writing to your file. This ensures that your data does not become corrupted if someone happens to pass a newline character in the middle of a field. Remember, even if you are reading data from an HTML form element that would not normally accept a newline character as part of it, you should never trust the user or that user's browser.

Here is an example of functions you can use to encode and decode data:

sub encode_data {
    my @fields = map {
        s/\\/\\\\/g;
        s/\t/\\t/g;
        s/\n/\\n/g;
        s/\r/\\r/g;
        $_;
    } @_;
    
    my $line = join "\t", @fields;
    return "$line\n";
}

sub decode_data {
    my $line = shift;
    
    chomp $line;
    my @fields = split /\t/, $line;
    
    return map {
        s/\\(.)/$1 eq 't' and "\t" or
                $1 eq 'n' and "\n" or
                $1 eq 'r' and "\r" or
                "$1"/eg;
        $_;
    } @fields;
}

These functions encode tabs and end-of-line characters with the common escape characters that Perl and other languages use (\t, \r, and \n). Because it is introducing additional backslashes as an escape character, it must also escape the backslash character.

The encode_data sub takes a list of fields and returns a single encoded scalar that can be written to the file; decode_data takes a line read from the file and returns a list of decoded fields. You can use them as shown in Example 10-2.

Note that organizing your code this way gives you another benefit. If you later decide you want to change the format of your data, you do not need to change your entire CGI script, just the encode_data and decode_data functions.

10.1.5. DBD::CSV

As we mentioned at the beginning of this chapter, it's great to modularize your code so that changing the data format affects only a small chunk of your application. However, it's even better if you don't have to change that chunk either. If you are creating a simple application that you expect to grow, you may want to consider developing your application using CSV files. CSV (comma separated values) files are text files formatted such that each line is a record, and fields are delimited by commas. The advantage to using CSV files is that you can use Perl's DBI and DBD::CSV modules, which allow you to access the data via basic SQL queries just as you would for an RDBMS. Another benefit of CSV format is that it is quite common, so you can easily import and export it from other applications, including spreadsheets like Microsoft Excel.

There are drawbacks to developing with CSV files. DBI adds a layer of complexity to your application that you would not otherwise need if you accessed the data directly. DBI and DBD::CSV also allow you to create only simple SQL queries, and it is certainly not as fast as a true relational database system, especially for large amounts of data.

However, if you need to get a project going, knowing that you will move to an RDBMS, and if DBD::CSV meets your immediate requirements, then this strategy is certainly a good choice. We will look at an example that uses DBD::CSV later in this chapter.



Library Navigation Links

Copyright © 2001 O'Reilly & Associates. All rights reserved.