This month we learn additional ways to retrieve data from a relational database and ways to divide our data into multiple tables for maximum efficiency.
In the last installment, we looked at ways to use a relational database server from within CGI programs. Relational databases are a great help, since they provide us with a relatively simple means of storing and retrieving information.
Everything in a relational database is stored in two-dimensional tables, with each row representing a record and each column a field within that record. When we use ASCII text files to store our information, we generally have to write our own routines for saving and retrieving that information; by structuring our information in tables, we can save ourselves quite a bit of time and energy, as well as ensure that our data will not be corrupted when multiple instances of our program attempt to modify the information.
Communication with a relational database is carried out using the Structured Query Language (SQL). Because SQL is not a programming language, we have to embed our SQL queries inside of a C or Perl program if we want to execute them.
Last month, we looked at a set of CGI programs which allowed visitors to our web site to send and receive web-based postcards. Users entered information about themselves and their postcards' recipients, along with the name of a graphic and a short text message. Each postcard was stored in the postcards table, with each row in the table representing a single postcard. By giving each row a unique ID number, our program was able to retrieve individual postcards from the database.
Indeed, the program send-postcard.pl that we examined last month demonstrated how it is possible to take inputs from an HTML form and turn them into a postcard stored in our database, as well as how to take that unique ID number and use it to display a postcard with show-postcard.pl. The unique ID number is generated by send-postcard.pl when it creates a new row in the postcards table and ensures that users can retrieve their own postcards while making it relatively difficult to find those addressed to other people.
The programs we are examining are written in a portable enough version of SQL that they should run on just about any relational database server. However, I have only tested these programs using MySQL, a nice, small relational database product for several versions of Unix, including Linux. You can find out more about MySQL on the Web at http://www.tcx.se/.
One of the problems with that version of send-postcard.pl was that it asked the user to name a graphic file to be inserted into the postcard she was sending. While this sounds like a nice idea, there were several problems with it, most notably the fact that the user could name a file that did not exist on the server. Moreover, there was no way for the user to know which graphics did exist on the server, short of writing another program that would present a directory listing.
Additionally, storing the file names for the graphics alongside the other postcard information is inefficient and can lead to synchronization problems. After all, what happens if you decide to change the name of a file from foo.gif to bar.gif? Under our current model, we would have to go through the entire “postcards” table and rename each row in which foo.gif appeared. This is not terribly difficult to do, but it means we have to think carefully before making such a change.
Finally, storing the names of the graphics files introduces the possibility of typographical errors. A user might inadvertently type fooo.gif, and thus prevent the postcard's recipient from seeing the desired graphic.
In order to solve all of these problems, we will move the names of the graphics files to a separate table, giving each graphic a unique ID number, much as we did for each postcard. With this system in place, we can then refer to graphics by number in the postcards table. We can still refer to an infinite number of graphics in this way, but we do so by using an integer, rather than a string, saving storage space on our file system. In addition, such consolidation allows us to change one or more of the graphics file name(s) or location(s) without having to worry about ruining our program's output.
We can retrieve one or more rows from a table in our database using the SQL SELECT command, which returns all of the rows matching our selection criteria. Thus, if we are interested in retrieving the row with the ID number of 12345 from the postcards table, we can do so by sending the query:
select * from postcards where id_number = 12345
This will return a small table, the rows of which all have an id_number equal to “12345”. Since id_number is guaranteed to be unique, we can expect that the table returned by the SELECT command will consist of a single row and all of the columns contained in the postcards table. If id_number was not guaranteed to be unique, then SELECT might well return a number of rows from the table, rather than only one.
We can wrap SQL queries inside of programs, and CGI programs are full-fledged programs; thus, we can easily write a CGI program that:
takes an argument from a user,
uses that argument in the SELECT statement sent to the database server, and
gives us a new SELECT statement each time we run the program.
Let's create a new table, in which the first column uniquely identifies the graphic, and the second column names the graphic. I do this using the interactive mysql program that comes with MySQL, which I enter by typing mysql<\!s>test. To create the new table, enter the following lines:
create table graphics ( graphic_id mediumint not null primary key, graphic_file varchar(60) not null)
The interactive mysql program indicates that the command was executed successfully by giving us the following message:
Query OK, 0 rows affected (1.13 sec)We have now created a table with two columns. The first, graphic_id, stores medium-sized integers, cannot be null and must be unique (“primary key” in SQL lingo). The second column stores a variable-length string containing up to 60 characters and must be non-null.
We can now insert a graphic into the graphics table with the following command:
INSERT INTO graphics (graphic_id,graphic_file) VALUES (12345, foo.gif);
Now if we ask for the contents of the graphics table, we get the following:
mysql> select * from graphics; 1 rows in set (0.04 sec) +------------+--------------+ | graphic_id | graphic_file | +------------+--------------+ | 12345 | foo.gif | +------------+--------------+We have now proven it is possible to store a record of a particular graphics files, as well as to retrieve the name of that file based on a unique ID number. Now comes the hard part—making an association between the ID and the graphics file name.
First of all, we need to modify the definition of our postcards table, such that it now expects to get the ID number of a graphic, rather than the graphics file name. To create the postcards table, use the following SQL command:
create table postcards ( id_number mediumint not null primary key, sender_name varchar(60) not null, sender_email varchar(50) not null, recipient_name varchar(60) not null, recipient_email varchar(50) not null, graphic_id mediumint null, postcard_text blob);
Notice that we have removed the graphic_name column from the table we defined last month and replaced it with a column named graphic_id—which has the same name, size and type in both of the tables. We don't need to give the columns identical names, but doing so makes it easier for us to keep track of things.
The most straightforward way to replace the old graphic_name column with the new graphic_id column is to create the table anew, as demonstrated with the above SQL command. However, this action results in the loss of any data already in our table. In the beginning of a project that uses a database, creating and destroying tables is quite normal—at least until you get the definition right. After a table has been used for a while, you probably don't wish to destroy your data.
To solve this problem, we simply add our new column (graphic_id) to the postcards table using the alter table command, which allows us to add one or more columns to an existing table. This is not the most efficient way to design a database, but it eliminates worry about programs which might be dependent on graphic_name.
Our database now contains two tables: “postcards”, with all of the information required to assemble a complete postcard (including the ID number of the graphic we wish to display), and “graphics”, which maps those ID numbers to the names.
How do we create an association between the two? We already know that it is possible to use SQL's SELECT command to retrieve one or more columns from a table. SELECT can also be used to retrieve columns from multiple tables and can even join columns together from them.
For instance, let's get the sender name, recipient address and graphic name from our two tables, such that the graphic name for each postcard is displayed immediately next to the unique ID and recipient address.
First, let's add another graphic file to the graphics table, just for good measure:
mysql> INSERT INTO graphics ( graphic_id,graphic_file) VALUES (67890, bar.gif); Query OK, 1 rows affected (0.00 sec) mysql> select * from graphics; 2 rows in set (0.16 sec) +------------+--------------+ | graphic_id | graphic_file | +------------+--------------+ | 12345 | foo.gif | | 67890 | bar.gif | +------------+--------------+
Let's now create a postcard to use for our example:
insert into postcards (id_number, sender_name, sender_email, recipient_name, recipient_email, postcard_text, graphic_id) VALUES (99999, "Bill Clinton", "president@whitehouse.gov", "Al Gore", "vice.president@whitehouse.gov", "Please call. I have a new tax idea.", 12345)Notice that because they are integers, neither the postcard ID number nor the graphic ID number is surrounded by quotation marks. Also, notice how the initial list of columns in our INSERT command includes graphic_id, but leaves out graphic_text. Since graphic_text in postcards can contain null values, we can ignore it when inserting new rows into the table.
Now let's retrieve the sender name, recipient name and graphic file name for the postcard we just created:
mysql> select postcards.sender_name, postcards.recipient_name,graphics.graphic_file -> from postcards,graphics -> where postcards.id_number = 99999; 2 rows in set (0.31 sec) +--------------+----------------+--------------+ | sender_name | recipient_name | graphic_file | +--------------+----------------+--------------+ | Bill Clinton | Al Gore | foo.gif | | Bill Clinton | Al Gore | bar.gif | +--------------+----------------+--------------+
Egad—this isn't what we wanted at all. We wanted it to bring up the file name for the ID we specified, not the combination of all rows and columns in the “graphics” table with our postcard. That's what we wanted to do, but that's not what we told the computer to do. By formulating our SQL query as above, we inadvertently asked for all possible combinations of rows from postcards with rows from graphics, this combination is known in database circles as the “Cartesian product” of the two. While asking for a Cartesian product is not an error, it is almost always undesirable. When dealing with especially large tables, asking for such a combination of tables can result in a long, unnecessary computation that ties up the database server keeping it from performing other tasks.
How can we modify our database query so that it does what we originally wanted, namely, giving us the graphics file name in place of its ID number in the postcards table? The simplest way is to set up a restriction between the two tables, adding to the WHERE clause in the query, as follows:
mysql> select postcards.sender_name, postcards.recipient_name,graphics.graphic_file -> from postcards,graphics -> where postcards.id_number = 99999 -> and postcards.graphic_id = graphics.graphic_id; 1 rows in set (0.10 sec) +--------------+----------------+--------------+ | sender_name | recipient_name | graphic_file | +--------------+----------------+--------------+ | Bill Clinton | Al Gore | foo.gif | +--------------+----------------+--------------+
Now that's more like it. By requiring the equivalence between the graphic_id columns in both tables, we retrieved the information as if it came from a single table. That's part of the magic of SQL and relational databases. By combining tables in this way, you can make your data easier to handle by putting it in separate tables. However, when you retrieve the information, no one knows it came from separate tables, since a new, temporary table is returned to the caller.
This may seem like a silly example, but imagine a corporation with very strict pay scales whose payroll is handled by a relational database. If you give every employee an ID number indicating salary, you can give everyone a raise (or a reduction, depending on the company's financial state) by updating records in a salary table, rather than the table of employees. The next time you perform a join between the employee and salary tables, the new salary will be reflected automatically.
Now that we have seen how to get our queries to work at the SQL level, let's think about the necessary steps needed to integrate these queries into some CGI programs. For the most part, our CGI programs do not need many changes. We need to modify send-postcard.pl so that it inserts the graphic ID into the postcards table, rather than the graphics file name, and show-postcard.pl needs to use the SQL query that we formulated above in order to get the graphic file name from the graphics table in addition to the information in the postcards table. The revised versions of the code are not completely reprinted this month. These two listings along with the listing that is printed are available by anonymous download in the file ftp.linuxjournal.com/pub/lj/listings/issue43/2508.tgz.
First, we'll look at the revised version of show-postcard.pl. The only change to be made to the listing printed last month is in the SQL query, which now reflects the new table:
my $command = ""; $command = "select postcards.sender_name,"; $command .= "postcards.sender_email,"; $command .= "postcards.recipient_name,"; $command .= "graphics.graphic_file,"; $command .= "postcards.postcard_text from "; $command .= "postcards,graphics "; $command .= "where id_number = $id"; $command .= "and postcards.graphic_id = "; $command .= "graphics.graphic_id";
Only this one change is necessary, because of the way in which we wrote the original version of show-postcard.pl. By contrast, imagine how much code we would have needed to rewrite if we had initially stored the information in a single ASCII text file, and then split the information between two files.
Our modifications to send-postcard.pl is almost as easy. We need to add the definition of $graphic_id, rather than $graphic_name, at the top of the file:
my $graphic_id = $query->param("graphic_id");
When we insert the postcard into the postcards table, we must modify the code so it uses the graphic_id column and variable, rather than graphic_name:
$command = "insert into postcards "; $command .= " (id_number, sender_name, "; $command .= " (sender_email, recipient_name, "; $command .= " recipient_email, graphic_id, "; $command .= " postcard_text) "; $command .= "values "; $command .= " ($id_number, \"$sender_name\", "; $command .= " \"$sender_email\", "; $command .= " \"$recipient_name\", "; $command .= " \"$recipient_email\", "; $command .= " \"$graphic_id\", "; $command .= " \"$postcard_text\") ";With those modifications in place, we are done. Now our code will work just fine with the new table, storing and retrieving graphics according to their ID.
There is one remaining problem with this version of the code. How is a visitor to our site supposed to know or remember the ID numbers for the various graphics that are available? We could modify the HTML form to provide this information, but it seems a bit silly for us to do so, since we would then have to update the form each time we updated the table.
The simplest solution is to write a small CGI program that produces the HTML form, inserting the values as appropriate. There are a number of different ways to allow the user to choose, but I decided when writing this program to take a relatively easy path by using radio buttons. A more aesthetically minded programmer (or one who expected to have a lot of graphics files) may have chosen a selection list, but that's a side issue. The resulting program, postcard-form.pl, is shown in Listing 1.
That about does it for our postcard-sending problem. There are, of course, many other ways in which this set of programs could be extended or modified. For example, it might be a good idea to create a CGI program that would allow us to enter and edit the file names in the graphics table, so that we would not have to use the interactive mysql program for such modifications. Currently, only someone knowledgeable in SQL can add, modify and delete elements in the graphics table. We could also ensure that the ID numbers in the graphics table are given sequentially; some relational database vendors provide that facility, allowing for “identity” columns that automatically increment as new rows are added.
It would also be nice to allow users to preview the graphics they place on the postcards, or at least describe the pictures rather than just presenting the users with file names. This option might require storing two versions of each graphic or adding another column to the graphics table that would be used for descriptions or previews.
The possibilities, as you can tell, are unlimited—and this is a relatively small project.
This article ends our whirlwind tour of SQL, although future columns will undoubtedly continue to use relational databases as a means for storing information. Next month, though, we will look at the efficiency of our CGI programs, including the “CGI lite” module for Perl.