An introduction to building a database using SQL in Perl and CGI.
In my work, I often find myself writing CGI programs that need to read or write information on the file system attached to a web server. Sometimes, this information is fairly simple, throwaway stuff, such as logging information accrued when I am trying to debug a particularly difficult program.
Sometimes, as we saw in a series of columns earlier this year, we can use text files for the storage and retrieval of structured information, such as the questions and answers for a multiple-choice quiz. Those quizzes were stored in a simple format, with each question placed on a line by itself. For example, here is a line that might have come from one of those quizzes:
What color was George Washington's white horse? White Black Gray Pink a
While the mechanics of magazine publishing mean that you cannot see the difference between various whitespace characters, the above line is separated into six fields: The question text, the four answers presented to the user, and a letter (a, b, c or d) indicating which of the four answers is correct. Fields are separated by Tab characters (ASCII 9), which look identical to space characters (ASCII 32), but which are quite different as far as the computer is concerned.
The quiz programs we explored earlier this year expected to read from files containing one or more such lines, with each line representing a single question. A quiz containing a single question (for users who prefer easy challenges) would have one line, while a quiz containing 1000 questions would contain 1000 lines.
This raises the important issue of scalability, the software's ability to remain efficient even when data sets become quite large. It is not difficult to write programs that can handle small amounts of data efficiently, particularly as hardware continues to drop in price while increasing in performance. It is much harder, though, to write software that can handle large amounts of data.
ASCII text files are wonderful when dealing with small amounts of data, since they are easy to manipulate from within programs, particularly when using Perl, which is strong in handling regular expressions. But when we must work with a large amount of data, or when we want to perform sophisticated searches, we may find ourselves reinventing the wheel or working with tools (such as ASCII text files) that no longer fit our needs.
A common solution to this problem is to off-load the data storage and retrieval to a program known as a relational database server. The “server” part of the name indicates that it expects to receive requests from one or more clients, and the “database” part of the name indicates its storage and retrieval of information on behalf of those clients. You may, however, be unfamiliar with the “relational” part of the name, which means that data is stored in sets of tables, which we can access using SQL, the Structured Query Language. This month, we take a first look at SQL queries, including how they can be integrated into our CGI programs; in the coming months, we will explore this topic in greater depth, using relational database servers for a variety of projects.
SQL is an international standard to which many corporate databases adhere. While the “L” in SQL stands for “language”, it does not mean that you can write programs in SQL. Rather, SQL is a language for formulating queries to database servers. The SQL commands must be incorporated into programs written in a true programming language, such as Perl or C.
Relational databases work on the client-server model, just as the Web does. Whereas web clients and servers communicate using HTTP, database clients and servers communicate with SQL. Needless to say, SQL is much more complicated than HTTP, although as you will see, it is fairly straightforward to learn. SQL may be easy to learn, but that does not mean it is simple. On the contrary, long-time database administrators and programmers understand more about the storage and retrieval of data using SQL than I could ever imagine.
The key to understanding SQL is to realize that everything in an SQL database is stored in a table. Rows in the table represent table records, while columns represent fields. Thus, we could represent an address book as a table.
Name Telephone ---- --------- Reuven 04-824-2265 Andy 02-123-4567 Gil 04-999-8888
There are three records in this table, each represented by a row. Each record contains two fields, each represented by a column. Each table and column must have a name, so we will call this the “phone_book” table, with two columns, “name” and “telephone”.
So far, this might not seem like a great advance over what we have done with text files. Why bother with rows, columns, and tables when we can use an ASCII file?
The simple answer is that we can allow the database server to do the work for us—and it will return an answer to us very quickly, as per our instructions, without getting bogged down by the number of records in the database. If we were interested in finding Andy's telephone number with a text file version of the above table, we would need to iterate through the entire file, checking each record for a match. With a relational database, we can issue an SQL query to the database server, asking for only those rows which match our particular criteria.
Thus, if we were interested in retrieving Andy's telephone number from the table above, we could use an SQL select command to do so:
select telephone from phone_book where name = "Andy";
The SQL statement above asks the database server to return the telephone column from the table named phone_book, for each row in which the name is Andy. If a single row matches the query, we receive a single row as a response from the database server, but if multiple rows match, we receive all of those rows. If no row matches our query, we receive no rows, which might seem odd, until you realize that database client programs often iterate over the results returned to them. Iterating over no values is as easy as iterating over 100 values, although most good client programs check to make sure that at least one row was returned.
We can insert a row into our table with the following:
insert into phone_book (name,telephone) values ("Iris","04-999-8888");
After performing the operation above, our table looks like:
Name Telephone---- ---------Reuven 04-824-2265Andy 02-123-4567Gil 04-999-8888Iris 04-999-8888
which we can see by retrieving everything, using an asterisk to mean “all columns”:
select * from phone_book;
If we want to retrieve all of the rows belonging to people with the telephone number 04-999-8888, we use this line:
select name from phone_book where telephone = "04-999-8888";Note that we do not need to worry about two identical records, since relational databases strictly require each record to be unique in some way. Two rows might differ in only a single column, but that column is enough to make the rows distinct.
One advantage, then, of using SQL and a relational database server is the increased efficiency, both of our programs (which no longer need to read the entire contents of a text file) and ourselves (since we no longer need to write matching engines and define data formats). There are other advantages to using SQL and relational databases too; most importantly, database servers handle file locking in a sophisticated and efficient way, ensuring that data is not lost while keeping operations moving along quickly.
Relational databases also offer an amazing array of optimization techniques and security levels, among other things. And best of all, SQL is a portable standard that (mostly) works in the same way on a great many database systems; that is, once you learn how to write some basic SQL queries, you will be able to store your data on just about any available platform.
Most of my SQL experience is with Sybase on Solaris systems, but for the purposes of this article, I decided the time had come to install a relational database server on my Linux machine (running Red Hat 4.0 with a number of updated packages, including the 2.0.30 kernel). I decided to download MySQL, a database server that looked small but powerful, and which came in RPM (Red Hat Package Manager) format, allowing me to install it quickly. (Don't confuse MySQL with mSQL, another relational database package available for Linux. For information on how to obtain MySQL, see the sidebar accompanying this article.)
MySQL comes with a client program named, oddly enough, mysql, which allows us to enter SQL queries directly to the database server, which is presumably running at all times. We enter the database with:
[1016] ~% mysql test Welcome to the mysql monitor. Commands end with ; or \g. Type 'help' for help. mysql>
Just as file systems store files within subdirectories within directories, relational databases store tables inside of databases inside of the overall structure. Thus, when we enter MySQL, we need to specify the name of the database we would like to use. In example above, we specified the test database, to which all users have access without needing to go through the standard procedure of entering a user name and password. While user names and passwords for relational databases can be the same as those for the user's account on the system, they do not need to be. Indeed, for the sake of system security, you should make them distinct from your regular system passwords.
Generally speaking, it is also a good idea to create one or more databases exclusively for CGI programs, in order to avoid giving programs complete access to all databases on the system. The nature of CGI programming is such that users might be able to read the user name and password from the program's source code, thus giving them access to whatever tables are in a given database. However, in the interest of time and space, I encourage you to read the MySQL documentation, which describes how to set user permissions for various databases on the system In the meantime, we will use the test database, to which all users have access, for our examples
To create our telephone directory table, we type:
mysql> create table phone_book (name char(255), telephone char(255));
Whitespace is unimportant in SQL queries. In the above example, I pressed enter between the end of the first line and the go statement on the second line. As you might expect, the go command tells a database client to send the query to the database server, where it is evaluated and executed. Alternatively, we can use a semicolon at the end of our query, which will preclude the need for go.
The server responds to our query by giving us some statistics:
Query OK, 0 rows affected (0.27 sec)
In other words, creating a table took .27 seconds and did not affect any existing rows.
You can quit mysql by typing quit at the mysql> prompt.
The MySQL programmatic interface from Perl works in much the same way as the command-line program, except that it uses Perl 5 objects. The basic idea is straightforward; we create an instance of a MySQL object, and then use that object to get through the process of logging in, sending queries, and interpreting the results.
Listing 1 contains a functional program that can query our phone_book table and return the results. More importantly, though, that program is the skeleton for every program we write using MySQL. While the syntax might be slightly different for Sybase and other databases, the general idea is the same—connect to the database server, choose a database, send a query in SQL and iterate through whatever results are returned.
First, we connect to the database server using Unix sockets, in part because MySQL enables those sockets by default, which makes for an easier explanation in a short column such as this one. You can, of course, also connect to a database server running elsewhere on the network, just as a web browser can connect to a web server across a network.
Once we are connected to the MySQL server, we use the query method to enter our SQL query. Just as connecting to the database returns the database handle $dbh, sending an SQL query returns the statement handle $sth. And just as we need to use $dbh in order to send a statement, we need to use $sth in order to retrieve results. In this particular statement, we have asked to see both of the table's columns, as well as all of the rows in the table. However, we could restrict our query with a where clause, as described earlier, which would return a subset of the table's rows. We could also ask for a subset of the table's columns, such that only the name or the telephone number would be returned.
Results are retrieved by iterating over the rows that were returned from the server. If no rows match our query, the iteration is not performed; if 100 rows match our query, it is performed 100 times. If we are interested in maximizing the efficiency of our programs that handle SQL queries, it is in our interest to construct queries that return only those rows that most interest us, since iterating through a large number of rows can be quite inefficient and time-consuming.
If I run the program in Listing 1 (named sql-test.pl on my system) from the command line, I get:
[1031] ~/Text/LJ% ./sql-test.pl Iris 04-999-8888 Reuven 04-824-2265 Andy 02-123-4567 Gil 04-999-8888
We can, of course, use the above skeleton program to insert rows, create tables and do more complicated things, such as joining tables together (which is, to a large degree, the magic behind SQL) and order results in ascending or descending order. If we were to keep the area code in a different column from the telephone number itself, we could refine our searches even further, asking for all people within a given area code whose first name is Iris, for example.
Now that we have seen some basic uses of MySQL from within Perl, let's spend some time thinking about how we can integrate the use of MySQL into a CGI program. While this might seem like overkill for some small jobs, database servers are so much more reliable and efficient at this sort of task than our CGI programs that it is almost always worth using such a server, assuming one is available.
By using a database server, we can be sure that our data is stored more reliably than with text files. As an added bonus, the information is available using SQL, which is more efficient and flexible than text files.
How can we use a database server from within our CGI programs? The simple answer is that it is actually no different from connecting to a database server from within non-CGI programs. We still create the Mysql object, use its methods to send an SQL query and retrieve results. The differences are in our ability to modify our query based on input sent to us in an HTML form and the necessity of sending our output to the user's browser using a recognized content type (usually HTML). Such a program, which I have called cgi-sql-test.pl is shown in Listing 2.
While cgi-sql-test.pl is longer than the program on which it is based, it is not much more complicated.
First, we fire up the CGI module for Perl, which you can get via the Comprehensive Perl Archive Network (CPAN) at http://www.perl.com/CPAN. After creating an instance of CGI, we send an HTTP Content-type header to the user's browser indicating that we will be returning results of type text/html, i.e., HTML-formatted text.
Following our initialization of the CGI environment, we go ahead with what we had done in the non-CGI version of the program, namely connecting to the database, sending our query and retrieving the results.
This is where the big difference lies. Rather than printing the results to standard output, we send them in HTML format to the user's browser, so that we can use all sorts of nifty HTML formatting techniques to display the results.
In this particular example, I decided to put the results of the telephone list in an HTML table, which is attractive and makes it easy to understand the results. The <tr> tag introduces a table row, while the <td> tag introduces a column within a row. Because each iteration through the while loop represents a new record in the database, we can start a new HTML row at the top of each loop, ending it at the bottom of each loop.
We will continue to explore the interaction between SQL and CGI in the next few installments, but before I conclude this month's column, I want to show at least one example of how we can modify the SQL queries based on the user's input. For the sake of simplicity, we modify our program such that it will ask the database server to return only those rows whose name column matches what we enter in the query string. Thus, if we are interested in finding out Gil's telephone number, we can go to:
/cgi-bin/cgi-sql-test.pl?Gil
And if we are interested in finding out Andy's telephone number, we can go to:
/cgi-bin/cgi-sql-test.pl?Andywhich produces only that listing.
But what happens if someone invokes our program without entering a name in the query string? Well, our program cleverly notices it and produces a very small page of HTML in response. This small page of HTML asks the user to enter a name for which to search and then uses the <isindex> tag to create a text field in the page of HTML.
The <isindex> tag has generally fallen out of favor, since HTML forms are more flexible and useful. When a user enters information into an <isindex> field and presses enter, the URL in which the <isindex> tag appeared is reloaded—with the user's input appended as part of the query string.
Thus, if our program receives no input in the query string, it produces a page containing <isindex>. Whatever the user enters in that text field causes our program to be reloaded, this time with a value in the query string. That value is picked up by our program and passed to MySQL, which returns the results in an HTML table.
That concludes the basic introduction regarding the integration of SQL and CGI programs. As you might imagine, SQL databases are far more powerful than the programs and databases we have seen this month. Over the next few months, we will spend some more time looking at different ways in which we can use MySQL (and relational database servers in general) to make for more interesting, efficient and useful web sites.