If you haven't already, perhaps it's time to see if PostgreSQL is the database for you.
When I first began to use Linux for basic Web development, I saw that my three primary tools—Perl, GNU Emacs and Apache—were already included. But, at least one thing was missing, namely a relational database. I had used databases (mostly Sybase) for about a year when I began working with Linux, and knew that I would need a good database server in order to create sophisticated web sites.
I had heard of PostgreSQL and, after learning a bit more about it, decided to install it. Unfortunately, my installation experience was less than pleasant, and I gave up after several hours. Another reason I gave up was that I found another database server, MySQL. MySQL did not have all of the features I wanted in a database, but it was close enough; I implemented many web applications for clients using it.
Things have changed quite a bit in the last five years: MySQL has been rereleased under the GNU General Public License and is slated to include basic support for transactions, while PostgreSQL is now remarkably easy to install and includes a wealth of features and functionality. Both are well known in the Free Software community as powerful programs that can help get your work done.
I still use MySQL for a variety of tasks and expect to continue doing so, but, increasingly, I find that PostgreSQL is a better fit for my needs. This month, we will look at PostgreSQL, starting with its basic features, to create a small web-based application that uses transactions. Along the way, I will try to compare it with MySQL, describing where one product might be better suited than the other.
As of this writing, the latest version of PostgreSQL is 7.0.2, released in the spring of 2000. As with all open-source software, you can download the PostgreSQL source code from the Internet and compile it yourself. My office uses the Red Hat distribution, and I generally prefer to install software with RPMs for easier maintenance. We downloaded the RPMs from the PostgreSQL web site, installed them, and were up and running in almost no time.
Like all modern database systems, PostgreSQL contains a server that can handle connections from multiple clients. Typically, only one computer in a network is designated to be the database server, with the remaining computers configured as clients. (The server is often configured to be a client as well, in order to facilitate debugging and system configuration.) RPMs for the server typically begin with the name “postgresql-server”, while the client RPMs are named “postgresql” followed by the version number.
PostgreSQL clients exist—in source form and as RPMs—for most of the programming languages that people use in designing web applications, including Perl, Python, Java and PHP. If you intend to build any of these from scratch, you will need to install the PostgreSQL development libraries, either from source code or from the RPMs. I compiled Perl and its modules from the source code but, otherwise, took advantage of the RPMs and installed the precompiled binaries.
Like MySQL and many other relational databases, tables within PostgreSQL are grouped into a single “database”, much as files are grouped within a directory. PostgreSQL offers a great deal of flexibility when it comes to security configurations. Databases can allow or deny access based on IP address or user name, while individual tables and other objects allow various levels of access based on the user name. These configurations are performed in the pg_hba.conf file, which was installed by default in /var/lib/pgsql/data on my system.
By default, only one user, called “postgres”, is authorized to create new users or to create new databases. When you first start to use PostgreSQL, you will probably have to use su to change your identity to root, then su again to change your identity to “postgres”. Once you have become the postgres user, you can use the createuser program to create one or more new users. You can then indicate whether they are allowed to create new databases and/or new users.
The psql database client which comes with PostgreSQL is an excellent interactive tool for working with the database. Like the MySQL client program, it uses GNU ReadLine to provide Emacs-compatible keybindings for entering SQL queries directly. psql also provides an extensive number of help commands that begin with backslashes, such as \h (which displays help for any SQL command), \d (which lists available objects and details about those objects), and \l (which lists available databases). I use psql constantly to double-check that program-generated queries worked correctly; it is easy and quick to use.
PostgreSQL implements a large portion of standard SQL and, therefore, is easy to learn if you have previously worked with a relational database. We can create a simple table, as shown in Listing 1.
The serial data type is similar to MySQL's AUTO_INCREMENT tag. It provides us with a unique number each time we insert a new row into the table. serial columns use a PostgreSQL data type called a “sequence” on which the PostgreSQL nextval and currval functions operate. It is also possible to use a sequence directly (see Listing 2).
PostgreSQL distinguishes between single and double quotes, so be sure to say nextval(`people_id') and not nextval("people_id").
Inside of psql, the semicolon is a synonym for \g, meaning “go and execute this query”. Outside of psql, it has no meaning and may even cause an error in your database driver.
PostgreSQL, unlike MySQL, is case insensitive when it comes to table and column names. However, I prefer to follow Joe Celko's capitalization rules for SQL: keywords in ALL CAPS, table names in LeadingCaps, and column names in all_lowercase_with_underscores.
As in MySQL, PostgreSQL allows us to designate which column is the primary key. PostgreSQL also allows for unique columns (and combinations of columns), as well as the creation of indices with the create index statement.
PostgreSQL's data types are slightly different from those in MySQL, but relatively easy to understand if you have worked with another database. varchar and numeric types are supported. Perhaps the most confusing difference between data types in MySQL and PostgreSQL is timestamp. Under MySQL, a timestamp column is automatically set to the value of the latest insert or update. In PostgreSQL, a timestamp column simply contains a date/time value.
PostgreSQL supports many standard SQL functions and many of the extensions that MySQL contains. The difference, of course, is in how they are implemented. For example, MySQL has a regexp function similar like function. PostgreSQL, by contrast, has implemented regexp functionality with the ~ (tilde) operator, perhaps because of Perl's popularity.
PostgreSQL also makes it possible to create new datatypes with the create type function. Indeed, one of PostgreSQL's claims to fame is that it is a hybrid of object-oriented and relational databases. I have never needed to use this functional, but it does seem to be an intriguing and powerful feature.
So far, PostgreSQL does not seem to be very different from MySQL, except in some of its basic syntax. But, it is here that the two databases begin to diverge. PostgreSQL includes what are known as “referential integrity” checks, meaning that I can define certain values in a row cand be defined as illegal.
One of the first things that a database programmer learns is that null indicates that a column contains no value—not even false, the empty string or zero. We can forbid a column to contain “null” by declaring the column to be “not null”. This is perhaps the simplest integrity check, with the database ensuring that the column in question can never contain an illegal value. In our People table above, the name column is defined as not null, telling the database that every person must have a name.
Many times, however, this is not enough. For example, the People table includes an e-mail column. Modern e-mail addresses must contain an @ sign in order to be valid. Using referential integrity, we can ensure that any e-mail address added to the database contain an @ sign. In order to do this, we use PostgreSQL's Perl-like ~ operator, which matches a string with a regular expression:
email VARCHAR(50) NOT NULL CHECK (contact_email ~ '@')
Without an @, an entered (or updated) value will be considered illegal and generate an error code. Having the database flag such an error might seem frustrating, but it is certainly better than having a database with incorrect values. I often use such checks to ensure that columns are not null and to forbid the empty string or other illegal values. For example, it is possible to define the People table with referential integrity checks (see Listing 3).
If I try to insert a row which violates any of these checks, PostgreSQL will refuse to do so (see Listing 4).
So it seems that my value for address2 was invalid. Indeed, I tried to enter an empty string here, which is not allowed. Rather, I should have entered a null value (remember, null and the empty string are distinct values). Sure enough, replacing the empty string with null allows the query to succeed. But the database refused to allow us to corrupt it with invalid information and did not insert the new row.
Referential integrity also means that one table can reliably point to another with “foreign keys”. Most tables have a primary key, meaning a column (or set of columns) that uniquely identifies each row. For example, the databases of the United States Social Security Administration uniquely identify each American citizen with a Social Security Number. This number means that you can change your name, address, phone number, bank accounts and job, but the same SSN will still refer to you. In the same way, a primary key allows us to continue pointing to a particular row in a table without having to depend on any of the values in that row.
For example, let's assume that we want to create an Appointments table containing three columns (see Listing 5).
The table in Listing 5 allows us to indicate when we are meeting with each person, ensuring that the notes column is either null or non-empty, and that only one appointment can take place at a time. The person_id column is supposed to contain the person_id from the People table. However, what stops me from entering a person_id of five or 50? How can I be sure that the value is valid?
The answer is that we can set person_id to be a “foreign key” of People, meaning that People.person_id can only contain a value that is contained in People.person_id. We can add this constraint with the REFERENCES keyword (see Listing 6).
One of the chief complaints that PostgreSQL users level against MySQL is the lack of transactions. If you have only used MySQL in your development work, you may wonder why you need transactions and how they fit into a database environment.
The basic idea behind transactions is that they group multiple queries into a single logical query. If any of the queries in the transaction should fail, the database is “rolled back” to where things were before the transaction started.
Thanks to transactions, you can be sure that a transfer of money from one account to another will not accidentally leave you with too much or too little money, even if the power fails in the middle of the transaction. Until the transaction is finally “committed”, the database pretends that none of it has happened.
The MySQL documentation (and authors and support people) has its own philosophy of transactions, including commit and rollback, which runs counter to the “ACID” test that is prevalent among current relational databases. While I disagree with some of their conclusions, there is no doubt that the lack of conventional transactions in MySQL has made it a flexible and fast database, one that is well suited to web sites that perform many selects and few inserts and updates.
PostgreSQL follows the standard model fairly closely, making it possible to perform transactions without locking tables (as in the MySQL paradigm). To begin a transaction, use the being work statement, and to end one use either the commit or rollback statement. Users of Perl's DBI or of Java's JDBC can instead use the commit and rollback methods associated with the database connection object. Both DBI and JDBC operate by default in AutoCommit mode, meaning that each query is implicitly placed within its own transaction. To put several queries inside of a single transaction, a program must turn off AutoCommit mode, perform the transaction, perform a commit or rollback, and then (usually) turn AutoCommit mode back on. For example, let's assume that we have a separate Salaries table that indicates every employee's salary (see Listing 7).
Notice how the above table ensures that each employee can get only one raise on a given day, by setting a unique restriction on the combination of an employee and day.
We could presumably keep this information in the People table. However, putting salary information in a separate table makes it easier to hide the information from prying eyes. It also means that we can pull up an employee's entire salary history with a SELECT statement, while keeping the tables normalized and storing information about each person only once.
Having two tables for a single employee raises some issues. Most significantly, we want to be sure that any employee added to the People table will also be added to the Salaries table (it would be rather embarrassing to have an employee without any salary). Adding a new employee should be one logical operation but will require two insert statements—one into People and the other into Salaries. What happens if the database dies in the middle of the second statement?
Listing 8 contains a simple command line program (which could easily be turned into a CGI program) that creates a new employee, first adding a new row into the People table, and then adding a corresponding row to the Salaries table. We retrieve the person_id of the newly inserted employee using PostgreSQL's currval function. We then use that value to INSERT a row into the Salaries table.
We ensure that the two operations occur within a single transaction by turning AutoCommit mode off (setting it to 0). Once this happens, we are responsible for performing a commit when we are done. Without a call to $dbh->commit, PostgreSQL will assume that we want to roll back both of the insert operations, pretending that they never happened. If the program dies in the middle—which will happen if any of the SQL queries fails, since we have activated RaiseError—none of the changes will occur.
To trap the error and display a message to the user, we can use the block version of Perl's eval function, demonstrated in Listing 9. This runs the code inside of the {}, exiting from it, and setting the special variable $@ if anything goes wrong. This technique of using eval to trap errors gives us a basic form of exception handling and enables us to print out only the errors we want. If we were to activate either the PrintError attribute or the “use diagnostics” pragma, Perl would print out more than just our simple message, confusing the user.
We have now defined three tables: People, Appointments and Salaries. What happens if I want to create a table listing all appointments scheduled with people, along with their salary histories? (This would be useful to have before annual salary meetings.) The table in Listing 11 will perform such a request, listing appointments in chronological order.
Rather than having to create this query from scratch each time, we can create it as a “view”. Views are dynamically generated tables with names attached that can largely be treated as read-only tables. We can create a view of appointments and salary information, as shown in Listing 10.
Notice how there is almost no difference between the generic select statement and the select used to create a view. Indeed, the only change that we made was in dropping the order by clause, because PostgreSQL has not implemented this functionality as of version 7.0.2. So, we can list all appointments in chronological order (see Listing 11).
Views have a number of advantages over simple select statements:
They force more of the processing to take place on the database server. Since database servers are typically high-end machines, and because they have ready access to the data, they end up doing more of the work. The client spends less time creating dynamically generated SQL statements.
Views have their own permission structure. For example, the personnel department at a company, but no one else, should have access to salary information. With views, it is possible to hide information and allow particular users or IP addresses to access the base table, but keep the view open to the general public.
Perhaps most importantly, views let us think at a higher level of abstraction than basic tables. Views can perform a variety of calculations and manipulations on the values from our table, just as a simple select can. If you know that you will have to multiply all of the values in a particular column by three, you can create a new view which automatically performs the calculation. You no longer need to perform the calculation in the select statement, nor in the program that retrieves the values.
Views cannot take variable arguments, meaning that you cannot create a view that sometimes retrieves user names beginning with “A” and sometimes retrieves those that begin with “B”. To perform such an operation, you will need to create a procedure. PostgreSQL supports a variety of programming languages in which procedures can be written, including Pl/PGsql (which is similar to Oracle's PL/SQL language), PL-perl (procedures written in Perl), and Pl-tcl (procedures written in Tcl).
Once you have created procedures, you can then create “triggers”. A trigger is a procedure automatically executed when something happens in the system. For example, you can use a trigger to ensure that when a user is deleted from the People table, rows in the Salaries and Appointments tables refer to that user are also deleted. Without this, it is possible for a row in Salaries or Appointments to refer to a person_id that no longer exists. Triggers can be activated whenever someone performs an insert, update, or delete on a table, and can operate either before or after the action occurs.
Finally, views are normally read-only objects, since they are simply aliases for select queries. However, PostgreSQL has a sophisticated rule system that makes it possible to rewrite queries that fit certain criteria. Using rules, you can intercept an insert, update or delete that is targeted at a view, and rewrite it as a series of operations on one or more tables. Thus, an insert into ApptAndSalaryView could not be rewritten, since the user's e-mail address (from People) does not appear. However, an update would certainly make sense, and could be rerouted to modify the People, Appointments or Salaries tables as necessary.
While PostgreSQL is rapidly growing in popularity, it does have problems. The PostgreSQL development team is well aware of these problems and seems to be dealing with them quickly.
The most pressing issue is probably the 8KB limitation on each tuple, or database row. This means that no row can contain more than 8KB of data. This affects many parts of the database's operation, from providing only moderate support for BLOBs (binary large objects) to preventing developers from creating even moderate-sized tables.
Other issues, such as the combination of views with unions, bit me (and my clients) on a recent programming project. This, combined with the lack of outer joins, has meant a number of workarounds in some recent projects. It is possible to get around most or all of these at the application level, but I am anxiously awaiting the addition of these features.
Working with MySQL has spoiled me somewhat, since the number of built-in functions is large and allows me to create database applications without creating my own functions. PostgreSQL has fewer built-in functions but, as we saw earlier, does allow me to create any that I might like, in a variety of programming languages. However, the installation and use of these languages is poorly documented; while they might be very powerful, it takes a while to get started with them.
Despite some benchmarks that were recently released by a PostgreSQL consulting group, it's safe to say that PostgreSQL is slower than MySQL. At the same time, I was pleasantly surprised to discover that the speed difference was not as great as I expected. Of course, this speed difference exists because PostgreSQL includes transactions and referential integrity, both of which require more processing and record keeping than MySQL's table-level locks.
A final drawback to PostgreSQL is that not as many web-hosting services offer it. This may be unimportant when working with dedicated servers, but some of my clients have the budget to only rent a virtual server. Database capabilities should be a consideration when looking for a web server, but, in my experience, developers often have less say in the tools used for a project than they might like. Still, if you are interested in something closer to a commercial database, including transactions and integrity constraints, PostgreSQL is your best choice.
While I continue to use MySQL, I am increasingly impressed by PostgreSQL and have begun to use it for a number of consulting projects. I have been impressed by its speed and versatility, as well as its future direction. The improvements between version 6.x and 7.x were staggering, and I look forward to seeing more!
If you are implementing database applications—including web/database applications—under Linux, I encourage you to take a look at PostgreSQL. Even if you decide to stick with MySQL, it is useful to understand how other databases work and why there is such a fuss about transactions, stored procedures and integrity constraints in the community of database programmers. And who knows? Maybe you will also find that PostgreSQL is better-suited to your applications than you think. One of the beautiful things about free software is that you can choose the tools that are best for your needs, and learning about PostgreSQL is a great step in that direction.