Perform powerful statistical analysis by embedding the R language in PostgreSQL.
I took two introductory statistics classes in graduate school and found that I really liked the subject. It wasn't always intuitive, but it always was interesting, and it really helped me to put research, polling and many newspaper stories in a new light. I don't do statistical analysis every day, but it's a powerful tool for organizing and working with large sets of data, and for finding correlations among seemingly disparate pieces of information.
For the courses I took, my university told me to buy SPSS, a commercial program that helps you perform statistical calculations. Being an open-source kind of guy, I discovered R—a programming language aimed at helping people solve problems involving calculations and statistics. R is a full-fledged language, and it theoretically can be used in a wide variety of situations. But, it was designed primarily for use in mathematics and statistical work, and that's where it really shines.
I managed to get through the class just fine using R instead of SPSS. The quality of R's documentation and the intuitive feel of the language, especially for someone experienced with Ruby and Python, meant that when my instructors demonstrated how to do something in SPSS, I managed to find the appropriate parallel in R, and even get a result before they had finished their explanation.
I have continued to use R during the past few years, both as I've progressed with my dissertation research and even on some client projects. I've used R to analyze data from text files (typically in CSV format), and I've used R to analyze data from databases, using the client packages available for various databases.
Perhaps the most intriguing place I've recently seen R, and where I've started to experiment with it in my own work, is inside PostgreSQL. PostgreSQL, long my favorite relational database, has for many years allowed for the creation of user-defined functions, similar to “stored procedures” in other databases. Whereas most databases provide a single language in which people can write their functions, PostgreSQL makes it possible to connect nearly any language to the database. By default, PostgreSQL lets you write server-side functions in Pl/PgSQL, and there long has been support for Pl/Perl and Pl/Python.
Since 2003, developer Joe Conway has maintained PL/R, allowing you to write server-side functions in PostgreSQL, using R. If you ever have used R, the possibility of using such a powerful statistics package inside your database should seem like a natural and powerful combination. Rather than having to read the data into R outside PostgreSQL, you suddenly can have R work directly on the results of queries, without needing to use a separate client application or process.
In this article, I introduce the basics of PL/R. This combination isn't for everyone, but with the growing (and welcome) popularity of PostgreSQL among Web developers, and with the increasing need for analysis of information gathered from Web users, it seems to me that PL/R could be an important tool for many developers.
The home page for R is r-project.org. From that site, you can download versions of R for a variety of operating systems, including Linux. (I was able to install R on my server running Ubuntu with apt-get install r-base-core, which installed a large number of dependent packages.) New versions of R come out every few months and normally are installed in two versions: the R language and environment and the runtime necessary to execute programs written in R, known as “Rscript”. To use the language interactively, just type R at the shell prompt.
R is an interpreted, dynamic language. It has some object-oriented features as well. At the core of R is the vector data type, which can contain any number of elements of the same type (known as “mode” within R). Even items that are seemingly scalar values actually are vectors of length 1. Thus, you don't really have integers in R, but rather one-element vectors of type integer. You can create multi-element vectors with the c() function, handing it the values you would like it to contain:
> c(1,5,9,3) [1] 1 5 9 3
This returns a vector, but now the value is lost. If you want to capture it, you must do so in a variable, using the assignment operator <-:
> x <- c(1,5,9,3)
Note that <- is the standard assignment operator in R. You can, in many places, use the more traditional = operator instead, but it is frequently recommended in the documentation that you use <- to avoid problems in certain circumstances. (I must admit, I've never experienced any problems, although I do try to use <- whenever possible.)
The beauty of a vector is that mathematical operators are applied to all of its elements. Thus:
> x + 5 [1] 6 10 14 8 > x * 3.5 [1] 3.5 17.5 31.5 10.5 > y <- x * 8.6 > y [1] 8.6 43.0 77.4 25.8
If you're wondering what the [1] means on the left side of the vector output, that's an indication of the starting index of the data you're looking at. R, like FORTRAN, but unlike most other languages I use, uses 1 as the index of the first element of an array. Thus, if I want to retrieve the value 43 from y, I need to use index 2:
> y[2] [1] 43
Note that retrieving the element at index 2 doesn't give me a scalar value, but rather a one-element vector.
Vectors are nice, but they can hold only one data type at a time. Because R is dynamically typed, it cannot stop you from entering inappropriate values. Rather, it'll cast all of the values to the best possible common type, if they're different. So if you say:
> x <- c(1,2,"abc",3) > x [1] "1" "2" "abc" "3"
notice how all the values in this vector have been turned into strings, in order to ensure that the vector's mode is of type “character”.
R allows you to create multidimensional vectors, known as matrices. To create a matrix, just invoke the matrix() function, giving it a vector as a first parameter and either the nrow parameter or the ncol parameter:
> m <- matrix(c(1,2,3,4,5,6), nrow=2) > m [,1] [,2] [,3] [1,] 1 3 5 [2,] 2 4 6
If you want to grab the first column—a vector, of course—you can do so:
> m[,1] [1] 1 2
You also can grab the second row:
> m[2,] [1] 2 4 6
Similar to a matrix, but containing vectors of different types, is a “data frame”. When you create a data frame, you assign a name (given as the parameter name) to a vector. The vectors then are displayed in parallel, such that index 2 in each vector can be read across, much like a spreadsheet or database table. For example:
> names <- c('tom', 'dick', 'harry') > scores <- c(90, 60, 99) > height <- c(180, 160, 190) > d <- data.frame(names=names, scores=scores, height=height) > d names scores height 1 tom 90 180 2 dick 60 160 3 harry 99 190
You can think of a data frame as almost a database table. Not surprisingly, when you execute a PL/R function inside of PostgreSQL, you can retrieve the contents of a table into a data frame and then manipulate it.
Now, in this example, you can see that the scores of the three students would appear to be correlated with their heights—such that the taller the student, the higher their score. One of the first things you learn in statistics, of course, is that correlation doesn't imply causality, so I should note that I'm definitely not trying to say taller people are smarter! But you can find, at least in our data sample here, a correlation between height and score. This is the sort of thing that R does, and does very well. The easiest way to find the correlation is to run a simple regression—meaning, to find the best possible line that will connect these dots, if “height” is the independent (x) variable and “scores” is the dependent (y) variable. In R, you would express this as:
> lm( scores ~ height, data=d) Call: lm(formula = scores ~ height, data = d) Coefficients: (Intercept) height -151.714 1.329
You can do even better than this though. You can assign the output of your call to lm() into a variable. This variable, like everything in R, will then be an object on which you can perform additional calculations:
score.lm <- lm( scores ~ height, data=d)
This object contains the information you need to know in order to predict people's scores based on their heights (assuming there is a correlation, of course, which I'm not at all claiming there is, outside of this contrived example). You then can do this:
> intercept <- coefficients(score.lm)[1] > slope <- coefficients(score.lm)[2] > new.student.height <- 157 > predicted.score <- intercept + (slope * new.student.height) > predicted.score (Intercept) 56.87143
Now, if you're trying to predict test scores based on student height, you're likely to be disappointed. But, perhaps you're trying to predict other things—for example, the number of pages people will click on if they came from a particular search keyword, or the likelihood users will purchase something from you, if they initially came to your site during lunch hour rather than at night. These are the simplest sorts of questions you can try to answer with a statistical regression, and as you begin to work with such data, you see more and more opportunities for analysis.
Now, it's possible to do this analysis in many different ways. Google Analytics is a highly popular (and powerful) platform for making certain correlations. And, of course, you always can dump your database into CSV format and then read it into R or another package for analysis. But what PL/R lets you do is run all of this analysis on your database itself, with a language (unlike Pl/PgSQL) that is optimized for fast mathematical analysis.
One of the most important parts of R is CRAN, an analog to Perl's CPAN, Python's PyPi and Ruby Gems—an extensive set of open-source packages on a wide variety of subjects, which implement functionality you might well want to use. For example, my dissertation research involves understanding what sorts of social networks were created among users of the software I created; using a CRAN project called statnet, such analysis becomes relatively easy to do.
Installing PL/R is possibly the hardest part of working with PL/R, although it has gotten significantly easier since the creation of the “extension” system in PostgreSQL. First, make sure you have installed the latest versions of both R and PostgreSQL. Earlier versions will work, but particularly in the case of PostgreSQL, a modern version will be better, thanks to the extension system. I assume in this article that you are using PostgreSQL 9.1.
Now, you need to set the R_HOME environment variable. This variable will tell the PL/R compilation and extension mechanism where to find R's header and library files. On my Ubuntu server, after installing R via apt-get, I set R_HOME to /usr/share/R:
export R_HOME=/usr/share/R
Once you've set that up, you can download the PL/R source code. At the time of this writing, the latest version is 8.3.0 and is available from the PL/R home page. Then, as the instructions indicate, go into the plr directory that results from opening the .tar.gz, and type:
USE_PGXS=1 make USE_PGXS=1 make install
Note that this doesn't install PL/R into any of your PostgreSQL databases. Rather, it makes the PL/R extension available, such that you then can create the PL/R extension inside any database that would like to benefit from it.
After installing the extension, I went into my personal PostgreSQL database (named “reuven”, same as my user name) and invoked:
SELECT * from pg_available_extensions;
I could tell that the extension had been installed correctly, because one of the output rows from this query contained plr. Thus, I was able to install it with:
CREATE EXTENSION plr;
PostgreSQL responded with CREATE EXTENSION, meaning that the query was successful.
Now that PL/R has been installed, what can you do with it? Since you installed it for the purpose of writing functions, the natural thing to do is...write a function. For example, here's a PL/R function that multiplies two numbers:
CREATE OR REPLACE FUNCTION mult(num1 INTEGER, num2 INTEGER) RETURNS INTEGER AS $$ return(num1 * num2); $$ LANGUAGE plr;
If you ever have written a PL/PgSQL function before, you'll recognize the general outline of the function-creation syntax. But between the $$ quotation symbols, instead of PL/PgSQL, you instead have an R function. Because you're not in the normal R environment, you don't have the normal R function assignment or parameters, and you do need to specify a return type. But the function works just fine, and someone using this function doesn't need to know that it was written in R:
reuven=# select mult(50, 20); mult ------ 1000 (1 row)
Where PL/R really comes into its own is when you have data that needs R-type analysis. For example, let's put the same score-height data into a database table:
CREATE TABLE Students ( name TEXT, score INTEGER, height INTEGER ); INSERT INTO Students (name, score, height) VALUES ('tom', 90, 180), ('dick', 60, 160), ('harry', 99, 190);
If you can get this data from a PostgreSQL table into an R data frame, you can perform a regression on the data, returning the slope of the intercept line:
CREATE OR REPLACE FUNCTION score_height_slope() RETURNS TEXT AS $$ students <- pg.spi.exec("select name, score, height FROM ↪students"); score.lm <- lm(score ~ height, data=students); return(score.lm[[2]]); $$ LANGUAGE PLR;
Now, note that in this case, you're not running a regression directly on the data in the table. Rather, the table data is read into R, which creates a data frame, on which you run the regression. However, the ease with which you can do this, and the way in which the SQL query (using the pg.spi.exec function) can retrieve database information and stick it in a data frame, makes all the difference. If retrieving all of the data in one fell swoop would be a problem, you might prefer to use PL/R's other implementations of the SPI (server programming interface) API for PostgreSQL, including support for working with cursors.
PL/R is one of those ideas I never would have understood if I had encountered it years ago, but now, given my exposure to (and use of) statistics, there are many ways I can foresee using it. There are some limitations; PL/R functions cannot easily call other PL/R functions, and data types don't always match up as you might expect with their PostgreSQL counterparts. But PL/R offers support for such advanced PostgreSQL features as aggregates and window functions, allowing you to produce all sorts of sophisticated reports and analysis.