
At the Forge

Pandas

Reuven M. Lerner

Issue #264, April 2016

Reading data from CSV files, and then analyzing the data, is easy with Pandas.

In my last article, I started discussing the amazing world of data science—in which you explore and navigate through data, trying to find correlations that might be of interest to your business and/or point to trends you should consider.

Serious practitioners of data science use the full scientific method, starting with a question and a hypothesis, followed by an exploration of the data to determine whether the hypothesis holds up. But in many cases, such as when you aren't quite sure what your data contains, it helps to perform some exploratory data analysis—just looking around, trying to see if you can find something.

And, that's what I'm going to cover this month, using tools provided by the amazing Python ecosystem for data science, sometimes known as the SciPy stack. It's hard to overstate the number of people I've met in the past year or two who are learning Python specifically for data science needs. Back when I was analyzing data for my PhD dissertation, just two years ago, I was told that Python wasn't yet mature enough to do the sorts of things I needed, and that I should use the R language instead. I do have to wonder whether the tables have turned by now; the number of contributors and contributions to the SciPy stack is phenomenal, making it a more compelling platform for data analysis.

In my last article, I described how to filter through logfiles, turning them into CSV files containing the information that was of interest. Here, I explain how to import that data into Pandas, which provides an additional layer of flexibility and will let you explore the data in all sorts of ways—including graphically. Although I won't necessarily reach any amazing conclusions, you'll at least see how you can import data into Pandas, slice and dice it in various ways, and then produce some basic plots.

Pandas

NumPy is a Python package, downloadable from the Python Package Index (PyPI: PyPI.python.org), which provides a data structure known as a NumPy array. These arrays, although accessible from Python, are mainly implemented in C for maximum speed and efficiency. They also operate on a vector basis, so if you add 1 to a NumPy array, you're adding 1 to every single element in that array. It takes a while to get used to this way of thinking, and to the fact that the array should have a uniform data type.
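Here's a minimal sketch of that vectorized behavior (the values are arbitrary):

import numpy as np

a = np.array([10, 20, 30])   # one uniform dtype (integers here)
a + 1                        # array([11, 21, 31]); the 1 is applied to every element
a * 2                        # array([20, 40, 60]); the same goes for other operators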

Now, what can you do with your NumPy array? You could apply any number of functions to it. Fortunately, SciPy has an enormous number of functions defined and available, suitable for nearly every kind of scientific and mathematical investigation you might want to perform.

But in this case, and in many cases in the data science world, what I really want to do is read data from a variety of formats and then explore that data. The perfect tool for that is Pandas, an extensive library designed for data analysis within Python.

The most basic data structure in Pandas is a “series”, which is basically a wrapper around a NumPy array. A series can contain any number of elements, all of which should be of the same type for maximum efficiency (and reasonableness). The big deal with a series is that you can set whatever indexes you want, giving you more expressive power than would be possible in a NumPy array. Pandas also provides some additional functionality for series objects in the form of a large number of methods.
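For instance, here's a small, made-up series whose indexes are strings rather than the default integers:

from pandas import Series

s = Series([200, 404, 301], index=['ok', 'missing', 'redirect'])
s['missing']        # 404; look elements up by the labels you chose
s.describe()        # one of the many extra methods a series provides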

But the real powerhouse of Pandas is the “data frame”, which is something like an Excel spreadsheet implemented inside of Python. Once you get a table of information inside a data frame, you can perform a wide variety of manipulations and calculations, often working in similar ways to a relational database. Indeed, many of the methods you can invoke on a data frame are similar or identical in name to the operations you can invoke in SQL.
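As a tiny, contrived illustration, you can build a data frame from a dictionary of columns and then query it in ways that resemble SQL:

from pandas import DataFrame

toy = DataFrame({'url': ['/feed/', '/about/', '/feed/'],
                 'bytes': [1024, 2048, 1024]})
toy[toy['bytes'] > 1500]     # like SQL's WHERE: rows whose bytes exceed 1500
toy.groupby('url').sum()     # like GROUP BY: total bytes per URL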

Installing Pandas isn't very difficult, if you have a working Python installation already. It's easiest to use pip, the standard Python installation program, to do so:

sudo pip install -U numpy matplotlib pandas

The above will install a number of different packages; the -U flag tells pip to upgrade any of them that are already installed at an older version.

As good as Pandas is, it's even better when it is integrated with the rest of the SciPy stack and inside of the Jupyter (that is, IPython) notebook. You can install this as well:

sudo pip install -U 'jupyter[notebook]'

Don't forget the quotes, which ensure that the shell doesn't try to interpret the square brackets as a form of shell globbing. Now, once you have installed this, run the Jupyter notebook:

jupyter notebook

If all goes well, the shell window should fill with some logfile output. But soon after that, your Web browser will open, giving you a chance (using the menu on the right side of the page) to open a new Python notebook. The idea is that you'll then interact with this document, entering Python code inside the individual cells, rather than putting it in a file. To execute the code inside a cell, just press Shift-Enter; the cell will execute, and the result of evaluating the final line will be displayed.

Even if I weren't working in the area of data science, I would find the Jupyter Notebook to be an extremely clean, easy-to-use and convenient way to work with my Python code. It has replaced my use of the text-based Python interactive shell. If nothing else, the fact that I can save and return to cells across sessions means that I spend less time re-creating where I was the previous time I worked on a project.

Inside Jupyter Notebook, you'll want to load NumPy, Pandas and a variety of related functionality. The easiest way to do so is to use a combination of Python import statements and the %pylab magic function within the notebook:

%pylab inline
import pandas as pd
from pandas import Series, DataFrame

The above ensures that everything you'll need is defined. In theory, you don't need to alias Pandas to pd, but everyone else in the Pandas world does so. I must admit that I avoided this alias for some time, but finally decided that if I want my code to integrate nicely with other people's projects, I really should follow their conventions.
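The %pylab magic is specific to IPython and Jupyter. In a standalone script, a rough equivalent (only a sketch, since %pylab also pulls many NumPy and Matplotlib names directly into the namespace) would be explicit imports:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series, DataFrame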

Reading the CSV

Now let's read the CSV file that I created for last month's article. As you might remember, the file contains a number of columns, separated by tabs, which were created from an Apache logfile. It turns out that CSV, although a seemingly primitive format for exchanging information, is one of the most popular methods for doing so in the data science world. As a result, Pandas provides a variety of functions that let you turn a CSV file into a data frame.

The easiest and most common such function is read_csv. As you might expect, read_csv can be handed a filename as a parameter, which it'll read and turn into a data frame. But read_csv, like many other of the read_* functions in Pandas, also can take a file object or even a URL.
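For example (with hypothetical filenames and a hypothetical URL), any of these hands read_csv something it knows how to read:

pd.read_csv('data.csv')                          # a filename
pd.read_csv(open('data.csv'))                    # an open file object
pd.read_csv('http://example.com/logs/data.csv')  # a URL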

I started by trying to read access.csv, the CSV file from last month's article, with the read_csv method:

df = pd.read_csv('access.csv')

Unfortunately, this failed with a rather strange error message, indicating that different lines of the file contained different numbers of fields. After a bit of thought and debugging, I realized that the error arose because the file contains tab-separated values, whereas pd.read_csv assumes comma separators by default. So, you can retry the load, passing the sep parameter:

df = pd.read_csv('access.csv', sep='\t')

And sure enough, that worked! Moreover, the column names are taken from the headers defined at the top of the file. You can see them by asking the data frame for its keys:

df.keys()

Now, you can think of a data frame as a Python version of an Excel spreadsheet or of a table in a two-dimensional relational database, but you also can think of it as a set of Pandas series objects, with each series providing a particular column.

I should note that read_csv (like the other read_* functions in Pandas) is a truly amazing piece of software. If you're trying to read from a CSV file and Pandas isn't handling it correctly, you either have an extremely strange file format, or you haven't found the right option yet.
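To give a sense of that flexibility, here are a few of the options I reach for most often, shown against a hypothetical file:

pd.read_csv('other.csv',
            sep='\t',                     # field separator
            header=None,                  # the file has no header row...
            names=['ip', 'r', 'status'],  # ...so supply column names yourself
            na_values=['-'])              # treat "-" as a missing value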

Navigating through the Data Frame

Now that you've loaded the CSV file into a data frame, what can you do with it? First, you can ask to see the entire thing, but in the case of this example CSV file, there are more than 27,000 rows, which means that printing it out and looking through it is probably a bad idea. (That said, when you look at a data frame inside Jupyter, you will see only the first few rows and last few rows, making it easier to deal with.)

If you think of your data frame as a spreadsheet, you can look at individual rows, columns and combinations of those.

You can ask for an entire column by using the column (key) name in square brackets or even as an attribute. Thus, you can get all of the requested URLs by asking for the “r” column, as follows:

df['r']

Or like this:

df.r

Of course, this still will result in the printing of a very large number of rows. You can ask for only the first few rows by using Python slice syntax—something that's often quite confusing for people when they start with Pandas, but which becomes natural after a short while. (Remember that using an individual column name inside square brackets produces one column, whereas using a slice inside square brackets produces one or more rows.)

So, to see the first ten rows, you can say:

df[:10]

And of course, if you're interested only in seeing the first ten HTTP requests that came into the server, then you can say:

df.r[:10]

When you ask for a single column from a data frame, you're really getting a Pandas series, with all of its abilities.
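You can confirm the distinction between rows and columns by checking the types of these expressions (a quick sketch):

type(df['r'])    # pandas.core.series.Series (one column)
type(df[:10])    # pandas.core.frame.DataFrame (a subset of the rows)
type(df.r[:10])  # pandas.core.series.Series (the first ten values of one column)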

One of the things you often will want to do with a data frame is figure out which values appear most often. This is especially true when working with logfiles, which are supposed to give you some insight into your site's activity. For example, perhaps you want to find out which URLs were most popular. You can start by asking df to count its rows:

df.count()

This gives you, for each column, the number of non-null values it contains; when nothing is missing, that's simply the number of rows. (If all you want is the total row count, Python's built-in len works on a data frame, too.) But, you also can retrieve a single column (which is a Pandas series) and ask it to count the number of times each value appears:

df['r'].value_counts()

The resulting series uses the values themselves (that is, the URLs) as its index; its values are the counts of how many times each one appeared, sorted in descending order.

Plotting

This is already great, but you can do even better and plot the results. For example, you might want to have a bar graph indicating how many times each of the top ten URLs was invoked. You can say:

df['r'].value_counts()[:10].plot.bar()

Notice how you take the original data frame, count the number of times each value appears, take the top ten of those, and then invoke methods for plotting via Matplotlib, producing a simple, but effective, bar chart. If you're using Jupyter and invoked %pylab inline, the chart will appear right in your browser window, rather than in an external program.

You similarly can make a pie chart:

df['r'].value_counts()[:10].plot.pie()

But wait a second. This chart indicates that the most popular URL by a long shot was /feed/, a URL used by RSS readers to access my blog. Although that's flattering, it masks the other data I'm interested in. You thus can use “boolean indexing” to retrieve a subset of rows from df and then plot only those rows:

df[~df.r.str.contains('/feed/')]['r'].value_counts()[:10].plot.pie()

Whoa...that looks huge and complicated. Let's break it apart to understand what's going on (a step-by-step rewrite follows the list):

  • This used boolean indexing to retrieve some rows and get rid of others. The conditions are expressed using a combination of generic Python and NumPy/Pandas-specific syntax and code.

  • This example used the str.contains method provided by Pandas, which enables you to find all of the rows where the URL contained “/feed/”.

  • Then, the example used the (normally) bitwise operator ~ to invert the logic of what you're trying to find.

  • Finally, the result is plotted, providing a picture of which URLs were and were not popular.
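If the chained version is hard to read, you can build up the same result step by step; this sketch does exactly the same thing:

is_feed = df.r.str.contains('/feed/')   # boolean series: True where the URL contains /feed/
non_feed = df[~is_feed]                 # keep only the rows where that test is False
top_ten = non_feed['r'].value_counts()[:10]
top_ten.plot.pie()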

Reading the data from CSV and into a data frame gives great flexibility in manipulating the data and, eventually, in plotting it.

Conclusion

In this article, I described how to read logfile data into Pandas and even produced a few small plots with it. Next month, I'll explain how you can transform the data even further, to provide insights for everyone interested in the logfile.

Reuven M. Lerner offers training in Python, Git and PostgreSQL to companies around the world. He blogs at blog.lerner.co.il, tweets at @reuvenmlerner and curates DailyTechVideo.com. Reuven lives in Modi'in, Israel, with his wife and three children.
