
At the Forge

Preparing Data for Machine Learning

Reuven M. Lerner

Issue #271, November 2016

Before you can use machine-learning models, you need to clean the data.

When I go to Amazon.com, the online store often recommends products I should buy. I know I'm not alone in thinking that these recommendations can be rather spooky—often they're for products I've already bought elsewhere or that I was thinking of buying. How does Amazon do it? For that matter, how do Facebook and LinkedIn know to suggest that I connect with people whom I already know, but with whom I haven't yet connected online?

The answer, in short, is “data science”, a relatively new field that marries programming and statistics in order to make sense of the huge quantity of data we're creating in the modern world. Within the world of data science, machine learning uses software to create statistical models to find correlations in our data. Such correlations can help recommend products, predict highway traffic, personalize pricing, display appropriate advertising or identify images.

So in this article, I take a look at machine learning and some of the amazing things it can do. I increasingly feel that machine learning is sort of like the universe—already vast and expanding all of the time. By this, I mean that even if you think you've missed the boat on machine learning, it's never too late to start. Moreover, everyone else is struggling to keep up with all of the technologies, algorithms and applications of machine learning as well.

For this article, I'm looking at a simple application of categorization and “supervised learning”, solving a problem that has vexed scientists and researchers for many years: just what makes the perfect burrito? Along the way, you'll hopefully start to understand some of the techniques and ideas in the world of machine learning.

The Problem

The problem, as stated above, is a relatively simple one to understand: burritos are a popular food, particularly in southern California. You can get burritos in many locations, typically with a combination of meat, cheese and vegetables. Burritos' prices vary widely, as do their sizes and quality. Scott Cole, a PhD student in neuroscience, argued with his friends not only over where they could get the best burritos, but also over which factors led to a burrito being better or worse. Clearly, the best way to solve this problem was by gathering data.

Now, you can imagine a simple burrito-quality rating system, as used by such services as Amazon: ask people to rate the burrito on a scale of 1–5. Given enough ratings, that would indicate which burritos were best and which were worst.

But Cole, being a good researcher, understood that a simple, one-dimensional rating was probably not sufficient. A multi-dimensional rating system would keep ratings closer together (since they would be more focused), but it also would allow him to understand which aspects of a burrito were most essential to its high quality.

The result is documented on Cole's GitHub page (https://srcole.github.io/100burritos), in which he describes the meticulous and impressive work that he and his fellow researchers did, bringing tape measures and scales to lunch (in order to measure and weigh the burritos) and sacrificing themselves for the betterment of science.

Beyond the amusement factor—and I have to admit, it's hard for me to stop giggling whenever I read about this project—this can be seen as a serious project in data science. By creating a machine-learning model, you can not only describe burrito quality, but you also can determine, without any cooking or eating, the quality of a potential or theoretical burrito.

The Data

Once Cole established that he and his fellow researchers would rate burritos along more than one dimension, the next obvious question was: which dimensions should be measured?

This is a crucial question to ask in data science. If you measure the wrong things, then even with the best analysis methods, your output and conclusions will be wrong. Indeed, a fantastic new book, Weapons of Math Destruction by Cathy O'Neil, shows how collecting and using the wrong inputs can lead to catastrophic results for people's jobs, health care and safety.

So, you want to measure the right things. But just as important is to measure distinct things. In order for statistical analysis to work, you have to ensure that each of your measures is independent. For example, let's assume that the size of the burrito will be factored into the quality measurement. You don't want to measure both the volume and the length, because those two factors are related. It's often difficult or impossible to separate two related factors completely, but you can and should try to do so.

At the same time, consider how this research is being done. Researchers are going into the field (which is researcher-speak for “going out to lunch”) and eating their burritos. They might have only one chance to collect data. This means it'll likely make sense to collect more data than necessary, and then use only some of it in creating the model. This is known as “feature selection” and is an important aspect of building a machine-learning model.

Cole and his colleagues decided to measure ten different aspects of burrito quality, ranging from volume to temperature to salsa quality. They recorded the price as well, to see whether price was a factor in quality. They also had two general measurements: an overall rating and a recommendation. The subjective ratings were taken on a 0–5 scale, with 0 indicating that it was very bad and 5 indicating that it was very good; the size measurements and the price, of course, were recorded in their own units.

It's important to point out that the fact that they collected data on more than ten dimensions doesn't mean all of those measurements needed to be included in the model. Rather, this gave the researchers a chance to engage in feature selection, determining which factors most affected the burrito quality.

I downloaded Cole's data, in which 30 people rated more than 100 burritos at 31 different restaurants, from a publicly viewable Google Docs spreadsheet into a CSV file (burrito.csv). The spreadsheet's URL is https://docs.google.com/spreadsheets/d/18HkrklYz1bKpDLeL-kaMrGjAhUM6LeJMIACwEljCgaw/edit#gid=1703829449.

I then fired up the Jupyter (aka IPython) Notebook, a commonly used tool in the data science world. Within the notebook, I ran the following commands to set up my environment:

%pylab inline                         # load NumPy, display 
                                      # Matplotlib graphics
import pandas as pd                   # load pandas with an alias
from pandas import Series, DataFrame  # load useful Pandas classes
df = pd.read_csv('burrito.csv')       # read into a data frame

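Before going further, it's worth a quick sanity check that the CSV loaded as expected. A few standard Pandas calls (run each in its own notebook cell) show the shape of the data and the column names; the exact columns you see will depend on the version of the spreadsheet you download:

df.shape       # (number of rows, number of columns) in the data set
df.columns     # the column names taken from the spreadsheet's header row
df.head()      # the first five rows, to eyeball the values
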
At this point, the Pandas data frame contains all the information about the burritos. Before I could continue, I needed to determine which fields were the inputs (the “independent variables”, also known as “predictors”) and which was the output (the “dependent variable”).

For example, let's assume that the burritos were measured using a single factor, namely the price. The price would be the input/independent variable, and the quality rating would be the output/dependent variable. The model then would try to map from the input to the output.

Machine learning (and statistical models) works the same way, except it uses multiple independent variables. It also helps you determine just how much of an influence each input has on the output.

First, then, you'll need to examine your data, and identify which column is the dependent (output) variable. In the case of burritos, I went with the 0–5 overall rating, in column X of the spreadsheet. You can see the overall rating within Pandas with:

df['overall']

This returns a Pandas series containing the overall (0–5) score recorded for each burrito in the data set.

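Before choosing the inputs, it's worth getting a feel for how those overall scores are distributed. The “describe” method gives a quick statistical summary, and “isnull” shows how many burritos are missing an overall rating; again, run each line in its own cell:

df['overall'].describe()        # count, mean, std and quartiles of the ratings
df['overall'].isnull().sum()    # how many burritos lack an overall score
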
Now that I have identified my output, which inputs should I choose? This is what I described earlier, namely feature selection. Not only do you want to choose a relatively small number of features (to make the model work faster), but you also want to choose those features that truly will influence the output and that aren't conflated with one another.

Let's start by removing everything but the feature columns. Instead of dropping the columns I find uninteresting, I'll just create a new data frame whose values are taken from the interesting columns of this one. I want the columns at positions 11 through 22, which means I can issue the following command in Pandas:

burrito_data = df.iloc[:, 11:23]      # columns 11 through 22, by position

The iloc indexer selects by position; the slice 11:23 takes columns 11 through 22 (that is, up to and not including 23). In this way, you can pull a subset of columns into a smaller data frame. However, you still need to pare down your features.

Notice that my new data frame contains only the independent (input) variables; the overall score, which is our output variable, will remain on the side for now.

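It's worth confirming that the slice grabbed the columns I expect; a quick look at the new frame's column names and shape does the trick:

burrito_data.columns    # which features made it into the new data frame?
burrito_data.shape      # (number of burritos, number of candidate features)
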
Feature Selection

Now that I have all of the input variables, which should I choose? Which are most dependent on one another? I can create a “correlation matrix”, giving me, for each pair of features, a numeric value between -1 and 1, where 0 means the two are uncorrelated and values near 1 (or -1) mean they are strongly correlated (or inversely correlated). If I invoke the “corr” method on the data frame, I'll get a new data frame back, showing the correlations among all of the columns—with a correlation of 1.0 along the diagonal:

burrito_data.corr()

Now, it's true that you can look through this and understand it to some degree. But it's often easier for humans to understand images. Thus, you can use matplotlib, invoking the following:

plt.matshow(burrito_data.corr())

That produces a nice-looking, full-color correlation matrix in which the higher the correlation, the redder the color. The reddish squares show that (for example) there was a high correlation between the “length” and “volume” (not surprisingly), and also between the “meat” and the “synergy”.

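By default, matshow doesn't tell you which row or column corresponds to which feature. A few standard matplotlib calls, sketched here, add the column names and a colorbar so the plot is easier to read:

corr = burrito_data.corr()
plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()        # show which colors map to which correlation values
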
Another consideration is this: how much does a particular input variable vary across the data set? If it's always roughly the same, it's of no use in the statistical model. For example, let's assume that the price of a burrito is the same everywhere the researchers ate. In such a case, there's no use trying to see how much influence the price has on the quality.

You can ask Pandas to tell you about this, using the “var” method on the data frame. When I execute burrito_data.var(), I get back a Pandas series object:

burrito_data.var()

Length          4.514376
Circum          2.617380
Volume          0.017385
Tortilla        0.630488
Temp            1.047119
Meat            0.797647
Fillings        0.765259
Meat:filling    1.084659
Uniformity      1.286631
Salsa           0.935552
Synergy         0.898952
Wrap            1.384554
dtype: float64

You can see that the burrito volume changes very little. So, you can consider ignoring it when it comes to building the model.

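If you want to turn that eyeball test into code, you can ask Pandas to flag any columns whose variance falls below some cutoff. The 0.1 threshold below is an arbitrary value I'm using for illustration, not something taken from Cole's analysis:

variances = burrito_data.var()              # per-column variance (NaN ignored)
low_variance = variances[variances < 0.1]   # 0.1 is an arbitrary cutoff
low_variance                                # 'Volume' shows up here

reduced = burrito_data.drop(low_variance.index, axis=1)
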
There's another consideration here as well: is there enough input data for all of these features? It's normal to have some missing data; there are several ways to handle this, and one of them is simply to work without a feature that's missing too many values. You can use the “count” method on the data frame to find which columns are missing so much data that they may not be usable:

burrito_data.count()

Length          127
Circum          125
Volume          121
Tortilla        237
Temp            224
Meat            229
Fillings        236
Meat:filling    231
Uniformity      235
Salsa           221
Synergy         235
Wrap            235
dtype: int64

As you can see, a large number of data points for the three inputs that have to do with burrito size are missing. This, according to Cole, is because the researchers didn't have a tape measure during many of their outings. (This is but one of the reasons why I insist on bringing a tape measure with me whenever I go out to dinner with friends.)

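One reasonable way to handle this, sketched below, is to drop the three sparse size columns entirely and then discard the handful of rows that still contain missing values. The column names come from the output above; whether dropping rows is acceptable depends on how much data you can afford to lose.

trimmed = burrito_data.drop(['Length', 'Circum', 'Volume'], axis=1)
trimmed = trimmed.dropna()   # discard any rows that still have missing values

trimmed.count()              # every remaining column now reports the same count
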
Finally, you can ask scikit-learn to tell you which of these predictors contributed the most, or the least, to the outputs. You provide scikit-learn with inputs in a data frame and outputs in a series—for example:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X = burrito_data          # the input matrix (predictors)
y = df['overall']         # the output series (the 0-5 overall rating)

In the above code, I import some objects I'll need in order to help with feature selection. I then use the names that are traditional in scikit-learn, X and y, for the input matrix and output series. I then ask to identify the most significant features:

sel = SelectKBest(chi2, k=7)
sel.fit_transform(X, y)

Notice that when invoking SelectKBest, you have to provide a value for “k” that indicates how many predictors you want to get back. In this way, you can try to reduce your large number of predictors to a small one. But if you try to run the above as is, you'll encounter a problem: if there is missing data (NaN) in your input matrix, SelectKBest will refuse to run. So it's a good thing you already discovered which of your inputs are sometimes missing; once you remove those columns (and any remaining incomplete rows) from the input matrix, you can carry out the feature selection.

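Here's a minimal sketch of how that cleanup plus selection might look. Since the overall rating is a numeric 0–5 score rather than a set of class labels, I'm swapping scikit-learn's f_regression scoring function in for chi2; the column names and k=7 are carried over from above, and this illustrates the general approach rather than reproducing Cole's exact analysis:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# Drop the sparse size columns, attach the output column and keep
# only the rows with no missing values at all.
predictors = burrito_data.drop(['Length', 'Circum', 'Volume'], axis=1)
complete = predictors.join(df['overall']).dropna()

X = complete.drop('overall', axis=1)   # input matrix
y = complete['overall']                # output series

sel = SelectKBest(f_regression, k=7)   # f_regression suits a numeric target
sel.fit_transform(X, y)

X.columns[sel.get_support()]           # names of the predictors that were kept
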
Cole and his colleagues did this sort of analysis and found that they could remove some of their input columns—the “flavor synergy”, as well as those having to do with burrito size. Having gone through the above process, I'm sure you can easily understand why.

Conclusion

Now that you have a good data set—with an input matrix and an output series—you can build a model. That involves choosing one or more algorithms, feeding data into them and then testing the model to ensure that it's not overfit.

In my next article, I plan to do exactly that—take the data from here and see how to build a machine-learning model. I hope that you'll see just how easy Python and scikit-learn make the process of doing the actual development. However, I'll still have to spend time thinking about what I'm doing and how I'm going to do it, as well as which tools are most appropriate for the job.

Reuven M. Lerner offers training in Python, Git and PostgreSQL to companies around the world. He blogs at blog.lerner.co.il, tweets at @reuvenmlerner and curates DailyTechVideo.com. Reuven lives in Modi'in, Israel, with his wife and three children.
