Issue #277, May 2017

Which of these data points doesn't belong? Machine learning can tell you.

In my last few articles, I've looked at a number of ways machine learning can help make predictions. The basic idea is that you create a model using existing data and then ask that model to predict an outcome based on new data.

So, it's not surprising that one of the most amazing ways machine learning is being applied is in predicting the future. Just a few days before writing this piece, it was announced that machine learning models actually might be able to predict earthquakes—a goal that has eluded scientists for many years and that has the potential to save thousands, and maybe even millions, of lives.

But as you've also seen, machine learning can be used to “cluster” data—that is, to find patterns that humans either can't or won't see, and to try to put the data into various “clusters”, or machine-driven categories. By asking the computer to divide data into distinct groups, you gain the opportunity to find and make use of previously undetected patterns.

Just as clustering can be used to divide data into a number of coherent groups, it also can be used to decide which data points belong inside a group and which don't. In “novelty detection”, you have a data set that contains only good data, and you're trying to determine whether new observations fit within the existing data set. In “outlier detection”, the data may contain outliers, which you want to identify.

Where could such detection be useful? Consider just a few questions you could answer with such a system:

Is there an unusual number of login attempts from a particular IP address?

Are any customers buying more than the typical number of products at a given hour?

Which homes are consuming above-average amounts of water during a drought?

Which judges convict an unusual number of defendants?

Should a patient's blood tests be considered normal, or are there outliers that require further checks and examinations?

In all of those cases, you could set thresholds for minimum and maximum values and then tell the computer to use those thresholds in determining what's suspicious. But machine learning changes that around, letting the computer figure out what is considered “normal” and then identify the anomalies, which humans then can investigate. This allows people to concentrate their energies on understanding whether the outliers are indeed problematic, rather than on identifying them in the first place.

So in this article, I look at a number of ways you can try to identify outliers using the tools and libraries that Python provides for working with data: NumPy, Pandas and scikit-learn. Just which technique and tools will be appropriate for your data depend on what you're doing, but the basic theory and practice presented here should at least provide you with some food for thought.

Humans are excellent at finding patterns, and they're also quite good at finding things that don't fit a pattern. But, what sort of algorithm can look at a set of data points and figure out which are unlike the others?

One simple way to do this is to set a cutoff, often done at one or two standard deviations. For those of you without a background in statistics (or who have forgotten what a “standard deviation” is), it's a measurement of how spread out the data is. For example:

>>> import numpy as np
>>> a = np.array([10,10,10,10,10,10,10])
>>> print("std = {}, mean = {}".format(a.std(), a.mean()))
std = 0.0, mean = 10.0

In the above example, I have a NumPy array containing seven instances of the number ten. People often think of the mean as describing the data, and it does, but it's only when combined with the standard deviation that you can know how much the numbers differ from one another. In this case, they're all identical, so the standard deviation is 0.

In this example, the mean remains the same, but the standard deviation is quite different:

>>> a = np.array([5,15,0,20,-5,25,10])
>>> print("std = {}, mean = {}".format(a.std(), a.mean()))
std = 10.0, mean = 10.0

Here, the mean has not changed, but the standard deviation has. You can see, from just those two numbers, that although the numbers remain centered around 10, they also are spread out quite a bit.
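To make the definition concrete, here is a minimal sketch that computes the standard deviation of the same array by hand and compares it with NumPy's result. The variable name `manual_std` is just for illustration.

```python
import numpy as np

a = np.array([5, 15, 0, 20, -5, 25, 10])

# Population standard deviation: the square root of the mean squared
# deviation from the mean. This is what ndarray.std() computes by
# default (ddof=0).
mean = a.mean()
manual_std = np.sqrt(((a - mean) ** 2).mean())

print(manual_std, a.std())
```

Note that Pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), so the same numbers in a Pandas series would give a slightly larger value unless you pass `ddof=0`.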

One simple way to detect unusual data is to look for all of the values that lie more than two standard deviations from the mean. For normally distributed data, about 95% of the values fall within that range. (You can go further out if you want; about 99.73% of data points are within three standard deviations, and 99.994% are within four.) If you're looking for outliers in an existing data set, you can do something like this:

>>> a = np.array([-5,15,0,20,-5,25,1000])
>>> print(a.std())
347.19282415231044
>>> min_cutoff = a.mean() - a.std()*2
>>> max_cutoff = a.mean() + a.std()*2
>>> print(a[(a<min_cutoff) | (a>max_cutoff)])
[1000]

Sure enough, that found an outlier in the data.
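The coverage figures quoted above for two, three and four standard deviations hold only for normally distributed data. As a rough check, they can be reproduced with the error function from Python's standard library; the helper name `fraction_within` is made up for this sketch.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard
    deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(k, round(fraction_within(k) * 100, 4))
```

Data that isn't roughly bell-shaped (rainfall, for instance, tends to be heavily skewed) won't necessarily follow these percentages, so a two-standard-deviation cutoff may flag more or fewer points than you'd expect.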

It's even easier if you have a bunch of new data and want to determine whether those values would fit inside or outside your existing data set:

>>> new_data = np.array([-5000, -3000, -1000, -500, 20, 60, 500, 800, 900])
>>> print(new_data[(new_data<min_cutoff) | (new_data>max_cutoff)])
[-5000 -3000 -1000   900]

The good news is that this is simple—simple to understand, simple to implement and simple to automate.

However, it's also too simple for most data. You're unlikely to be looking at a single-dimensional vector. The baseline (mean) is likely to shift over time. And besides, there must be other, better ways to measure whether something is “inside” or “outside”, right?
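One way to cope with a shifting baseline, sketched here under assumptions rather than offered as the article's method, is to compare each value against the mean and standard deviation of a trailing window instead of the whole data set. The series, the window length of 10 and the injected spike are all invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a steadily rising baseline with one injected spike.
values = pd.Series(np.arange(60, dtype=float))
values.iloc[45] += 50.0

window = 10  # assumed window length; tune for your own data

# Baseline statistics from the *previous* `window` observations only;
# shift(1) keeps the current value out of its own baseline.
rolling_mean = values.shift(1).rolling(window).mean()
rolling_std = values.shift(1).rolling(window).std()

# Flag values more than two rolling standard deviations from the rolling mean.
outliers = values[(values - rolling_mean).abs() > 2 * rolling_std]
print(outliers)
```

A global cutoff would treat the later, larger values as increasingly suspicious; the rolling version follows the trend and flags only the spike.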

For real-world anomaly detection, you're going to need to improve on a few fronts. You'll need to consider the data and determine what's “in” and what's “out”. You'll also need to figure out ways to evaluate your model.

Let's consider novelty detection: there is initial data, and you want to know if a new piece of data would fit inside the existing data or if it would be considered an outlier. For example, consider a patient who comes in with values from a blood test. Do those tests indicate that the patient is normal, because the data's values are similar to the ones you've already seen? Or are those new values statistical outliers, indicating that the patient needs additional attention?

In order to experiment with novelty and outlier detection, I downloaded historic precipitation data for an area of Pennsylvania (Wyncote), just outside Philadelphia, for every day in 2016. Because I'm a scientific kind of guy, I downloaded the data in metric units. The data came from the US government, at https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table.

That site contains clear instructions for downloading data from here: https://www.ncdc.noaa.gov/cdo-web/datasets.

It's quite amazing what government data is freely available, and the sorts of analysis you can do with it once you've retrieved it.

I downloaded the data as a CSV file and then used Pandas to read it into a data frame:

>>> import pandas as pd
>>> df = pd.read_csv('/Users/reuven/downloads/914914.csv', usecols=['PRCP', 'DATE'])

Notice that I was interested only in PRCP (precipitation) and DATE (the date, in YYYYMMDD format). I then manipulated the data to break apart the DATE column and then to remove it:

>>> df['DATE'] = df['DATE'].astype(str)
>>> df['MONTH'] = df['DATE'].str[4:6].astype(np.int8)
>>> df['DAY'] = df['DATE'].str[6:8].astype(np.int8)
>>> df.drop('DATE', inplace=True, axis=1)

Why would I break the date apart? Because it'll likely be easier for models to work with separate numeric columns for the month and day, rather than a single date-time column. Besides, having these columns as part of my model will make it easier to understand whether snow in July is abnormal. I ignore the year, since it's the same for every record, which means that it can't help me as a predictor in this model.
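String slicing works, but the same split can also be done by parsing the dates. The following is a sketch on a small made-up data frame shaped like the CSV described above; the three rows and their values are hypothetical.

```python
import pandas as pd

# Hypothetical rows in the same shape as the downloaded CSV.
df = pd.DataFrame({'DATE': [20160115, 20160704, 20161231],
                   'PRCP': [0.0, 12.7, 3.3]})

# Parse the integer YYYYMMDD values into real timestamps.
dates = pd.to_datetime(df['DATE'].astype(str), format='%Y%m%d')

# Pull out the numeric parts, then drop the original column.
df['MONTH'] = dates.dt.month
df['DAY'] = dates.dt.day
df = df.drop(columns='DATE')
print(df)
```

Parsing has the side benefit of raising an error on malformed dates, which slicing would silently turn into nonsense months or days.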

My data frame now contains 353 rows of data from 2016 (I'm not sure why it's not 366, since 2016 was a leap year), with columns indicating the amount of rain (in mm), the month and the day.
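If you want to know which days are missing from a data set like this, one approach is to compare the observed dates against a full calendar. This sketch uses three invented dates in place of the real file; with the actual data you would build `observed` from the DATE column before dropping it.

```python
import pandas as pd

# Hypothetical observation dates with a gap; the real file would supply these.
observed = pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-05'])

# Every calendar day in the period the data should cover.
expected = pd.date_range('2016-01-01', '2016-01-05', freq='D')

# Days that appear in the calendar but not in the observations.
missing = expected.difference(observed)
print(missing)

# A complete 2016 would have 366 daily rows, because it was a leap year.
print(len(pd.date_range('2016-01-01', '2016-12-31', freq='D')))
```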

Based on this, how can you build a model to indicate whether rainfall on a given day is normal or an outlier?

In scikit-learn, you always use the same method: you import the estimator class, create an instance of that class and then fit the model. In the case of supervised learning, “fitting” means teaching the model which inputs go with which outputs. In the case of unsupervised learning, which I'm doing here, you use “fit” with just a set of inputs, allowing the model to distinguish between inliers and outliers.
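The difference between the two kinds of fitting is easiest to see side by side. This is only an illustrative sketch on synthetic data; `LinearRegression` is chosen here as an arbitrary supervised estimator and is not used elsewhere in this article.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))      # synthetic inputs
y = 3 * X[:, 0] - X[:, 1]          # synthetic outputs

# Supervised: fit() receives the inputs and the outputs that go with them.
supervised = LinearRegression().fit(X, y)

# Unsupervised: fit() receives only the inputs.
unsupervised = IsolationForest(random_state=0).fit(X)

print(supervised.predict(X[:3]))
print(unsupervised.predict(X[:3]))  # 1 = inlier, -1 = outlier
```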

In the case of this data, there are several types of models that I can build. I experimented a bit and found that the `IsolationForest` estimator gave me the best results. Here's how I create and train the model:

>>> from sklearn.ensemble import IsolationForest
>>> model = IsolationForest()
>>> model.fit(df)

The model now has been trained, so I can find out whether a given amount of rain, on a certain month and day, is considered normal.

To try things out, I check the model against its own inputs:

>>> pd.Series(model.predict(df)).value_counts()

In the above code, I run `model.predict(df)`. This gives the inputs to the model and asks it to predict whether these are normal, expected values (indicated by 1) or outlier values (indicated by -1). By turning the result into a Pandas series and then calling `value_counts`, I see:

 1    317
-1     36

Although it marked 36 days of its own training data as outliers, maybe those days really were unusual. The model certainly would be improved if it had multiple years' worth of data, rather than just one year's worth.
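How many rows get flagged depends heavily on the estimator's `contamination` parameter, the assumed share of outliers in the training data. Flagging 36 of 353 rows (about 10%) is consistent with a setting of 0.1, which appears to have been the default in scikit-learn releases of that period; newer releases default to `'auto'`. The sketch below uses synthetic data as a stand-in for the rainfall frame and sets the value explicitly.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(size=(400, 3))   # synthetic stand-in for the rainfall frame

# contamination is the assumed share of outliers in the training data;
# 0.1 is an illustrative choice, not a recommendation.
model = IsolationForest(contamination=0.1, random_state=42).fit(X)

labels = model.predict(X)             # 1 = inlier, -1 = outlier
scores = model.decision_function(X)   # lower = more anomalous; negative = outlier

flagged_share = (labels == -1).mean()
print(flagged_share)
print(np.argsort(scores)[:5])         # indices of the five most anomalous rows
```

Looking at the scores rather than only the labels lets you rank the suspicious rows and inspect the most extreme ones first.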

Now what? I can ask the system to make some predictions:

for i in range(1, 13):
    print(model.predict([[15, i, 16]]))

This will tell you whether it's normal to get 15 mm of rain on the 16th of each month. The conclusion of the model: yes, it's perfectly normal in February–July, but not so in August–January. What if there's zero precipitation?

for i in range(1, 13):
    print(model.predict([[0, i, 16]]))

It turns out that no matter what month, it's never an outlier to have zero rain on the 16th of the month.
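The same kind of query can be made in a single call by putting the candidate observations in a data frame whose columns match the training data. This sketch trains on randomly generated stand-in data (so its answers say nothing about real rainfall); only the column names PRCP, MONTH and DAY come from the article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the rainfall frame; the values are invented.
rng = np.random.RandomState(0)
train = pd.DataFrame({'PRCP': rng.exponential(3.0, size=300),
                      'MONTH': rng.randint(1, 13, size=300),
                      'DAY': rng.randint(1, 29, size=300)})
model = IsolationForest(random_state=0).fit(train)

# One query row per month: 15 mm of rain on the 16th.
queries = pd.DataFrame({'PRCP': 15.0,
                        'MONTH': range(1, 13),
                        'DAY': 16})
queries['LABEL'] = model.predict(queries[['PRCP', 'MONTH', 'DAY']])
print(queries)
```

Keeping the queries in a data frame means the column order can't silently drift out of step with the order used for training.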

Of course, those are just crude tests. The real thing to do is use our old friend `train_test_split`:

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test = train_test_split(df)
>>> model.fit(X_train)
>>> pd.Series(model.predict(X_test)).value_counts()

The model did pretty well, given that I didn't even try to tune it:

 1    77
-1    12
dtype: int64

In other words, given data that should all be classified as inliers, you can see here that the overwhelming majority is indeed classified correctly.
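In the run above, 77 of the 89 held-out rows (about 87%) were labeled as inliers. A reproducible version of that check might look like the following sketch, which again uses synthetic data with the same number of rows; the `random_state` values are arbitrary and exist only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = rng.normal(size=(353, 3))   # synthetic stand-in, same row count as above

# By default, train_test_split holds out 25% of the rows for testing.
X_train, X_test = train_test_split(X, random_state=1)

model = IsolationForest(random_state=1).fit(X_train)

# Share of held-out rows that the model considers normal.
inlier_share = (model.predict(X_test) == 1).mean()
print(len(X_test), inlier_share)
```

Because there are no labels saying which days truly were unusual, this share is a sanity check rather than an accuracy score.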

There are other types of estimators you can use as well. In particular, the One-Class SVM estimator has had a good track record of working with input data. That, combined with a larger data set, might well improve the results shown above—although in trying One-Class SVM for this article, I didn't see any such results. It's possible that if I were to add several more years' worth of data, other estimators would work better.
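For comparison, a One-Class SVM can be tried with the same fit-and-predict pattern. This is a sketch on synthetic data, not the configuration used for this article; the `nu` value, the scaling step and the two test points are all assumptions made for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(300, 3))      # synthetic "normal" data
X_new = np.array([[0.0, 0.0, 0.0],       # near the center of the training data
                  [8.0, 8.0, 8.0]])      # far from anything seen in training

# nu is an upper bound on the fraction of training errors; 0.1 is an
# illustrative value. Scaling first matters because the RBF kernel is
# distance-based.
model = make_pipeline(StandardScaler(), OneClassSVM(nu=0.1, gamma='scale'))
model.fit(X_train)

print(model.predict(X_new))   # 1 = inlier, -1 = outlier
```

Unlike the tree-based isolation forest, the SVM is sensitive to feature scale, so columns measured in millimeters and columns holding month numbers should usually be standardized before fitting.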

Novelty and outlier detection is (yet another) large, exciting and growing use for machine learning. As usual with machine learning, the problem is not one of coding, but rather of massaging the data into a format that you can use, and then tinkering with model definitions until you find one that predicts or identifies outliers with a high degree of confidence.

Copyright © 1994 - 2017 Linux Journal. All rights reserved.