At the Forge

Learning Data Science

Reuven M. Lerner

Issue #278, June 2017

Data science is big. If you want to learn it, where do you start?

In my last few articles, I've written about data science and machine learning. In case my enthusiasm wasn't obvious from my writing, let me say it plainly: it has been a long time since I last encountered a technology that was so poised to revolutionize the world in which we live.

Think about it: you can download, install and use open-source data science libraries, for free. You can download rich data sets on nearly every possible topic you can imagine, for free. You can analyze that data, publish your findings on a blog, and get reactions from governments and companies.

I remember learning in high school that the difference between freedom of speech and freedom of the press is that not everyone has a printing press. Not only has the internet provided everyone with the equivalent of a printing press, but it has given us the power to perform the sort of analysis that until recently was exclusively available to governments and wealthy corporations.

During the past year, I have increasingly heard that data science is the sexiest profession of the 21st century and the one that will be in greatest demand. Needless to say, those two things make for a very appealing combination! It's no surprise that I've seen a major uptick in the number of companies inviting me to teach on this subject.

The upshot is that you (yes, you, dear reader) should spend time in the coming weeks, months and years learning whatever you can about data science. This isn't because you will change jobs and become a data scientist. Rather, it's because everyone is going to become a data scientist. No matter what work you do, you'll be better at it, because you will be able to use the tools of data science to analyze past performance and make predictions based on it.

Back when I started to develop web applications, it was the norm to have a database team that created the tables and queries. Nowadays, although there certainly are places that have a full-time database staff, the assumption is that every developer has at least a passing familiarity with relational (or even NoSQL) databases and how to work with them. In the same way that developers who understand databases are more powerful than those who don't, people in the computer field who understand data science are more powerful than those who don't.

There is a bit of bad news on this front, though. If you thought that programming and the web changed at a breakneck pace, you haven't seen anything yet! The world of data science (the tools, the algorithms, the applications) is moving at an overwhelming speed. The good news is that everyone is struggling to keep up, which means that if you find yourself overwhelmed, you're probably in very good company. Just be sure to keep moving ahead, aiming to increase your understanding of the theory, algorithms, techniques and software that data scientists use.

Where should you start? In this article, I describe some of the resources I've found to be the most helpful as I've been diving deeper and deeper into data science.

Statistics

There's no way around it. If you're going to do data science, you're going to need to learn some statistics. I took a year of it in graduate school, and then I did some analysis as part of my dissertation, but there's a lot I don't know, so I've been trying to improve my understanding. Every little bit helps! Whether you're simply learning Bayes' Theorem, figuring out how linear regression works or learning how to modify your data to minimize errors, statistics is a crucial part of this.
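To make the first of those concrete, here is a minimal sketch of Bayes' Theorem in plain Python. The function name and the numbers (a condition affecting 1% of people, a test that catches 95% of real cases and wrongly flags 5% of healthy people) are hypothetical, chosen only for illustration:

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) using Bayes' Theorem.

    prior               -- P(H), probability of the hypothesis before evidence
    likelihood          -- P(E | H), probability of the evidence if H is true
    false_positive_rate -- P(E | not H), probability of the evidence if H is false
    """
    # Total probability of seeing the evidence at all
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers, for illustration only
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Under these assumed numbers, a positive test means only about a 16% chance of actually having the condition, which is the sort of counterintuitive result that makes learning the statistics worthwhile.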

So, where do you start? There are a number of courses, often for free or at very low cost, at edX, Udemy and Coursera. A particularly popular introduction to machine learning, which includes the basic statistical knowledge you'll need, is taught by Stanford professor Andrew Ng via Coursera. If you're looking for something more hard-core, I definitely recommend the Udemy courses by LazyProgrammer.

Two good and standard textbooks on the subject are An Introduction to Statistical Learning (by James, Witten, Hastie and Tibshirani) and The Elements of Statistical Learning (by Hastie, Tibshirani and Friedman). Both books are published by Springer, and both are available in PDF form, as free downloads: www.springer.com/us/book/9781461471370 (An Introduction to Statistical Learning) and statweb.stanford.edu/~tibs/ElemStatLearn (The Elements of Statistical Learning). You probably should download and read those books; over time, the ideas and methods they describe will help you to reason about what you're doing.

I also want to recommend the various books and courses offered by Jason Brownlee at his site machinelearningmastery.com. His writing is clear, and he tries to be very practical about what he shows you. Especially if you're using Python for machine learning, his books are a great way to get started and improve your understanding.

Note that you definitely should not wait until you have read through books, watched lectures and taken courses to start playing with machine learning. That would be akin to saying you should try to learn a language only after you have mastered its grammar. As with language, you should be trying to use it at the same time that you're learning how it works.

Along with understanding the math, it's also important to have a good skeptical, statistical look at the world. Jake VanderPlas has a talk called “Statistics for Hackers” that not only translates the mathematical ideas into code, but also concentrates on the aspects that are most likely to be of interest in data science.

Two other books worth mentioning are The Cartoon Guide to Statistics (by Larry Gonick and Woollcott Smith) and Statistics Done Wrong (by Alex Reinhart). Both books are good for getting you to think in this way. By that I mean, when someone presents you with data, or if you are about to present others with data, you'll at least find some of the holes in the argument or alternative explanations to yours.

Data Science Theory

Although statistics certainly is an important part of data science, it's not the only part. Indeed, there are a number of model types that aren't statistical, such as K Nearest Neighbors.
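To show how little machinery K Nearest Neighbors needs, here is a minimal sketch in plain Python: classify a new point by letting its k closest training points vote. The function name, the toy data and the choice of k=3 are all my own assumptions for illustration; in practice you would more likely reach for a library implementation.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for query by majority vote among its k nearest neighbors.

    train -- list of (point, label) pairs, where each point is a tuple of numbers
    query -- tuple of numbers with the same length as the training points
    """
    def squared_distance(point):
        # Squared Euclidean distance; the ordering matches true distance
        return sum((a - b) ** 2 for a, b in zip(point, query))

    nearest = sorted(train, key=lambda item: squared_distance(item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
training_data = [
    ((1.0, 1.0), "red"),
    ((1.5, 2.0), "red"),
    ((2.0, 1.5), "red"),
    ((6.0, 6.0), "blue"),
    ((6.5, 7.0), "blue"),
    ((7.0, 6.5), "blue"),
]

print(knn_predict(training_data, (2.0, 2.0)))  # red
print(knn_predict(training_data, (6.0, 7.0)))  # blue
```

Notice that there is no training step and no probability model here: the "model" is just the stored data plus a distance function.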

Knowing the different types of algorithms that are available, when each is appropriate and how to tweak them will be invaluable. In many cases, you'll just want to throw a bunch of algorithms at the problem—and if your data set is small and/or easy to understand, that'll be just fine. But if it takes a long time to train your model, trying a dozen different algorithms is neither smart nor effective. Just as an expert cook knows which knife to use, and a good programmer should know which language is appropriate for a given task, someone building machine learning models should know which algorithms are more likely to be useful. (It's not always 100% obvious, but you do want to narrow down your starting set.)

In addition to the books I mentioned above, some others are well worth reading and reviewing. Doing Data Science by Cathy O'Neil and Rachel Schutt, as well as the Python Data Science Handbook by Jake VanderPlas, introduce the ideas behind data science, but they also include working code and examples that you can and should play with.

A phenomenal resource is the Analytics Vidhya site (analyticsvidhya.com), which summarizes, describes and instructs in a truly staggering number of technologies, algorithms and theories. Daily email messages from this site always are interesting and useful, and, quite frankly, overwhelming in their number and scope.

Data Science Hacking

Although statisticians have been using software for many years, one of the key differences between statistics and data science is that the latter requires programming knowledge. It's no surprise, given its shallow learning curve and huge, friendly community, that Python has become the leading language for data science. If you choose to use Python (which I definitely recommend), you'll need to learn a number of libraries that don't always adhere to the standard Python way of doing things: NumPy and Pandas provide data structures, and then there's also scikit-learn, which provides the algorithms and support for machine learning.

The websites for each of these packages, but especially scikit-learn, are huge, and they likely will make you think you never can learn it all. And indeed, no one is expecting you to know everything that those packages can do by heart. But over time, you will be expected to understand more and more algorithms and ideas, and also how to implement them.

If you're using Python, the Jupyter notebook is likely to be your day-to-day tool of choice. Jupyter (jupyter.org) continues to expand in impressive functionality, with new versions released every few weeks. If you're new to Python or to dynamic languages in general, Jupyter can feel a bit odd, but it quickly grows on you and will become a fluid part of your day-to-day work.

As you can see, it's important to practice. I often say that programming languages are like human (natural) languages, in that you need to practice using them to gain true fluency. Data science is the same, but it's also different, in that you need fluency in several related disciplines in order to succeed.

Fortunately, the world of data science is large and growing, providing a lot of interesting data sets for people to analyze, both for fun and practice, and also for serious use. “I Quant NY” (iquantny.tumblr.com) is a blog that not only provides interesting information about New York City from city-supplied data sets, but also shows how data scientists can ask questions and provide answers that affect many people. If you're looking for data sets, it can be hard to know just where to start or what sort of analysis might be most appropriate. The weekly newsletter “Data is Plural” by Jeremy Singer-Vine (https://tinyletter.com/data-is-plural), the “data sets” subreddit (https://www.reddit.com/r/datasets) and the new website Data.World (data.world) all offer a staggering number of data sets on a variety of topics. Choose something that's of interest to you, and see what questions you can ask and answer.

I would be remiss if I didn't mention a few of the podcasts to which I listen. Not only do they provide me with the latest news, information, anecdotes and updates from the world of data science, they also allow me to understand the trends better—for example, in favor of neural networks and deep learning. “Partially Derivative” (partiallyderivative.com) and “Linear Digressions” (lineardigressions.com) are my two favorites, but there are some others, such as “Data Science at Home” (worldofpiggy.com/podcast) and “Data Skeptic” (https://www.dataskeptic.com). Podcasts aren't going to help you to code better; only more coding can really do that. But they will give you perspective and understanding that make the code more obvious.

Finally, although I believe that data science is changing our world for the better, we do need to be on the lookout for potential issues. Cathy O'Neil's book, “Weapons of Math Destruction”, is a must-read for anyone entering this world. Even if you aren't writing algorithms that will affect millions of people, it is important to be aware of our biases as humans, and of our need to be transparent when implementing policy via machine. This is easily one of the best books I've read in the last few years.

I'll definitely return to data science topics in the future, given their importance to developers. But for my next article, I plan to return to the world of web applications and databases, looking at the languages, libraries and packages we use to create modern applications.