What Have I Gotten Myself Into?

“Man. Yo, first I want to thank the great algorithm that put us all here.”

It was fitting to hear those words from Donald Glover as he accepted an Emmy Award on the night before I started the Metis Data Science Bootcamp. A few years ago, you’d expect an MIT professor to reference algorithms in an acceptance speech. In 2017, one of the hottest young stars in Hollywood is doing it.

The Harvard Business Review declared data scientist the “sexiest job of the 21st century” in 2012. But let’s be honest – when did the Harvard Business Review become the arbiter of what’s sexy? Hollywood decides what is sexy. When Hollywood starts naming algorithms at awards shows, you know something big is happening.

“Big Data” has become ubiquitous in society in ways that are exciting, overwhelming … and intimidating. I’m a data junkie who can analyze data all day, but I’m accustomed to analyzing datasets with thousands of records. Not billions. Big Data is intimidating to many scientists because of its scale, as well as confusion over what “data science” is. Is it basically the same thing we’ve done for years, only with a much bigger n? Some experts in my field even see data science as a fad that will go away. Those same people once said that about the internet.

I’ve been excited about data science for a long time, but I’ve also wondered, “Is data science really that different from what I’ve done in the past?” After a week at Metis, I can definitively say yes.

What I’ve learned about data science so far

Our first project at Metis was for a hypothetical client organizing a gala in New York City. To increase attendance, the client wanted to place staff at subway stations where they would recruit people. In a monstrous city like NYC, which stations do you target? Our task was to use NYC subway data to find the answer.

It was a deceptively simple question that revealed some unique challenges of data science. I won’t go into details of our analysis (many Metis alums have done it more eloquently), but instead focus on early lessons that the project taught me.

1) Think before you code

The best thing about Big Data is that it’s easy to access. The worst thing about Big Data is that it’s easy to access. It took minutes to acquire months of subway data, which made it tempting to start coding and analyzing immediately. Unfortunately, we did exactly that without much strategizing, which led us down rabbit holes and away from the task. Lesson for myself – put down the data (I know it’s hard), step away from the computer, and talk through a project first. This lesson holds in every scientific field, but the temptation is much greater when data are at your fingertips.

2) Treat your computer with love

My computer wasn’t happy with me by the end of this project. I was analyzing 500,000 records, which isn’t huge, but is enough to slow you down if you’re not careful. To make matters worse, I was performing every step of the analysis on every record, needlessly wasting memory. My computer finally collapsed in exhaustion one evening after it essentially ran a marathon with a 500,000-lb weight on its back. In hindsight, I could’ve structured the analysis differently to reduce the volume of data needed for each step. It would’ve achieved the same goal and my computer would’ve been much happier.
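One way to picture "reducing the volume of data needed for each step" is to collapse raw records into per-station daily totals as they stream past, so later steps only touch the small summary. The Python sketch below is a minimal illustration and not the code from the project; the record layout (station, date, entries), the station names, and the numbers are all made up.

```python
from collections import defaultdict


def daily_totals(records):
    """Sum entries per (station, date) in a single pass.

    `records` can be any iterable, including a generator that reads a
    file line by line, so the full dataset never has to sit in memory.
    """
    totals = defaultdict(int)
    for station, date, entries in records:
        totals[(station, date)] += entries
    return dict(totals)


def fake_records():
    """Hypothetical stand-in for a file reader; yields one record at a time."""
    yield ("STATION_A", "2017-09-01", 120)
    yield ("STATION_A", "2017-09-01", 80)
    yield ("STATION_B", "2017-09-01", 45)
    yield ("STATION_A", "2017-09-02", 130)


totals = daily_totals(fake_records())
print(totals)
```

Everything downstream (ranking stations, plotting, filtering) can then work from `totals`, which has one entry per station per day rather than one per raw record.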

3) Start small, fail fast

One way to ease up on your computer is to test things out on sub-samples first. Aside from conserving resources, it helps catch the coding errors that will inevitably occur. Early in the week, I made a small but crucial mistake that went unnoticed until it was almost too late. I probably would’ve caught the mistake in a smaller dataset, but in a large dataset, it’s easy to miss obvious things that can derail you.
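A sub-sample can be as simple as a fixed-seed random draw that you run your checks against before touching the full dataset. The sketch below is a minimal Python illustration under invented assumptions: the rows, the helper names `subsample` and `count_suspicious`, and the "negative value" check are hypothetical, not taken from the project.

```python
import random


def subsample(rows, k, seed=0):
    """Return k rows chosen at random (all rows if there are fewer than k).

    A fixed seed makes the draw repeatable, which helps when debugging.
    """
    rows = list(rows)
    if k >= len(rows):
        return rows
    return random.Random(seed).sample(rows, k)


def count_suspicious(rows):
    """Count rows with a negative value, a cheap sanity check."""
    return sum(1 for _, value in rows if value < 0)


# Hypothetical data: 10,000 (station, entries) rows, every 50th one negative.
all_rows = [
    ("STATION_%d" % (i % 20), -1 if i % 50 == 0 else i)
    for i in range(10_000)
]

sample = subsample(all_rows, 500)
print(len(sample), "rows sampled;", count_suspicious(sample), "look suspicious")
```

If a check misbehaves on 500 rows, it is far quicker to find out there than after a full run.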

4) Most of all ... don’t be a perfectionist

As is often the case, the data were messy. Many subway numbers were implausibly large or even negative. My perfectionist instinct was to fix it all because I wanted an exquisitely clean dataset. After 2 days of head-banging, though, I realized a perfect dataset was unnecessary. All we needed were the highest-traffic stations, which we could identify by taking a large sample of days and using the median across days to filter out implausible numbers. It was a simple solution to a dataset that would never be perfect.
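To see why the median is so forgiving, here is a minimal Python sketch with made-up numbers (the station names, counts, and the `top_stations` helper are hypothetical and not the project's code). One negative reading and one absurdly large reading are left in the data untouched, and the ranking still comes out sensibly.

```python
from statistics import median


def top_stations(daily_counts, n):
    """Rank stations by their median daily count and return the top n names."""
    medians = {
        station: median(counts) for station, counts in daily_counts.items()
    }
    return sorted(medians, key=medians.get, reverse=True)[:n]


# Hypothetical daily entry counts, including two implausible values.
daily_counts = {
    "STATION_A": [5000, 5200, -300, 5100, 4900],    # one negative reading
    "STATION_B": [800, 9000000, 750, 820, 790],     # one absurd spike
    "STATION_C": [3000, 3100, 2900, 3050, 2950],
}

busiest = top_stations(daily_counts, 2)
print(busiest)  # ['STATION_A', 'STATION_C']
```

Ranking by the mean instead would put STATION_B first on the strength of a single bad day, which is exactly the kind of error the median shrugs off.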

I suspect I’ll be re-learning these lessons repeatedly in the next 12 weeks. We haven’t even started the sexy sides of data science that will make Donald Glover swoon. But the more exciting things get, the more important it will be to keep these unsexy lessons in mind.

Dan Taber