
The blunders from my first data science competition that you can avoid

There is a dark room in my house that has no windows, runs 5 degrees warmer than the rest of the house, accounts for a large part of my electrical bill, and emits a high-pitched whirring. My computer lab attests to my addiction to overclocking computer components. While it started with cryptocurrency mining, this time my GPUs have been overworked trying to finish the last few epochs of my models for the Kaggle Web Traffic Time Series Forecasting competition.

Kaggle is a free online platform where users can learn how data science works and compete for recognition and prizes. I used it as a way to test what I have been learning for the past several months, and I landed on the Web Traffic competition, which was due to end in a couple of weeks. At one point, a thousand teams were trying to estimate Wikipedia's page views from September until November in an attempt to win a $25K prize. It was tough and, unfortunately, I did not fare as well as I had hoped, but I learned a lot. Here are some of the major mistakes I ran into along the way.

Educational resources and thanks up front

The only reason I got this far was thanks to the Fast.AI MOOC taught by Jeremy Howard, which I have been studying for the past several months. While my code for the competition is here, I would suggest you borrow Jeremy's Rossman code for a better understanding of building out time series problems.

What are we trying to predict here?

Given a large stack of historical data, we are trying to predict Wikipedia page visits per day in the future. The best advice is always to split the data into three different sets (a rough sketch of such a split follows the list):

  1. Training data – the schoolbook of examples
  2. Validation data – practice problems with answers
  3. Predictions (testing) – the teacher's test
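Here is a minimal sketch of the kind of time-based split I mean. The column names are hypothetical, and the actual competition file is a wide page-by-date table that would need reshaping first.

```python
import pandas as pd

# Hypothetical long-format table: one row per page per day.
df = pd.read_csv("train.csv", parse_dates=["date"])
df = df.sort_values("date")

# Hold out the most recent days for validation so the model is always
# judged on dates it has never seen (mirrors the real test period).
cutoff = df["date"].max() - pd.Timedelta(days=60)
train = df[df["date"] <= cutoff]
valid = df[df["date"] > cutoff]

# The "test" set is the future period Kaggle scores, so no labels exist
# for it locally; predictions for those dates are what gets submitted.
```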

The competition gives you the training data, which you can split up into these different sets. While there are many different ways to run the models, I was using Keras to make my predictions. My model runs looked like this.

Let me decrypt this for a second. An epoch is one full run of the model through the calculations. ETA shows the time remaining (most epochs took about 11 minutes), and the loss (the degree to which it was wrong) sits at 26. My training set had 49.7 million points, and my validation set had 21 million.
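For context, output like that comes from a Keras fit call along these lines. This is only a minimal sketch, not my actual competition model; the layer sizes are placeholders, and X_train, y_train, X_valid, y_valid are assumed to be already-prepared NumPy arrays.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder architecture: a small dense network over tabular features.
model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(X_train.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),  # predicted visits for one page on one day
])
model.compile(optimizer="adam", loss="mae")

# Each epoch is one full pass over the training rows; Keras prints the
# ETA and the running loss for every epoch as it trains.
model.fit(X_train, y_train,
          validation_data=(X_valid, y_valid),
          epochs=10, batch_size=4096)
```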

Ideally, the training set would be even larger relative to the validation set (the goal being around 80-90% of the data); however, the 8 GB of memory on my GPU could not hold any more. Which brings me to my first mistake.

I was not able to use multiple GPUs

Although I have several GPUs at my disposal, for some reason I could not get the data to spread across multiple devices, nor could I run different models assigned to individual GPUs. Having hardware that sat underutilized was a huge disappointment, because I would have been able to run more epochs and hold larger datasets. Additionally, I opted to reduce the number of features in my data in favor of having more days of data. To put it another way, my model did not spend enough time running through its studies.
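For what it is worth, one common way to spread a Keras model across GPUs (the thing I never got working) is TensorFlow's MirroredStrategy. This is only a sketch of the general idea, not the code I ran, and it again assumes prepared X_train and y_train arrays.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# MirroredStrategy copies the model to every visible GPU and splits each
# batch across them, synchronizing gradients after every step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in use:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = keras.Sequential([
        layers.Dense(128, activation="relu", input_shape=(X_train.shape[1],)),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mae")

# A larger effective batch size helps keep all of the GPUs busy.
model.fit(X_train, y_train, epochs=10,
          batch_size=4096 * strategy.num_replicas_in_sync)
```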

Here is an example of some good, solid training. See how the orange line (actual values from the validation set) closely maps to the blue line (predictions made on the validation set). This model would have a lower loss and do an excellent job of predicting how many visits a page will see.

However, here are some less-trained models. I include a red line showing the mean of the data, since that was a popular method of estimation in the competition.
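Charts like these are just line plots of actual versus predicted visits for a single page, with the mean drawn as a reference. A rough matplotlib sketch, assuming actual and predicted are aligned NumPy arrays covering one page's validation window:

```python
import matplotlib.pyplot as plt

# actual and predicted are assumed to be 1-D arrays of daily visits
# for a single page over the validation window.
plt.plot(actual, color="orange", label="actual (validation)")
plt.plot(predicted, color="blue", label="predicted")
plt.axhline(actual.mean(), color="red", linestyle="--", label="mean")
plt.xlabel("day in validation window")
plt.ylabel("page visits")
plt.legend()
plt.show()
```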

The most dreaded graph is this monstrosity.

The top model had more runs against it, as it was smaller and much faster to train. The lower three charts do not have the same number of runs because the extra data slowed things down too much; those models got around 10 rounds instead of around 50. Fewer iterations of the model corresponded almost directly to a decrease in accuracy on my data. However, with all the variables included, each run took nearly an hour, and I ran out of time because…

The dreaded 12:30 AM data extraction error

Be careful with your data. I had spent my lunch hour setting up a run that lasted 9 hours, only to discover that some data was amiss. Most of my rows included several features used to predict the number of visits and looked like this.

When sorted by date, I saw several dates that occurred before the competition began. While digging through the data, I realized that I was extracting the wrong date. In some cases, I was grabbing the first date instead of the second, which meant my model considered all of those rows to occur on the same day (an example below). This oversight caused errors for 10,000 of my data points.
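The fix itself was simple: take the last date-like token in each key rather than the first. Here is a sketch of the idea, using an illustrative key rather than the exact competition format.

```python
import re

# Illustrative key: the page title itself contains a date, and the date
# the row actually refers to is the one appended at the end of the key.
key = "Portal:Current_events/2017-06-14_en.wikipedia.org_all-access_2017-09-13"

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

wrong = DATE_RE.search(key).group(0)   # first match: "2017-06-14" (the title's date)
right = DATE_RE.findall(key)[-1]       # last match:  "2017-09-13" (the row's date)
```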

Once I fixed the data, I let the model run overnight, getting a solid ten epochs in on the dataset. I finally submitted right before the deadline with a sigh of relief.

Interesting details of the competition

One pleasant surprise was how friendly the participants are. There are many conversations about the techniques people are working on, trying out, and their thoughts on them. Even if you are a beginner, some great models require almost no hardware. A rather popular example was the 1-line solution, which gives a really straightforward way to predict the data. I learned a lot just from studying that one line!
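I won't reproduce that kernel here, but the popular simple baselines were variations on the same idea: predict each page's recent median traffic for every future day. A rough sketch in that spirit, assuming the competition's wide train.csv layout with a Page column plus one column per historical date:

```python
import pandas as pd

# Wide layout: one row per page, remaining columns are daily visit counts.
train = pd.read_csv("train.csv")

# Predict every future day as the median of each page's last ~8 weeks.
recent = train.iloc[:, -56:]
preds = recent.median(axis=1).fillna(0)
```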

Additionally, although the competition submission deadline has passed, the actual scoring will go on for the next few months. So while I sit rather low right now, I am hoping to see my position slowly climb as the scores are updated over the next few weeks.