Part of the fun of learning data science is seeing how quickly it can relate to your usual roles and responsibilities. While AIG sells insurance, I catch criminals. While underwriters make predictions, I protect data. So when I needed a testbed to practice deep learning and better understand a business perspective, I turned to Kaggle’s Porto Seguro’s Safe Driver Competition.
This competition is fun because you are asked to “… build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year.” To save you time, you should know our team did not win, and the 1st place winner’s submission is a fantastic read. However, I did want to cover some of the critical lessons I learned.
The Fast.ai cohort and building a team
Currently, Jeremy Howard is teaching Fast.ai part 1 to a group of local and international students. The 7-week course covers many of the fundamentals of setting up and running a model to drive results: practicality first, technicalities second. The free classes were so good that I had to apply again when he switched from Keras to PyTorch. Many in the cohort have written fantastic articles (here, here, here) based on what we are learning in class.
For me, competing head-to-head against other data scientists helps solidify my learning, and so once again I turned to Kaggle. Since I had previously worked on time-series predictions for web traffic, I found the Porto competition especially tempting.
A great thing about the Fast.ai cohort is that you can quickly find someone who is also interested in a similar project. Fortunately, I was able to team up with Devan Govender, another student participating.
Due to time limitations, we only had about 8 days to work on the project. This accelerated timeline was great because it forced us to move quickly.
I have to admit that my GitHub skills are lacking. Due to the Kaggle competition rules, Devan and I set up a private GitHub repository to share information back and forth. At the time, there was no way to install the Fast.ai repo with pip through the Kaggle interface.
One thing I enjoyed about the private GitHub repository was the ease of sharing ideas back and forth. It took only minutes to run what Devan had uploaded. Plus, there were more than a couple of times that GitHub provided a way to get back to a known good state.
A few commands became my bread and butter for using GitHub.
Clone (git clone): gets a copy of the project I am working on.
Status and Pull (git status, git pull): After the clone command, we have the folder with the GitHub code. We can then check the status (it is up to date) and try a pull (again, it is up to date). This is extremely important before we start making changes to the code.
Push (git commit, git push): After making our changes, we commit them with a message describing the modifications and the files changed. Then we push the commit back to GitHub for someone else to use.
Additionally, in the Jupyter notebook, I set up the code so that we would not have to change too many things due to pathing.
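One way to keep paths portable between two machines is to resolve the data folder once, at the top of the notebook. The sketch below is only an illustration of that idea, not the code from our notebook: the environment variable name PORTO_DATA_DIR, the default folder name, and the file name train.csv are all assumptions made for the example.

```python
import os
from pathlib import Path


def resolve_data_dir(env_var="PORTO_DATA_DIR", default="data"):
    """Return the data folder as an absolute path.

    env_var and default are hypothetical names chosen for this sketch.
    Each teammate can point the variable at their own copy of the data
    without editing the notebook itself.
    """
    raw = os.environ.get(env_var, default)
    return Path(raw).expanduser().resolve()


# Every later cell builds its paths from this one variable.
DATA_DIR = resolve_data_dir()
TRAIN_CSV = DATA_DIR / "train.csv"  # hypothetical file name
```

With this in place, only the environment variable (or the default folder) differs between collaborators, and the rest of the notebook stays identical.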
Also, now that the competition is over, we can release the code, allowing everyone to see it in its unfiltered madness.
Getting data in the right place
Unlike other competitions, Porto’s data is anonymized more than I would expect. The column labels are nondescript, but at least each column is labeled with its type of data: continuous, categorical, or boolean.
Mistyped features can be a real problem, both for reading the records and for using them correctly. I could barely tell what each column was meant to capture, and with anonymized data it is much more difficult to go back and interpret the meaning of the values provided.
Fortunately, we can work through it. For example, a column with a value of 1 in it could be interpreted in the following ways:
- As a boolean value: True, the insured car was in an accident
- As a categorical value: The insured vehicle is a Ford
- As a continuous value: This car has gone 1 mile
However, what happens if the next record is 3? How would each interpretation describe the relationship?
- A boolean value would not make sense because booleans are only yes or no.
- A categorical value would suggest the car is a Ferrari, not a Ford. This change of make could drastically change the chance of a claim.
- It could be continuous, but the difference between a car with 1 mile vs. 3 miles is likely insignificant.
As you can see, accidentally mislabeling the type of a value can have a significant effect on how the data is interpreted.
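To make the difference concrete, here is a minimal sketch of how the same raw values, 1 and 3, look to a model under each interpretation. The function names, the list of levels, and the mileage range are all hypothetical and chosen only for illustration.

```python
def as_boolean(value):
    # Only 0 and 1 are meaningful codes for a yes/no flag.
    if value not in (0, 1):
        raise ValueError(f"{value} is not a valid boolean code")
    return bool(value)


def as_categorical(value, levels):
    # One-hot encoding: each level gets its own slot, so the codes
    # carry no notion of order or distance between them.
    return [1 if value == level else 0 for level in levels]


def as_continuous(value, lo, hi):
    # Min-max scaling: the numeric distance between values is preserved.
    return (value - lo) / (hi - lo)


levels = [1, 2, 3]  # hypothetical codes, e.g. 1 = Ford, 3 = Ferrari
ford = as_categorical(1, levels)
ferrari = as_categorical(3, levels)

# As mileage on a hypothetical 0 to 100,000 mile scale, 1 and 3 are almost identical.
gap = as_continuous(3, 0, 100_000) - as_continuous(1, 0, 100_000)
```

As categories, 1 and 3 end up as completely different inputs. As continuous values on that scale, they are nearly the same number. As booleans, 3 is simply rejected.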
Luckily, in this competition, we were told which type each column belongs to. However, I wanted to double-check them, so I ran some analytics:
- A boolean should have at most 3 values (True, False, NaN).
- A categorical column will likely have a number of unique values in the double, but not triple, digits.
- A continuous column will have many, many unique values. Going back to the mileage example, imagine all the different mileage counts that would be available. Almost every car would have a unique value: one for cars with 1 mile, 2 miles, 3 miles… etc.
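The rules of thumb above can be sketched as a small unique-value check. This is an illustration under stated assumptions rather than the exact analysis we ran: the function name, the cutoff of 99 categories, and the toy columns are hypothetical, and -1 is treated as the missing-value marker.

```python
def guess_column_type(values, missing=-1, max_categories=99):
    """Guess a column's type from how many distinct values it holds."""
    # Ignore the missing marker so it does not count as a real level.
    observed = {v for v in values if v != missing}
    n_unique = len(observed)
    if n_unique <= 2:
        return "boolean"  # True/False, plus the missing marker we dropped
    if n_unique <= max_categories:
        return "categorical"  # at most double digits of distinct levels
    return "continuous"  # many, many unique values


# Toy columns with hypothetical names, standing in for the anonymized data.
columns = {
    "flag": [0, 1, 1, -1, 0],
    "make": list(range(20)) * 3,
    "mileage": [i * 1.5 for i in range(500)],
}
guesses = {name: guess_column_type(vals) for name, vals in columns.items()}
```

Comparing the guesses against the types the organizers provided is a quick way to spot a column that deserves a closer look.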
We can see below that the cats, bools, and conts all make lots of sense. At least we are not as blind as we were before.
Boolean and categories and continuous oh my…
My most significant oversight was not noticing how many values were missing and working out how to handle them (some columns had over 50% missing). We see these as NaN values, represented as a -1, in the code. Depending on the type of the data, there are different ways to deal with them:
- For boolean and categorical columns, we can simply treat NaN as an additional category. Not having a value can sometimes give just as much information as having it.
- The continuous variables are a little tricky. Leaving them with a default of -1 can be goofy. Assuming that the model rated low-mileage vehicles favorably, any car missing a value would be rated more favorably than a brand-new car! What we tried at the last minute was to take the average of the other values so that the missing entries would not skew the prediction.
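The two strategies above can be sketched as follows. This is a minimal illustration, not our competition code: the helper names and the "missing" label are hypothetical, and it assumes -1 never occurs as a legitimate value in the column.

```python
from statistics import mean

MISSING = -1  # the marker used for NaN in the data


def impute_categorical(values, missing=MISSING, missing_label="missing"):
    # Treat "no value" as its own category instead of hiding it.
    return [missing_label if v == missing else v for v in values]


def impute_continuous(values, missing=MISSING):
    # Replace the marker with the mean of the observed values so that
    # -1 is not mistaken for an extremely low measurement.
    observed = [v for v in values if v != missing]
    if not observed:
        raise ValueError("column has no observed values to average")
    fill = mean(observed)
    return [fill if v == missing else v for v in values]


makes = impute_categorical([1, -1, 3])
mileage = impute_continuous([10.0, -1, 30.0])
```

Mean imputation keeps the missing rows from looking like extreme low values, at the cost of hiding the fact that they were missing in the first place.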
I think these methods worked out fine, but it goes to point out the difficulty in working with data anonymized in this manner.
Reverse engineering features
I had major problems with training on my data. The first thing I tried was to go back and properly count the number of unique values in each column. This check helped confirm that the data was correctly labeled and that the labeling could not be improved. Even with the alterations, we continued to have problems.
Getting the right learning rate seemed fickle, and when we were using the Gini coefficient, it took some time to move downwards. There were just too many things to calculate.
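For readers unfamiliar with the metric, here is a minimal sketch of a normalized Gini coefficient for a binary target, computed through its standard relationship to the area under the ROC curve (Gini = 2 × AUC − 1). The function names are my own, and this is not the scoring code we used during the competition.

```python
def auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum formulation."""
    n = len(y_score)
    order = sorted(range(n), key=lambda i: y_score[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores and give each the average 1-based rank.
        j = i
        while j + 1 < n and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def normalized_gini(y_true, y_score):
    # 1.0 is a perfect ordering, 0.0 is no better than random ordering.
    return 2 * auc(y_true, y_score) - 1


perfect = normalized_gini([0, 0, 1, 1], [0.1, 0.2, 0.3, 0.4])
uninformative = normalized_gini([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
```

Because the score depends only on how the predictions are ordered, not on their absolute values, it can respond slowly to changes that merely rescale the predicted probabilities.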
At this point, we saw that most competitors dropped the less important features. My last-ditch effort was to remove as many as I could to better understand what might have been going wrong. It did not help much.
When the dust settled, we placed in the top 16% of over 5,000 submissions. I certainly learned a lot about how to coordinate better with my partner (thanks Devan!) and about the importance of understanding the different types of features. More than that, it helped me realize how important it is, when collecting data, to know what it looks like and how to represent it. We are still a long way off from just throwing all our data into a black box to see what magic pops out the other side.
Disclaimer: Although I chose to work on a competition hosted by an insurance company, there is no overlap between my hobby of data science research and my responsibilities at AIG. Only personal computing resources, personal free time, and competition-provided data were used.