
Predicting auto insurance claims with deep learning

Part of the fun of learning data science is seeing how quickly it can relate to your usual roles and responsibilities. While AIG sells insurance, I catch criminals. While underwriters make predictions, I protect data. So when I needed a testbed to practice deep learning and better understand a business perspective, I turned to Kaggle’s Porto Seguro’s Safe Driver Competition.

This competition is fun because you are asked to “… build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year.” To save you time, you should know our team did not win, and the 1st place winner’s submission is a fantastic read. However, I did want to cover some of the critical lessons I learned.

The Fast.ai cohort and building a team

Currently, Jeremy Howard is teaching Fast.ai part 1 to a group of local and international students. A 7-week course, it covers many of the fundamentals of setting up and running a model to drive results. Practicality first, technicalities second; the free classes were so good I had to apply again when he switched from Keras to PyTorch. Many of my cohort-mates have written some fantastic articles (here, here, here) based on what we are learning from class.

For me, competing head-to-head against other data scientists helps solidify my learning, and so once again I turned to Kaggle. Since I had previously worked on time-series predictions for web traffic, I found the Porto competition especially tempting.

A great thing about the Fast.ai cohort is that you can quickly find someone who is also interested in a similar project. Fortunately, I was able to team up with Devan Govender, another student participating.

Due to time limitations, we only had about 8 days to work on the project. This accelerated timeline was great because it forced us to move into the project quickly.

Sharing Data

I have to admit that my GitHub skills are lacking. Due to the Kaggle competition rules, Devan and I set up a private GitHub instance to share information back and forth. At the time, there was not a way to install the Fast.ai repo with pip through the Kaggle interface.

One thing I enjoyed about a private GitHub instance was the ease of sharing ideas back and forth. It took only minutes to run what Devan had uploaded. Plus, there were more than a couple of times that GitHub provided a way to get back to a known good state.

A few commands became my bread and butter for using GitHub.

Clone: gets a copy of the project I am working on.

Status and Pull: after the clone command, we have the folder with the GitHub code. We can check the status (it is up to date) and try a pull (again, it is up to date). This is extremely important before we start making changes to the code.

Push: after we have made our changes, we write a commit message describing the modifications and the files changed, then push it back to GitHub for someone else to use.

Additionally, in the Jupyter notebook, I set up the code so that we would not have to change too many things due to pathing.
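That pathing setup was nothing fancy. A minimal sketch of the idea (the directory names here are hypothetical, not our actual paths) looks something like this:

```python
from pathlib import Path

# Hypothetical data locations on each of our machines; the notebook
# picks whichever one actually exists, so neither of us edits paths.
CANDIDATES = [Path("data/porto"), Path("/home/teammate/data/porto")]
DATA_PATH = next((p for p in CANDIDATES if p.exists()), CANDIDATES[0])
train_csv = DATA_PATH / "train.csv"
```

With this at the top of the notebook, the same cells run on both rigs without edits.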

Also, now that the competitions are over, we can release the code allowing everyone to see it in its unfiltered madness.

Getting data in the right place

Unlike other competitions, Porto's data is anonymized more than I expected. The column labels are nondescript, but at least the columns are labeled with their type of data: continuous, categorical, or boolean.

Mislabeled features can be a real problem for using records correctly. With nondescript labels, I could barely understand what each column was trying to capture, and it is much more difficult to go back and interpret the meaning of the values provided.

Fortunately, we can work through it. For example, a category with a value of 1 could be interpreted in the following ways:

  • As a boolean value: True, the insured car was in an accident
  • As a categorical value: the insured vehicle is a Ford
  • As a continuous value: this car has gone 1 mile

However, what happens if the next record is a 3, and how would that change the relationship?

  • A boolean value would not make sense, because booleans are only yes or no.
  • A categorical value would suggest the car is a Ferrari, not a Ford. That swap could drastically change the chance of a claim.
  • It could be continuous, but the difference between a car with 1 mile vs. 3 is likely insignificant.

As you can see, accidentally mislabeling a value can have a significant effect on the data.

Luckily, in this competition, we were told each column's category. However, I wanted to double-check them, so I ran some analytics:

  • A boolean should have at most 3 values (True, False, NaN).
  • A categorical column will likely have a unique-value count in the double digits, but not triple.
  • A continuous column will have many, many unique values. Going back to the mileage example, imagine all the different mileage counts that would appear. Almost every car would have a unique value! One for cars with 1 mile, 2 miles, 3 miles… etc.
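A rough sketch of that check (my own heuristic and thresholds, not anything provided by the competition) could look like this:

```python
def guess_kind(values, cat_max=100):
    """Guess a column's type from its unique-value count,
    ignoring the -1 missing-value marker."""
    n = len({v for v in values if v != -1})
    if n <= 3:
        return "bool"   # True/False (plus NaN) at most
    if n <= cat_max:
        return "cat"    # double digits of levels, but not triple
    return "cont"       # many, many unique values (e.g. mileage)

print(guess_kind([0, 1, 1, -1, 0]))                  # bool
print(guess_kind([2, 5, 7, 2, 9, 4, 8]))             # cat
print(guess_kind([i * 0.37 for i in range(500)]))    # cont
```

Running something like this over every column is a quick sanity check that the advertised types match the data.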

We can see below that the cats, bools, and conts all make lots of sense. At least we are not as blind as we were before.

Boolean and categories and continuous oh my…

The most significant oversight I made was how many values were missing and how to handle them (some columns had over 50% missing). We see these as NaN values, represented as a -1, in the code. Depending on the type of the data, there are different ways to handle them:

  • For booleans and categories, we can easily just add the NaN as an additional category. Not having a value can sometimes give just as much information as having it.
  • Continuous variables are a little trickier. Leaving them with a default of -1 can be goofy: assuming the model rated low-mileage vehicles favorably, any car missing a value would be rated more favorably than a brand-new car! What we tried at the last minute was replacing missing values with the mean of the other values, to minimize their impact on the prediction.
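The continuous-variable fix is only a few lines. A minimal sketch with toy numbers (not the actual competition columns):

```python
# Toy mileage column where -1 marks a missing value.
mileage = [10.0, 30.0, -1, 50.0, 20.0]

# Mean of the known values, so missing cars are scored neutrally
# instead of looking like ultra-low-mileage cars.
known = [m for m in mileage if m != -1]
mean = sum(known) / len(known)                    # 27.5
filled = [mean if m == -1 else m for m in mileage]
print(filled)   # [10.0, 30.0, 27.5, 50.0, 20.0]
```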

I think these methods worked out fine, but it goes to point out the difficulty in working with data anonymized in this manner.

Reverse engineering features

I had major problems training my model. The first thing I tried was to go back and properly classify the number of unique values in each category, to verify the data was labeled correctly and could not be improved. Even with the alterations, we continued to have problems.

Getting the right learning rate seemed fickle, and when we were using the Gini coefficient, it took some time for the score to move. There were just too many things to calculate.
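For reference, the competition scored submissions on the normalized Gini coefficient. A short re-implementation of the commonly shared Kaggle version (treat the details as a sketch, not the official scorer):

```python
def gini(actual, pred):
    # Sort actual outcomes by predicted score, best first, and compare
    # the cumulative-capture curve against a random ordering.
    ranked = [a for _, a in sorted(zip(pred, actual), key=lambda p: -p[0])]
    n, total = len(actual), sum(actual)
    cum = running = 0.0
    for a in ranked:
        running += a
        cum += running
    return (cum / total - (n + 1) / 2.0) / n

def gini_normalized(actual, pred):
    # 1.0 is a perfect ranking, 0.0 is random, negative is worse than random.
    return gini(actual, pred) / gini(actual, actual)

print(gini_normalized([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
```

Because the metric only cares about ranking, small changes in predicted probabilities can leave the score flat for a while, which is part of why it felt slow to move.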

At this point, we saw that most competitors dropped the less important features. The last-ditch effort was removing as many as I could to better understand what might have gone wrong. It did not help much.

Takeaways

When the dust settled, we placed in the top 16% of over 5,000 submissions. I certainly learned lots about how to coordinate better with my partner (thanks Devan!) and the importance of understanding the different types of features. More than anything, though, it helped me realize how important it is, when collecting data, to know what it looks like and how it should be represented. We are still a long way off from just throwing all our data into a black box to see what magic pops out the other side.

Disclaimer: Although I chose to work on a competition hosted by an insurance company, there is no overlap between my hobby of data science research and my responsibilities at AIG. Only personal computing resources, personal free time, and competition-provided data were used.


9-months in the “hobby” of deep learning

Deep learning, AI, machine learning, and all of those other buzzwords are spouting out everywhere. No domain is safe from marketers trying to use these terms to sell a product, and no startup would be caught dead without them (or blockchain). So, to enact my due diligence, I wanted to jump on the deep learning bandwagon. The problem was that my plate has been very full. These past 9 months have included:

  • Daddy duties
  • Husband duties
  • Work duties
  • A month of Executive courses
  • Preparing for DEFCON
  • Fighting off Hurricane Harvey

So can a professional just take up deep learning as a hobby? Sure can!

I had tried going through Andrew Ng’s Coursera course, but I quickly got sidetracked. Fortunately, I discovered Fast.AI (Jeremy Howard and Rachel Thomas) and launched myself through the first two modules. I even got accepted into their follow-up part 1v2 as an international fellow. Despite all the time factors, there is something addictive about getting my hands dirty running and altering the scripts.

The Rig and the joy of GPUs for Deep Learning

Some people love their cars or their guitars, but I am passionate about my computers. Lesson 1 from Jeremy is setting up an AWS instance. While the AWS instance worked, I quickly decided I needed to take advantage of the GTX 1080 that sat idle most days.

There are several links (here, here, and here) describing the best way to build a deep learning rig. Fortunately, it is mainly just a gaming machine putting on adult clothing, and it only took a little bit of tweaking to get scripts running. The most significant change I needed was buying an SSD to hold my training data and installing a fresh version of Ubuntu.

Setting up SSH to allow me to log in remotely has also been vital. Every morning I can spend about an hour drinking my coffee and getting ready for the day to start. Having a remote connection to my rig allows me to quickly pick up exactly where I left off and not need to carry around the machine with me. Indeed, my deep learning computer does not even have a monitor because I merely login remotely, even at home.

How have I not used GitHub?

I have known about GitHub for quite some time, but I have not routinely used it. These last few months I have gone from one project with 9 lines of code to about nine notebooks. I feel like I am barely scratching the surface. I pull, push, and clone, but there is so much more I have not gotten to yet. It does allow me to quickly update and share my work, which I find valuable in case my rig goes up in flames.

The Jupyter notebook

It was also shocking how much I have grown to love Jupyter notebooks. All the documentation and saved outputs are readily repeatable. Troubleshooting is much more comfortable for me, and a large part of working with data is just making sure I can accurately see its formats. Jupyter gives me that in an easy-to-understand way. I wish I had used it back with Kali for pentesting documentation, so that everything was both rapidly reproducible and documented.

The best features when dealing with larger training sets were the timing features for individual cells. Having the ability to see how long an iteration takes, and an alert when something completes, is very valuable. If data is taking 30 minutes to load, finding an alternative loading mechanism makes much more sense.
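In a notebook, the `%%time` cell magic handles this for you; outside Jupyter, the same bookkeeping is only a couple of lines (the million-element loop below is just a stand-in for a slow data load):

```python
import time

start = time.perf_counter()
data = [i * i for i in range(1_000_000)]   # pretend this is the slow load
elapsed = time.perf_counter() - start
print(f"loaded {len(data):,} rows in {elapsed:.2f}s")
```

Once you know a cell's cost, you can decide whether it is worth caching the result or finding a faster path.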

Are you solving problems?

If you are expecting to solve world hunger, we might be a ways off. However, an excellent standby for testing what you have learned is with Kaggle competitions. The fast.ai course has plenty of real-world problems with real-world data. Seeing what other groups are solving has been helping me think about what I can apply to work immediately. Not with a billion dollar budget, but with what I have right now. Here are my three favorites.

Cats and Dogs — Kaggle Competition

Everyone needs to figure out how to better identify cats and dogs, and this contest goes out of its way to keep the fight alive. Using several pre-trained models, users can predict whether an image is a cat or a dog. On my 2,000 images, that resulted in only 13 incorrect answers. Here are some random examples of the correct pictures.

So that is pretty good, and these predictions all make sense. However, when we look at the incorrectly classified cats, we see the following.

We can see why a computer might get these wrong. These are bad pictures. My two-year-old wouldn't get these either. The beauty of the project draws from its simplicity and ease of understanding. A great first project.

Statefarm — Kaggle Competition

I find this one much more interesting since it classifies human behavior. The images are split into multiple categories showing humans doing silly things while driving. While there are several defined activities, it is fun to catch people being jerks in many different ways. Most of the distracted driving is simple.

Is the driver distracted and if so how? Are they texting, yelling at someone on their phone, drinking a soda, or something else? These classifications are extremely easy for a person to describe. However, it has been only recently that you can start thinking about how to get a machine to learn them.

As an aside, you can really mess with the dataset for this one. If you have the same drivers in the training and validation sets, you can add bias to the models. For whatever reason, my first iterations incorrectly labeled drivers with glasses as distracted. Every. Time. Upon review, I discovered all my unsafe-class predictions for a particular category had glasses! Imagine an employee sending a dataset to production that made the same mistake with hair length or skin color. It is terrifying.

On the plus side, I feel that this could have the most profound impact on behavior. Imagine calling people out when they are slouching in chairs, or alerting when a child is climbing on something too dangerous. Alerts, warnings, and corrections can help keep people safe and drive changes in behavior to make us better and safer.

Web Traffic Time Series Forecasting — Kaggle Competition

I go into this competition more in-depth here. Since posting, the contest has ended, and I placed in the top 30%, which is pretty high up there considering it was the first competition I went into alone and unafraid.

The misunderstanding I had here (categorical vs. continuous variables) was something I was able to work on in my 2nd competition, so I am still thrilled about the placement. Additionally, it kicked off my deep love for non-picture problems. Plus, there are two similar competitions running right now which I will likely enter.

My next steps

In case you want to look at some unfinished code, you can check out my work here on GitHub. However, I would recommend that you instead go and start taking a look at Jeremy's course to get into deep learning. Seriously, take a day off work and try this. I learned mountains from it and think the way he approaches teaching others is fantastic.

It's not THAT hard.