Machine Learning with Random Forest and Cross Validation

This module dives into machine learning algorithms, specifically Random Forest, to predict events based on a set of attributes. This can be used for offer users specific recommendations based on their information or it can be used to assess the most important product features that leads to users performing specific actions. Whatever question you're trying to solve, machine learning techniques can be used to better understand complex interactions and attributes that are often difficult to tease out using standard statistical methods like t-tests.

This module acts more as a framework for you to implement whatever machine learning algorithm you want to implement, as I'm concentrating more on the process and implementation rather than the specific statistical model itself.

Data import, cleaning, and preparing

We're going to use a dummy dataset that lists people that have purchased health insurance. People are listed by their attributes (e.g., gender, age, industry they work in, household size, the number of children in their household, and the user's relationship to the person that purchased the health insurance. Lastly, we list whether or not the person has registered on the insurer's website (this is our target variable).

In [1]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
In [2]:
#import dataset as a pandas dataframe
df = pd.read_csv('dataset.csv')

90% of data science is data prep and cleaning. After importing our dataset, we want to make sure we understand how python is interpreting every variable and datapoint. We want to also dig around to see if there are any nulls or other values that might throw off the analysis.

You should always perform the below three commands on your dataset.

In [3]:
#describe dataset
print df.head()
print df.describe()
   registered gender  age                           industry  household_size  \
0           0      M   51  Health Care and Social Assistance             2.0   
1           0      F   29  Health Care and Social Assistance             2.0   
2           0      M   38  Health Care and Social Assistance             2.0   
3           0      F   39  Health Care and Social Assistance             4.0   
4           0      F   32  Health Care and Social Assistance             3.0   

   dependents relationship  
0           0       spouse  
1           0       spouse  
2           0         self  
3           1       spouse  
4           1         self  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
registered        1000 non-null int64
gender            1000 non-null object
age               1000 non-null int64
industry          1000 non-null object
household_size    850 non-null float64
dependents        1000 non-null int64
relationship      1000 non-null object
dtypes: float64(1), int64(3), object(3)
memory usage: 54.8+ KB
        registered          age  household_size  dependents
count  1000.000000  1000.000000      850.000000  1000.00000
mean      0.379000    36.966000        3.041176     0.45500
std       0.485381    13.541574        1.685369     0.49822
min       0.000000    18.000000        1.000000     0.00000
25%       0.000000    24.000000             NaN     0.00000
50%       0.000000    35.000000             NaN     0.00000
75%       1.000000    48.000000             NaN     1.00000
max       1.000000    79.000000        9.000000     1.00000
/Users/nrosidi/anaconda2/lib/python2.7/site-packages/numpy/lib/ RuntimeWarning: Invalid value encountered in percentile

Dealing with dirty data

It looks like python was able to interpret the data types correctly. Wherever there is a number, python interpretes the datatype as float or int, and wherever there's a string, the datatype is an object.

However, it looks like we have a few nulls in the household_size variable so we need to deal with this accordingly. I'm just going to drop the rows where household_size is null.

In [4]:
# deal with nulls and nas
df = df.dropna(how='any', subset=['household_size'])
<class 'pandas.core.frame.DataFrame'>
Int64Index: 850 entries, 0 to 999
Data columns (total 7 columns):
registered        850 non-null int64
gender            850 non-null object
age               850 non-null int64
industry          850 non-null object
household_size    850 non-null float64
dependents        850 non-null int64
relationship      850 non-null object
dtypes: float64(1), int64(3), object(3)
memory usage: 53.1+ KB

Processing and handling numerical data

If some variables have a large variance and some small, our model will bias towards the large variances. For example if you change one variable from km to cm (increasing its variance), it may go from having little impact to dominating all the other variables in the model.

If you want your model to be independent of such rescaling, standardizing the variables will do that. On the other hand, if the specific scale of your variables matters (in that you want your model to be in that scale), maybe you don't want to standardize.

In this example, we're going to normalize by transforming the vector so that it has unit norm. When data are thought of as random variables, normalizing means transforming to normal distribution. When the data are hypothesized to be normal, normalizing means transforming to unit variance.

In [5]:
# normalize data
df['age_normal'] = (df['age']- df['age'].mean())/(df['age'].std())
df['hh_normal'] = (df['household_size']- df['household_size'].mean())/(df['household_size'].std())
df['deps_normal'] = (df['dependents']- df['dependents'].mean())/(df['dependents'].std())

Dummifying the data

Since some of our variables are categorical, we need to split out the different categories into variables. For example, gender can be male or female. When we dummify, we make male and female into variables of their own (let's name the variables gender_male and gender_female). A male will then have a 1 under gender_male and a 0 under gender_female, thus transforming a categorical variable into a numerical one.

In [6]:
#dummify data
dummy_gender = pd.get_dummies(df.gender, prefix='gender')
dummy_relationship = pd.get_dummies(df.relationship, prefix='relationship')
dummy_industry = pd.get_dummies(df.industry, prefix='industry')

Now that we've transformed our categorical variables into numerical ones, we can delete the categorical variables and join the dummy variables into the original dataframe.

In [7]:
#drop original columns that have been normalized or dummified
df = df.drop(['gender','relationship','age','industry','household_size','dependents'], axis=1)
In [8]:
#join dummy data columns to dataframe
df = df.join(dummy_gender)
df = df.join(dummy_relationship)
df = df.join(dummy_industry)
In [9]:
#let's take a look at our new dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 850 entries, 0 to 999
Data columns (total 21 columns):
registered                                                   850 non-null int64
age_normal                                                   850 non-null float64
hh_normal                                                    850 non-null float64
deps_normal                                                  850 non-null float64
gender_F                                                     850 non-null float64
gender_M                                                     850 non-null float64
relationship_self                                            850 non-null float64
relationship_spouse                                          850 non-null float64
industry_Educational Services                                850 non-null float64
industry_Finance and Insurance                               850 non-null float64
industry_Health Care and Social Assistance                   850 non-null float64
industry_Information                                         850 non-null float64
industry_Manufacturing                                       850 non-null float64
industry_Other Services (except Public Administration)       850 non-null float64
industry_Professional, Scientific, and Technical Services    850 non-null float64
industry_Public Administration                               850 non-null float64
industry_Retail Trade                                        850 non-null float64
industry_Transportation and Warehousing                      850 non-null float64
industry_Unknown                                             850 non-null float64
industry_Utilities                                           850 non-null float64
industry_Wholesale Trade                                     850 non-null float64
dtypes: float64(20), int64(1)
memory usage: 146.1 KB

Preparing the model

We're going to be using a Random Forest Classifier which constructs several decision trees to generate classifications (i.e., predictions) based on variables in the dataset. Briefly, a random forest is made up of many decisions trees where each tree gives a classification (i.e., votes). The forest then chooses the classification having the most votes.

We'll split our variables into features and target. The target is the classification itself -- did the user register or not? The features are the user attributes such as their age, household size, number of dependents, industry, and gender. So the question is -- can we accurately predict whether or not a user will register based on their attributes?

In [11]:
# identify the feature and target dataset
features = df[1:].values
target = df['registered'].values

We want to split our dataset into a training and test set. Our model with train the algorithm using the training set and then test the model on the test set, giving us an accuracy score as an output. We also want to estimate how accurately our model makes predictions. So in order to do both we use KFold() which will give us as many indexes as we want for training and testing purposes.

In [13]:
def cross_validate(features, target, classifier, k_fold) :
    '''Calculates average accuracy of classification 
    algorithm using kfold crossvalidation'''
    # derive a set of (random) training and testing indices
    k_fold_indices = KFold(len(features), n_folds=k_fold,
                           shuffle=True, random_state=0)
    # for each training and testing slices run the classifier, and score the results
    k_score_total = 0
    for train_slice, test_slice in k_fold_indices :
        model =[train_slice],
        k_score = model.score(features[test_slice],
        k_score_total += k_score
    # return the average accuracy
    return k_score_total/k_fold

We also want to test the accuracy of our model as we change the number of decision trees in the forest. We want to see how the number of trees affect the accuracy of our results and hopefully find a maximum.

In [14]:
for n in range(10,51,10):
    model = RandomForestClassifier(n)
    score = cross_validate(features, target, model, 5)
    print '{0} estimators for RF - {1}'.format(n, score)
10 estimators for RF - 0.652593108249
20 estimators for RF - 0.654959972155
30 estimators for RF - 0.656122520014
40 estimators for RF - 0.65847546119
50 estimators for RF - 0.659644970414

Our results show that we can achieve ~66% accuracy in predicting whether or not a user registers based on their attributes. That's not really very accurate but it's better than a coin flip.