This module dives into machine learning algorithms, specifically Random Forest, to predict events based on a set of attributes. This can be used to offer users specific recommendations based on their information, or to assess which product features are most important in driving users to perform specific actions. Whatever question you're trying to answer, machine learning techniques can help you understand complex interactions among attributes that are often difficult to tease out using standard statistical methods like t-tests.
This module acts more as a framework for implementing whatever machine learning algorithm you choose; I'm concentrating on the process and implementation rather than on the specific statistical model itself.
We're going to use a dummy dataset that lists people who have purchased health insurance. People are listed by their attributes (e.g., gender, age, the industry they work in, household size, the number of children in their household, and the person's relationship to the policyholder). Lastly, we list whether or not the person has registered on the insurer's website (this is our target variable).
#import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
#import dataset as a pandas dataframe
df = pd.read_csv('dataset.csv')
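If you don't have a dataset.csv to hand, you can substitute a small toy dataframe with the same columns in place of the pd.read_csv() call above. The values below are invented purely for illustration (and you'd want considerably more rows than this to actually run the 5-fold cross-validation at the end), but the column names match the ones used throughout this module.
#optional: toy stand-in for dataset.csv (made-up values, for following along only)
df = pd.DataFrame({
    'gender':         ['male', 'female', 'female', 'male'],
    'age':            [34, 29, 52, 41],
    'industry':       ['retail', 'tech', 'healthcare', 'retail'],
    'household_size': [3, 1, 4, 2],
    'dependents':     [1, 0, 2, 0],
    'relationship':   ['self', 'self', 'spouse', 'self'],
    'registered':     [1, 0, 1, 0]
})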
90% of data science is data prep and cleaning. After importing our dataset, we want to make sure we understand how Python is interpreting each variable and data point. We also want to dig around for nulls or other values that might throw off the analysis.
You should always run the following three commands on a new dataset.
#describe dataset
print(df.head())
df.info()
print(df.describe())
Dealing with dirty data
It looks like Python interpreted the data types correctly. Wherever there is a number, the column is read as a float or int, and wherever there's a string, the dtype is object.
However, it looks like we have a few nulls in the household_size variable, so we need to handle them before modeling. I'm just going to drop the rows where household_size is null (an alternative, imputing a value instead of dropping rows, is sketched after the code below).
# deal with nulls and nas
df = df.dropna(how='any', subset=['household_size'])
df.info()
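Dropping rows is the simplest option, but it throws data away. If you'd rather keep those rows, one common alternative is to fill the nulls with a summary statistic such as the median; here's a minimal sketch (the choice of the median over the mean is just an assumption for illustration, and this isn't used in the rest of the module):
# alternative: impute missing household_size values with the column median
# instead of dropping the rows (illustrative sketch only)
median_hh = df['household_size'].median()
df['household_size'] = df['household_size'].fillna(median_hh)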
Processing and handling numerical data
If some variables have a large variance and others a small one, many models will be biased towards the large variances. For example, if you change one variable from km to cm (increasing its variance), it may go from having little impact to dominating all the other variables in the model.
If you want your model to be independent of such rescaling, standardizing the variables will do that. On the other hand, if the specific scale of your variables matters (i.e., you want your model to work in that scale), you may not want to standardize. (Tree-based models like Random Forests are largely insensitive to rescaling, but standardizing keeps this pipeline reusable with scale-sensitive algorithms.)
In this example, we're going to standardize each numerical variable by subtracting its mean and dividing by its standard deviation (a z-score), so that each variable ends up with zero mean and unit variance.
# standardize data (z-score: subtract the mean, divide by the standard deviation)
df['age_normal'] = (df['age'] - df['age'].mean()) / df['age'].std()
df['hh_normal'] = (df['household_size'] - df['household_size'].mean()) / df['household_size'].std()
df['deps_normal'] = (df['dependents'] - df['dependents'].mean()) / df['dependents'].std()
Dummifying the data
Since some of our variables are categorical, we need to split out the different categories into variables. For example, gender can be male or female. When we dummify, we make male and female into variables of their own (let's name the variables gender_male and gender_female). A male will then have a 1 under gender_male and a 0 under gender_female, thus transforming a categorical variable into a numerical one.
#dummify data
dummy_gender = pd.get_dummies(df.gender, prefix='gender')
dummy_relationship = pd.get_dummies(df.relationship, prefix='relationship')
dummy_industry = pd.get_dummies(df.industry, prefix='industry')
Now that we've transformed our categorical variables into numerical ones, we can delete the categorical variables and join the dummy variables into the original dataframe.
#drop original columns that have been normalized or dummified
df = df.drop(['gender','relationship','age','industry','household_size','dependents'], axis=1)
#join dummy data columns to dataframe
df = df.join(dummy_gender)
df = df.join(dummy_relationship)
df = df.join(dummy_industry)
#let's take a look at our new dataframe
df.info()
We're going to be using a Random Forest Classifier, which constructs several decision trees to generate classifications (i.e., predictions) based on the variables in the dataset. Briefly, a random forest is made up of many decision trees, where each tree gives a classification (i.e., votes). The forest then chooses the classification with the most votes.
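To make the voting idea concrete, here's a minimal conceptual sketch: fit a handful of decision trees on bootstrap samples and take the majority vote of their predictions. This is an illustration of the idea, not what we'll use below, and it simplifies scikit-learn's internals (which average tree probabilities rather than counting hard votes).
# conceptual sketch of bagging + majority voting (illustrative only)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_forest_predict(X_train, y_train, X_test, n_trees=10, seed=0):
    rng = np.random.RandomState(seed)
    votes = []
    for _ in range(n_trees):
        # bootstrap sample: draw rows with replacement
        idx = rng.randint(0, len(X_train), len(X_train))
        tree = DecisionTreeClassifier(max_features='sqrt', random_state=rng)
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.array(votes)  # shape: (n_trees, n_samples)
    # majority vote per test sample (assumes integer class labels, e.g. 0/1)
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)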
We'll split our variables into features and target. The target is the classification itself -- did the user register or not? The features are the user attributes such as their age, household size, number of dependents, industry, and gender. So the question is -- can we accurately predict whether or not a user will register based on their attributes?
# identify the feature and target dataset
features = df.drop('registered', axis=1).values
target = df['registered'].values
We want to split our dataset into a training and test set. We'll train the algorithm on the training set and then evaluate the model on the test set, giving us an accuracy score as an output. We also want to estimate how accurately our model makes predictions on data it hasn't seen. To do both, we use KFold(), which gives us as many train/test index splits as we want.
def cross_validate(features, target, classifier, k_fold):
    '''Calculates the average accuracy of a classification
    algorithm using k-fold cross-validation'''
    # derive a set of (random) training and testing indices
    k_fold_indices = KFold(n_splits=k_fold, shuffle=True, random_state=0)
    # for each training/testing split, fit the classifier and score the results
    k_score_total = 0
    for train_slice, test_slice in k_fold_indices.split(features):
        model = classifier.fit(features[train_slice], target[train_slice])
        k_score = model.score(features[test_slice], target[test_slice])
        k_score_total += k_score
    # return the average accuracy
    return k_score_total / k_fold
We also want to test the accuracy of our model as we change the number of decision trees in the forest. We want to see how the number of trees affects the accuracy of our results and hopefully find a maximum.
for n in range(10, 51, 10):
    model = RandomForestClassifier(n_estimators=n)
    score = cross_validate(features, target, model, 5)
    print('{0} estimators for RF - {1}'.format(n, score))
Our results show that we can achieve roughly 66% accuracy in predicting whether or not a user registers based on their attributes. That's not especially accurate, but it's better than a coin flip.
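Since the intro mentioned assessing which attributes matter most, one quick follow-up is to look at a fitted forest's feature_importances_ attribute. Here's a sketch, assuming you refit one model on the full feature matrix from above (the variable names here, like final_model, are just illustrative):
# fit one final model and inspect which features drive the predictions
feature_names = df.drop('registered', axis=1).columns
final_model = RandomForestClassifier(n_estimators=50).fit(features, target)
for name, importance in sorted(zip(feature_names, final_model.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print('{0}: {1:.3f}'.format(name, importance))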