The dataset we're going to analyze contains login and email send rates, categorized by email type, for each user. These days it's simple to track emails from send to open to click, with a login as the email's call to action. But what if not all your emails have a call to action? What are the effects of multiple emails and multiple email types on user logins? Can you maximize login rate through the type of email you send your users?
In this exercise, we'll take a look at three different types of emails and analyze how each one relates to the probability of a login.
In our dataset we have five columns: user_id, login, welcome_email, newsletter_email, and targeted_email (i.e., personalized email). The independent variables are the email types. The dependent variable is login, which captures whether or not the user logged in within 3 days before or after receiving an email. The dependent variable needs to be either 1 or 0; in this case, 1 indicates a user login and 0 indicates no login.
# import libraries
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
# load the data into a DataFrame
df = pd.read_csv('emails.csv')
For larger datasets, I would suggest querying your database directly rather than loading a CSV file. You'll need to install a Python database driver to query your database from Python; for example, psycopg2 for Postgres databases.
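As a rough sketch of what querying the database directly might look like, the snippet below uses pandas' read_sql_query. To stay self-contained it runs against an in-memory SQLite database from the standard library. For Postgres you would pass a different connection object, for example a SQLAlchemy engine backed by psycopg2, which is the kind of connectable the pandas documentation recommends for databases other than SQLite. The table name email_stats and its three rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# Stand-in database: an in-memory SQLite DB holding a hypothetical table.
# With Postgres, this connection would come from your own driver/engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email_stats ("
    "user_id INTEGER, login INTEGER, welcome_email INTEGER, "
    "newsletter_email INTEGER, targeted_email INTEGER)"
)
conn.executemany(
    "INSERT INTO email_stats VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 3, 0), (2, 0, 0, 2, 0), (3, 1, 0, 4, 1)],  # made-up rows
)
conn.commit()

# Pull only the columns needed for the regression straight into a DataFrame.
query = (
    "SELECT user_id, login, welcome_email, newsletter_email, targeted_email "
    "FROM email_stats"
)
db_df = pd.read_sql_query(query, conn)
conn.close()

print(db_df.head())
```

Selecting only the needed columns in the query keeps the transfer small, which matters more as the table grows.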
# ensure that the dataset loaded successfully
print(df.head())
# remove the user_id column since it's not needed for our regression
del df['user_id']
# print out some descriptive statistics to get a sense of the data
print(df.describe())
Our data contains over 1 million users, who received on average 0.23 welcome emails, 2.7 newsletters, and 0.074 targeted emails over the timeframe captured (in this case, 3 months).
Our independent variables measure the number of emails a user received, categorized by email type. But what if an independent variable is categorical? For example, what if we had an independent variable called prestige that describes the prestige of the college a student is admitted into? Values of prestige are numbers between one and four, indicating the level of prestige for a given college. A logit regression would treat the numbers 1 to 4 as a numeric quantity rather than as separate categories, so we would need to split prestige into four binary variables, with 1s and 0s for each prestige level. This way a student would be given a 1 or 0 for each of the variables prestige_1, prestige_2, prestige_3, and prestige_4. (When the model also includes an intercept, one of the four is left out of the regression to serve as the baseline.)
Fortunately, this is easy using pandas. We can use the get_dummies function to create a new DataFrame that breaks prestige up into four columns with binary values for each category.
For more information, go here: http://blog.yhat.com/posts/logistic-regression-python-rodeo.html
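Here is a minimal sketch of that dummy-variable step on a toy DataFrame. The admit column and the five rows are invented for illustration and are not part of the email dataset. Dropping prestige_1 as the baseline is one common choice when the model includes an intercept, not the only one.

```python
import pandas as pd

# Toy data: 'admit' and the values below are illustrative only.
students = pd.DataFrame({
    "admit": [0, 1, 1, 0, 1],
    "prestige": [3, 1, 4, 2, 1],
})

# One binary column per prestige level: prestige_1 ... prestige_4.
# astype(int) gives plain 0/1 values regardless of pandas version.
dummies = pd.get_dummies(students["prestige"], prefix="prestige").astype(int)

# With an intercept in the model, keep all levels but one; the dropped
# level (here prestige_1) becomes the baseline the others are compared to.
students_encoded = students[["admit"]].join(dummies.drop(columns="prestige_1"))

print(dummies)
print(students_encoded)
```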
# since we're using statsmodels, we need to manually add the intercept
df['intercept'] = 1.0
# identify the independent variables
ind_vars = df.columns[1:]
# specify the model: login (dependent variable) regressed on the independent variables
logit = sm.Logit(df['login'], df[ind_vars])
# fit the model and print out the output summary
result = logit.fit()
print(result.summary())
The output gives you a breakdown of the input parameters and the output analysis. A sanity check reveals that the dependent variable was login and that there were over 1 million observations. We also see the list of independent variables (email types), along with their coefficients, significance, and confidence intervals.
It's difficult to make sense of the coefficients because they are expressed on the log-odds scale, that is, as changes in the natural log of the odds:
ln[p/(1-p)] = a + BX
So in order to better understand how one additional email of a given type affects the odds of a user logging in, we'll need to take the exponential of the coefficients:
[p/(1-p)] = exp(a + BX)
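To see why exponentiating helps, here is a small worked example that converts log-odds to odds and then to a probability. The intercept a = -1.0 and coefficient b = 0.5 are hypothetical numbers, not estimates from the email data.

```python
import math


def log_odds_to_probability(log_odds):
    """Invert ln[p/(1-p)]: recover p from a log-odds value."""
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


# Hypothetical intercept and coefficient, for illustration only.
a = -1.0
b = 0.5

for x in (0, 1, 2):
    lo = a + b * x
    print("emails:", x,
          "odds:", round(math.exp(lo), 3),
          "probability:", round(log_odds_to_probability(lo), 3))

# Each extra unit of x multiplies the odds by exp(b), whatever x is.
odds_ratio = math.exp(a + b * 1) / math.exp(a + b * 0)
print("odds ratio per extra email:", round(odds_ratio, 3))
```

The multiplier on the odds is constant, but the change in probability is not: it depends on where on the curve you start.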
Let's also add in the confidence interval so that we have an understanding of the range of values.
# output the odds ratios and 95% CI
conf = result.conf_int()
conf['odds_ratio'] = result.params
conf.columns = ['2.5%', '97.5%', 'odds_ratio']
print(np.exp(conf))
So what do our results mean? What does an odds ratio of 0.999 for welcome emails mean?
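Odds Ratios: The odds, p/(1-p), are the probability of an event occurring divided by the probability of it not occurring. The odds ratio for an email type is the factor by which those odds are multiplied for each additional email of that type. An odds ratio of 0.999 for welcome emails therefore means each additional welcome email multiplies the odds of a login by 0.999, a decrease of about 0.1%, which is practically no effect.
Confidence Interval: The confidence interval gives us an idea of the range of values we can expect (or how confident we are that the odds ratio is within the calculated range). If the interval includes 1, we can't rule out that the email type has no effect on the odds.
Significance (P-value): The logit results indicated that welcome emails had a p-value of 0.05. This means that if welcome emails had no effect on login, we'd still see an association at least this strong in about 5% of samples purely due to random sampling error. The same logic applies to the other variables.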
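To make the 0.999 figure concrete, the sketch below turns an odds ratio into a percent change in the odds, both for one email and compounded over several, and checks whether a confidence interval contains 1. The helper names and the interval bounds are hypothetical, chosen only to illustrate the interpretation.

```python
def percent_change_in_odds(odds_ratio, n_emails=1):
    """Percent change in the odds after n_emails additional emails.

    Odds ratios compound multiplicatively, so n emails scale the
    odds by odds_ratio ** n_emails.
    """
    return (odds_ratio ** n_emails - 1.0) * 100.0


def ci_contains_one(lower, upper):
    """True if the interval includes 1, i.e. 'no effect' is plausible."""
    return lower <= 1.0 <= upper


# Odds ratio quoted in the text for welcome emails.
print(round(percent_change_in_odds(0.999), 3))       # one extra email
print(round(percent_change_in_odds(0.999, 10), 3))   # ten extra emails

# Hypothetical interval bounds, for illustration only.
print(ci_contains_one(0.998, 1.001))
print(ci_contains_one(1.02, 1.10))
```

Even ten additional emails at an odds ratio of 0.999 shift the odds by only about 1%, which is why a ratio this close to 1 reads as no practical effect.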
What about the effect of multiple email types on logins? What would the predicted probabilities of a login be? We'll cover this topic next.