Logistic Regression on email types to logins

The dataset we're going to analyze contains login and email send rates, categorized by email type, by user. These days it's simple to track email sends to opens to clicks and have the call to action in the email be a login. But what if not all your emails have a call to action? What are the affects of multiple emails and multiple email types on user logins? Can you maximize login rate by the type of email you send your users?

In this exercise, we'll take a look at three different types of emails and analyze their probabilities for a login.

Cleaning and preparing the data

In our dataset we have five columns -- user_id, login, welcome_email, newsletter_email, and targeted_email (i.e., personalized email). The independent variables are the email types. The dependent variable is login, specifically, which captures whether or not the user logged in +/- 3 days after receiving and email. The dependent variable needs to be either a 1 or 0, and in this case, 1 indicates a user login while 0 indicates no login.

In [ ]:
# import libraries
import pandas as pd 
import statsmodels.api as sm 
import pylab as pl 
import numpy as np
In [ ]:
# load the data into a DataFrame
df = pd.read_csv('emails.csv')

For larger datasets, I would suggest querying your database directly rather than uploading a csv file. You'll need to download a python wrapper to query your database directly from python. For example, psycopg2 would be needed to query postgres databases.

In [31]:
# ensure that the dataset loaded successfully
print df.head()
    user_id  login  welcome_email  newsletter_email  targeted_email
0   2979116      0              4                20               3
1   8642741      0              0                 0               0
2   8067968      0              0                32               0
3  10770610      0              0                 0               0
4   7257140      0              0                 0               0
In [32]:
# remove the user_id column since it's not needed for our regression
del df['user_id']
In [33]:
# print out some descriptive statistics to get a sense of the data
print df.describe()
              login  welcome_email  newsletter_email  targeted_email
count  1.061498e+06   1.061498e+06      1.061498e+06    1.061498e+06
mean   1.238175e-01   2.339938e-01      2.705382e+00    7.379194e-02
std    3.293734e-01   1.328870e+00      5.979231e+00    5.016193e-01
min    0.000000e+00   0.000000e+00      0.000000e+00    0.000000e+00
25%    0.000000e+00   0.000000e+00      0.000000e+00    0.000000e+00
50%    0.000000e+00   0.000000e+00      0.000000e+00    0.000000e+00
75%    0.000000e+00   0.000000e+00      3.000000e+00    0.000000e+00
max    1.000000e+00   1.510000e+02      1.790000e+02    3.400000e+01

Our data contains over 1 million users, with users receiving an average of 0.23 welcome type emails, 2.7 newsletters, and 0.074 targeted emails in the timeframe captured (in this case 3 months).

Dealing with categorical variables

Our independent variables are a measure of the number of emails a user received, categorized by the type of email. But what if the independent variable contains category types? For example, what if we had an independent cateogry called prestige that describes the prestige of the college a student is admitted into? Data within prestige is a number between one and four, indicating the level of prestige for a given college. Our logit regression won't know how to deal with the numbers 1 to 4, so we would need to separate prestige into four independent variables, with 1s and 0s for each prestige level. This way a student would be given a 1 or 0 for the variables prestige_1, prestige_2, prestige_3, and prestige_4.

Fortuntely, this is easy using pandas. We can use the get_dummies function to create a new DataFrame that breaks prestige up into four columns with binary values for each category.

For more information, go here: http://blog.yhat.com/posts/logistic-regression-python-rodeo.html

Performing the regression

In [34]:
# since we're using statsmodels, we need to manually add the intercept
df['intercept'] = 1.0
In [36]:
# identify the independent variables
ind_vars = df.columns[1:]

# train on dependent variables
logit = sm.Logit(df['login'], df[ind_vars])

# fit the model and print out the output summary
result = logit.fit()
print result.summary()
Optimization terminated successfully.
         Current function value: 0.345503
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                  login   No. Observations:              1061498
Model:                          Logit   Df Residuals:                  1061494
Method:                           MLE   Df Model:                            3
Date:                Thu, 20 Oct 2016   Pseudo R-squ.:                 0.07734
Time:                        21:41:30   Log-Likelihood:            -3.6675e+05
converged:                       True   LL-Null:                   -3.9749e+05
                                        LLR p-value:                     0.000
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
welcome_email        0.0038      0.002      1.958      0.050     -4.15e-06     0.008
newsletter_email     0.0914      0.000    217.685      0.000         0.091     0.092
targeted_email       0.3356      0.005     69.082      0.000         0.326     0.345
intercept           -2.3468      0.004   -642.702      0.000        -2.354    -2.340

The output gives you a breakdown of the input parameters and output analysis. A sanity check reveils that the dependent variable was login and that there were over 1 million observations. We also see the list of independent variables (email types), their coefficients, significance, and confidence interval.

It's difficult to make sense of the coefficients because it's presented as the coefficients for the natural log of the odds ratio:

                                ln[p/(1-p)] = a + BX + e 

So in order to better understand how a one unit increase or decrease in email type affects the odds of a user logging in, we'll need to take the exponential of the coefficients:

                                [p/(1-p)] = exp(a + BX + e)

Let's also add in the confidence interval so that we have an understanding of the range of values.

In [37]:
# output the odds ratios and 95% CI
conf = result.conf_int()
conf['odds_ratio'] = result.params
conf.columns = ['2.5%', '97.5%', 'odds_ratio']
print np.exp(conf)
                      2.5%     97.5%  odds_ratio
welcome_email     0.999996  1.007682    1.003832
newsletter_email  1.094853  1.096658    1.095755
targeted_email    1.385575  1.412216    1.398832
intercept         0.094992  0.096362    0.095675

Interpreting the results

So what does our results mean? What does a 0.999 odds ratio coefficient for welcome emails mean?

Odds Ratios [p/(1-p)] The odds ratio is the probability of event occuring divided by the probability of event not occuring

  • welcome email: 0.99 = 1/1 = so the probability of a welcome email leading to a user login is 50%
  • newsletter email: 1.09 = 1/1 = so the probabilty of a newsletter leading to a user login is 50%
  • targeted email: 1.4 = 7/5 = so the probabilty of a targeted email leading to a user login is 58%

Confidence Interval The confidence interval gives us an idea of the range of values we can expect (or how confidence we are that the odds ratio is within the calculated range).

Signifiance (P-value) The logit results indicated that welcome emails had a p-value of 0.05. This means that if the welcome emails had no effect on login, we'd obtain similar user behavior in 5% of our data due to random sampling error. The same logic applies to the other variables.

But what about compound effects?

What about the affect of multiple email types to login? What would be the predicted probabilities of logins? We'll cover this topic next.