The dataset we're going to analyze contains login and email send rates, categorized by email type, by user. These days it's simple to track email sends to opens to clicks and have the call to action in the email be a login. But what if not all your emails have a call to action? What are the effects of multiple emails and multiple email types on user logins? Can you maximize login rate by the type of email you send your users?
In this exercise, we'll take a look at three different types of emails and analyze their probabilities for a login.
In our dataset we have five columns -- user_id, login, welcome_email, newsletter_email, and targeted_email (i.e., personalized email). The independent variables are the email types. The dependent variable is login, which captures whether or not the user logged in within +/- 3 days of receiving an email. The dependent variable needs to be either a 1 or a 0; in this case, 1 indicates a user login while 0 indicates no login.
# import libraries
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
# load the data into a DataFrame
df = pd.read_csv('emails.csv')
For larger datasets, I would suggest querying your database directly rather than uploading a CSV file. You'll need a Python driver to query your database directly from Python. For example, psycopg2 would be needed to query Postgres databases.
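As a minimal, self-contained sketch of that pattern, here's the same idea using Python's built-in sqlite3 as a stand-in for a real Postgres connection (the emails table here is made up to mirror our CSV's columns). With psycopg2 you'd swap the connect call for psycopg2.connect(...) and keep the same pd.read_sql usage:

```python
import sqlite3

import pandas as pd

# stand-in for psycopg2.connect(...) against a Postgres database
conn = sqlite3.connect(':memory:')

# hypothetical emails table mirroring our CSV's columns
conn.execute("""CREATE TABLE emails
                (user_id INTEGER, login INTEGER, welcome_email INTEGER,
                 newsletter_email INTEGER, targeted_email INTEGER)""")
conn.execute("INSERT INTO emails VALUES (2979116, 0, 4, 20, 3)")
conn.commit()

# pull the query result straight into a DataFrame
df = pd.read_sql('SELECT * FROM emails', conn)
print(df.shape)  # (1, 5)
```

The win is that pd.read_sql hands you a DataFrame directly, so nothing changes downstream in the analysis.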
# ensure that the dataset loaded successfully
print(df.head())
    user_id  login  welcome_email  newsletter_email  targeted_email
0   2979116      0              4                20               3
1   8642741      0              0                 0               0
2   8067968      0              0                32               0
3  10770610      0              0                 0               0
4   7257140      0              0                 0               0
# remove the user_id column since it's not needed for our regression
del df['user_id']
# print out some descriptive statistics to get a sense of the data
print(df.describe())
              login  welcome_email  newsletter_email  targeted_email
count  1.061498e+06   1.061498e+06      1.061498e+06    1.061498e+06
mean   1.238175e-01   2.339938e-01      2.705382e+00    7.379194e-02
std    3.293734e-01   1.328870e+00      5.979231e+00    5.016193e-01
min    0.000000e+00   0.000000e+00      0.000000e+00    0.000000e+00
25%    0.000000e+00   0.000000e+00      0.000000e+00    0.000000e+00
50%    0.000000e+00   0.000000e+00      0.000000e+00    0.000000e+00
75%    0.000000e+00   0.000000e+00      3.000000e+00    0.000000e+00
max    1.000000e+00   1.510000e+02      1.790000e+02    3.400000e+01
Our data contains over 1 million users, with users receiving an average of 0.23 welcome type emails, 2.7 newsletters, and 0.074 targeted emails in the timeframe captured (in this case 3 months).
Our independent variables are a measure of the number of emails a user received, categorized by the type of email. But what if an independent variable contains category types? For example, what if we had an independent category called prestige that describes the prestige of the college a student is admitted into? Data within prestige is a number between one and four, indicating the level of prestige for a given college. Our logit regression won't know how to deal with the numbers 1 to 4 as categories, so we would need to separate prestige into four independent variables, with 1s and 0s for each prestige level. This way each student would be given a 1 or 0 for each of the four prestige variables.
Fortunately, this is easy using pandas. We can use the get_dummies function to create a new DataFrame that breaks prestige up into four columns with binary values for each category.
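Here's a quick sketch of that split, using a made-up prestige column (this data isn't from our email dataset):

```python
import pandas as pd

# hypothetical admissions data: prestige is a category coded 1-4
students = pd.DataFrame({'prestige': [1, 2, 2, 3, 4]})

# one column per prestige level, with 1s and 0s for each category
dummies = pd.get_dummies(students['prestige'], prefix='prestige')
print(dummies.columns.tolist())
# ['prestige_1', 'prestige_2', 'prestige_3', 'prestige_4']
```

Each row now has a 1 in exactly one of the four prestige columns, which is the form a logit regression can work with.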
For more information, go here: http://blog.yhat.com/posts/logistic-regression-python-rodeo.html
# since we're using statsmodels, we need to manually add the intercept
df['intercept'] = 1.0
# identify the independent variables
ind_vars = df.columns[1:]

# regress login on the independent variables
logit = sm.Logit(df['login'], df[ind_vars])

# fit the model and print out the output summary
result = logit.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.345503
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:                  login   No. Observations:              1061498
Model:                          Logit   Df Residuals:                  1061494
Method:                           MLE   Df Model:                            3
Date:                Thu, 20 Oct 2016   Pseudo R-squ.:                 0.07734
Time:                        21:41:30   Log-Likelihood:            -3.6675e+05
converged:                       True   LL-Null:                   -3.9749e+05
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
welcome_email        0.0038      0.002      1.958      0.050  -4.15e-06     0.008
newsletter_email     0.0914      0.000    217.685      0.000      0.091     0.092
targeted_email       0.3356      0.005     69.082      0.000      0.326     0.345
intercept           -2.3468      0.004   -642.702      0.000     -2.354    -2.340
====================================================================================
The output gives you a breakdown of the input parameters and output analysis. A sanity check reveals that the dependent variable was login and that there were over 1 million observations. We also see the list of independent variables (email types), their coefficients, significance, and confidence intervals.
It's difficult to make sense of the coefficients because they're presented on the scale of the natural log of the odds:
ln[p/(1-p)] = a + BX + e
So in order to better understand how a one unit increase or decrease in email type affects the odds of a user logging in, we'll need to take the exponential of the coefficients:
[p/(1-p)] = exp(a + BX + e)
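For instance, taking the targeted_email coefficient (0.3356) from the summary above, a one-unit increase multiplies the odds of login by roughly 1.4:

```python
import numpy as np

# coefficient for targeted_email from the fitted model above
coef = 0.3356

# exponentiate to convert log-odds into an odds ratio
odds_ratio = np.exp(coef)
print(round(odds_ratio, 3))  # 1.399
```

This matches the targeted_email row of the odds-ratio table we compute next.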
Let's also add in the confidence interval so that we have an understanding of the range of values.
# output the odds ratios and 95% CI
conf = result.conf_int()
conf['odds_ratio'] = result.params
conf.columns = ['2.5%', '97.5%', 'odds_ratio']
print(np.exp(conf))
                      2.5%     97.5%  odds_ratio
welcome_email     0.999996  1.007682    1.003832
newsletter_email  1.094853  1.096658    1.095755
targeted_email    1.385575  1.412216    1.398832
intercept         0.094992  0.096362    0.095675
So what do our results mean? What does an odds ratio of roughly 1.00 for welcome emails mean?
Odds Ratios [p/(1-p)] The odds ratio is the probability of an event occurring divided by the probability of the event not occurring.
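As a quick numeric check (using a made-up probability, not a value from our dataset), the odds formula is easy to apply and invert:

```python
# a hypothetical login probability of 20%
p = 0.2

# odds = p / (1 - p)
odds = p / (1 - p)
print(odds)  # 0.25

# invert: recover the probability from the odds
p_back = odds / (1 + odds)
print(p_back)  # 0.2
```

The inversion is what we'd use later to turn model odds back into predicted probabilities.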
Confidence Interval The confidence interval gives us an idea of the range of values we can expect (or how confident we are that the odds ratio is within the calculated range).
The logit results indicated that welcome emails had a p-value of 0.05. This means that if welcome emails had no effect on login, we'd still see an association this strong in about 5% of samples purely due to random sampling error. The same logic applies to the other variables.
What about the effect of multiple email types on login? What would be the predicted probabilities of logins? We'll cover this topic next.