The dataset we're going to analyze contains login and email send rates, categorized by email type, for each user. These days it's simple to track email sends to opens to clicks and have the call to action in the email be a login. But what if not all your emails have a call to action? What are the effects of multiple emails and multiple email types on user logins? Can you maximize the login rate by choosing the type of email you send your users?

In this exercise, we'll take a look at three different types of emails and analyze their probabilities for a login.

In [ ]:

```
# import libraries
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
```

In [ ]:

```
# load the data into a DataFrame
df = pd.read_csv('emails.csv')
```

In [31]:

```
# ensure that the dataset loaded successfully
print(df.head())
```

In [32]:

```
# remove the user_id column since it's not needed for our regression
del df['user_id']
```

In [33]:

```
# print out some descriptive statistics to get a sense of the data
print(df.describe())
```

Our independent variables are a measure of the number of emails a user received, categorized by the type of email. But what if an independent variable contains category types? For example, what if we had an independent category called `prestige` that describes the prestige of the college a student is admitted into? Data within `prestige` is a number between one and four, indicating the level of prestige for a given college. Our logit regression would treat the numbers 1 to 4 as a continuous quantity rather than as separate categories, so we would need to separate `prestige` into four independent variables, with 1s and 0s for each prestige level. This way a student would be given a 1 or 0 for the variables `prestige_1`, `prestige_2`, `prestige_3`, and `prestige_4`. (When the model also has an intercept, one of the four is left out and serves as the baseline level.)

Fortunately, this is easy using `pandas`. We can use the `get_dummies` function to create a new `DataFrame` that breaks `prestige` up into four columns with binary values for each category.

For more information, go here: http://blog.yhat.com/posts/logistic-regression-python-rodeo.html
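As a minimal sketch of the `get_dummies` step described above: the `students` frame and its `admit` column are made up for illustration and are not part of the email dataset. The first prestige level is dropped so the remaining dummies are not collinear with an intercept.

```python
import pandas as pd

# Hypothetical toy data: one row per student, prestige coded 1 to 4
students = pd.DataFrame({"admit": [0, 1, 1, 0], "prestige": [1, 2, 3, 4]})

# One binary column per prestige level: prestige_1 ... prestige_4
dummies = pd.get_dummies(students["prestige"], prefix="prestige", dtype=int)

# Keep prestige_2 onward; prestige_1 becomes the baseline level
encoded = students[["admit"]].join(dummies.loc[:, "prestige_2":])
print(encoded)
```

Each row of `dummies` contains exactly one 1, marking that student's prestige level.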

In [34]:

```
# since we're using statsmodels, we need to manually add the intercept
df['intercept'] = 1.0
```

In [36]:

```
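# identify the independent variables
# (assumes `login` is the first column, so everything after it is a predictor)
ind_vars = df.columns[1:]
# regress the dependent variable (login) on the independent variables
logit = sm.Logit(df['login'], df[ind_vars])
# fit the model and print out the output summary
result = logit.fit()
print(result.summary())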
```

The output gives you a breakdown of the input parameters and output analysis. A sanity check reveals that the dependent variable was login and that there were over 1 million observations. We also see the list of independent variables (email types), their coefficients, significance, and confidence intervals.

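It's difficult to make sense of the coefficients because they are expressed on the log-odds scale, that is, as changes in the natural log of the odds of a login:

```
ln[p/(1-p)] = a + BX
```

So in order to better understand how a one-unit increase or decrease in the number of emails of a given type affects the odds of a user logging in, we'll need to take the exponential of the coefficients:

```
[p/(1-p)] = exp(a + BX)
```

Exponentiating a single coefficient gives an odds ratio: the factor by which the odds of a login are multiplied for each additional email of that type.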
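To make the two formulas concrete, here is a small sketch that moves between log-odds, odds, and probability. The coefficient used is hypothetical (chosen so that its exponential is 1.4); it is not read from the fitted model.

```python
import math

def odds_ratio(coefficient):
    """Exponentiate a logit coefficient to get an odds ratio."""
    return math.exp(coefficient)

def probability_from_log_odds(log_odds):
    """Invert ln[p/(1-p)] = log_odds to recover p."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficient, picked so that exp(coefficient) is 1.4
coefficient = math.log(1.4)
print(odds_ratio(coefficient))

# Log-odds of 0 means even odds, i.e. a probability of one half
print(probability_from_log_odds(0.0))
```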

Let's also add in the confidence interval so that we have an understanding of the range of values.

In [37]:

```
# output the odds ratios and 95% CI
conf = result.conf_int()
conf['odds_ratio'] = result.params
conf.columns = ['2.5%', '97.5%', 'odds_ratio']
print(np.exp(conf))
```

So what do our results mean? What does a 0.999 odds ratio for welcome emails mean?

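**Odds Ratios**
The odds of an event are the probability of the event occurring divided by the probability of the event not occurring, p/(1-p). An odds ratio is the factor by which those odds are multiplied for each additional email of a given type, holding the other email counts fixed.

- welcome email: 0.99, so each additional welcome email multiplies the odds of a user login by 0.99 (about a 1% decrease)
- newsletter email: 1.09, so each additional newsletter multiplies the odds of a user login by 1.09 (about a 9% increase)
- targeted email: 1.4, so each additional targeted email multiplies the odds of a user login by 1.4 (a 40% increase)

An odds ratio on its own does not give the probability of a login; that also depends on the baseline odds set by the intercept and the other variables.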
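The relationship between odds, odds ratios, and probabilities can be sketched in a few lines. The 20% baseline login probability below is a made-up number for illustration, not a value from the dataset.

```python
def odds_to_probability(odds):
    """Convert odds p/(1-p) back to the probability p."""
    return odds / (1.0 + odds)

def apply_odds_ratio(baseline_probability, odds_ratio):
    """Scale the baseline odds by an odds ratio and return the new probability."""
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    return odds_to_probability(baseline_odds * odds_ratio)

# Odds of 7 to 5 (1.4) correspond to a probability of 7/12
print(odds_to_probability(1.4))

# Hypothetical baseline: 20% login probability, then one more targeted email
print(apply_odds_ratio(0.20, 1.4))
```

The same odds ratio moves the probability by different amounts depending on where the baseline sits, which is why an odds ratio cannot be read directly as a login probability.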

**Confidence Interval**
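The confidence interval gives us a range of plausible values for each odds ratio: with a 95% interval, we are 95% confident that the true odds ratio lies within the calculated range. An interval that includes 1 is consistent with that email type having no effect on logins.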

**Significance (P-value)**
The logit results indicated that welcome emails had a `p-value` of 0.05. This means that if welcome emails had no effect on login, we'd see an association at least as strong as the one observed in about 5% of samples purely due to random sampling error. The same logic applies to the other variables.
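The "5% of samples" reading can be checked with a small simulation. This sketch swaps in a simple two-proportion z-test instead of the logit model, and the group size, login rate, and number of simulations are arbitrary assumptions. Both groups are drawn with the same login rate, so any "significant" difference is purely sampling error.

```python
import math
import random

def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def null_rejection_rate(n_sims=1000, n_users=300, login_rate=0.3,
                        alpha=0.05, seed=0):
    """Share of no-effect experiments that still reach p <= alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Two groups with the SAME true login rate (no real effect)
        a = sum(rng.random() < login_rate for _ in range(n_users))
        b = sum(rng.random() < login_rate for _ in range(n_users))
        pooled = (a + b) / (2 * n_users)
        se = math.sqrt(pooled * (1 - pooled) * 2 / n_users)
        if se == 0:
            continue  # degenerate sample, no test possible
        z = (a - b) / n_users / se
        if two_sided_p_value(z) <= alpha:
            rejections += 1
    return rejections / n_sims

# Should come out near alpha (about 0.05), up to simulation noise
print(null_rejection_rate())
```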