class: center, middle, inverse, title-slide

# Logistic regression
✌️

### Dr. Çetinkaya-Rundel

---

layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel - <a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

## Announcements

- Student hours this week: Tuesday 14:30 - 15:30 (JCMB 2257) and Wednesday 14:30 - 15:30 (online: [bit.ly/ids-zoom](http://bit.ly/ids-zoom))
- Developing a narrative for your project... [[Watch video]](https://www.youtube.com/watch?reload=9&v=vGUNqq3jVLg)

---

class: center, middle

# Predicting categorical data

---

# Spam filters

We will examine a data set of emails in which we are interested in identifying the spam messages.

- Data from 3921 emails and 21 variables on them.
- The outcome is whether the email is spam or not.
- Explanatory variables include the number of characters, whether the word "inherit" was in the email, the number of times the word "inherit" shows up in the email, etc.

---

.question[
Would you expect longer or shorter emails to be spam?
]

--

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-3-1.png" width="1500" />

---

.question[
Would you expect emails that have subjects starting with "Re:", "RE:", "re:", or "rE:" to be more likely to be spam or not?
]

--

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-4-1.png" width="1500" />

---

# Modeling spam

- It seems clear that both the number of characters and whether the message has "re:" in the subject are somewhat related to whether the email is spam. How do we come up with a model that will let us explore this relationship?

--

- For simplicity, we'll focus on the number of characters (`num_char`) as the explanatory variable, but the model we describe can be expanded to take multiple explanatory variables as well.

---

# Modeling spam

Even if we set not spam to 0 and spam to 1, this isn't something we can reasonably fit a linear model to - we need something more.

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-5-1.png" width="1500" />

---

# Framing the problem

- We can treat each outcome (spam and not spam) as successes and failures arising from separate Bernoulli trials
  - Bernoulli trial: a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted

--

- Each Bernoulli trial can have a separate probability of success:
$$ y_i \sim \text{Bern}(p_i) $$

--

- We can then use the predictor variables to model that probability of success, `\(p_i\)`

--

- We can't just use a linear model for `\(p_i\)` (since `\(p_i\)` must be between 0 and 1), but we can transform the linear model to have the appropriate range

---

## Generalized linear models

- It turns out that this is a very general way of addressing many problems in regression, and the resulting models are called generalized linear models (GLMs)

--

- Logistic regression is just one example.

---

## Three characteristics of GLMs

All generalized linear models have the following three characteristics:

1. A probability distribution describing a generative model for the outcome variable

--

2. A linear model:
$$ \eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k $$

--

3. A link function that relates the linear model to the parameter of the outcome distribution

---

## Logistic regression

- Logistic regression is a GLM used to model a binary categorical outcome using numerical and categorical predictors.

--

- To finish specifying the logistic model we just need to define a reasonable link function that connects `\(\eta_i\)` to `\(p_i\)`.
There are a variety of options, but the most commonly used is the **logit** function.

--

- **Logit function:**
$$ \text{logit}(p) = \log\left(\frac{p}{1-p}\right), \text{ for } 0 < p < 1 $$

---

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-6-1.png" width="1500" />

---

## Properties of the logit

- The logit function takes a value between 0 and 1 and maps it to a value between `\(-\infty\)` and `\(\infty\)`.

--

- Inverse logit (logistic) function:
`$$g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}$$`

--

- The inverse logit function takes a value between `\(-\infty\)` and `\(\infty\)` and maps it to a value between 0 and 1.

--

- This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success - more on this later.

---

## The logistic regression model

- The three GLM criteria give us:
  - `\(y_i \sim \text{Bern}(p_i)\)`
  - `\(\eta_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}\)`
  - `\(\text{logit}(p_i) = \eta_i\)`

--

- From which we get

`$$p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}$$`

---

## Modeling spam

- In R we fit a GLM in the same way as a linear model, except we use `glm()` instead of `lm()`.

--

- We specify the type of GLM to fit using the `family` argument.

```r
spam_model <- glm(spam ~ num_char, data = email, family = "binomial")
tidy(spam_model)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80     0.0716     -25.1  2.04e-139
## 2 num_char     -0.0621   0.00801     -7.75 9.50e- 15
```

---

## Spam model

```r
tidy(spam_model)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80     0.0716     -25.1  2.04e-139
## 2 num_char     -0.0621   0.00801     -7.75 9.50e- 15
```

--

Model:

`$$\log\left(\frac{p}{1-p}\right) = -1.80 - 0.0621 \times \text{num\_char}$$`

---

## P(spam) for an email with 2000 characters

Note that `num_char` is recorded in thousands of characters, so an email with 2000 characters has `num_char = 2`.

`$$\log\left(\frac{p}{1-p}\right) = -1.80 - 0.0621 \times 2 = -1.9242$$`

--

`$$\frac{p}{1-p} = \exp(-1.9242) = 0.15 \rightarrow p = 0.15 \times (1 - p)$$`

--

`$$p = 0.15 - 0.15p \rightarrow 1.15p = 0.15$$`

--

`$$p = 0.15 / 1.15 = 0.13$$`

---

.question[
What is the probability that an email with 15000 characters is spam? What about an email with 40000 characters?
]
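---

## P(spam) in R

Rather than solving the logit equation by hand each time, we can let R compute predicted probabilities for us. Below is a minimal sketch using the `spam_model` fit earlier; recall that `num_char` is recorded in thousands of characters, so 15000 and 40000 characters correspond to `num_char` values of 15 and 40.

```r
# predicted probabilities of spam for emails with
# 15,000 and 40,000 characters (num_char is in thousands)
new_emails <- data.frame(num_char = c(15, 40))
predict(spam_model, newdata = new_emails, type = "response")
```

Setting `type = "response"` asks `predict()` to apply the inverse logit and return probabilities rather than values on the log odds scale.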
---

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-10-1.png" width="1500" />

---

.question[
Would you prefer an email with 2000 characters to be labeled as spam or not? How about 40,000 characters?
]

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-11-1.png" width="1500" />

---

class: center, middle

# Sensitivity and specificity

---

|                         | Email is spam                 | Email is not spam             |
|-------------------------|-------------------------------|-------------------------------|
| Email labelled spam     | True positive                 | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative                 |

--

- False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)
- False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)

--

- Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)
  - Sensitivity = 1 − False negative rate
- Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)
  - Specificity = 1 − False positive rate

---

.question[
If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the tradeoffs associated with each decision?
]
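---

## Computing sensitivity and specificity in R

As a minimal sketch (assuming the `spam_model` fit earlier and `spam` coded 0/1 in the `email` data), we can tabulate predicted labels against the truth and read off sensitivity and specificity. With this single-predictor model no email reaches a predicted probability of 0.5 (the largest possible value is about 0.14, at `num_char = 0`), so a lower cutoff is used here for illustration.

```r
# predicted probabilities from the fitted model
pred_prob <- predict(spam_model, type = "response")

# label an email spam if its predicted probability exceeds the cutoff;
# keeping both factor levels guarantees a 2x2 confusion matrix
pred_spam <- factor(ifelse(pred_prob > 0.1, 1, 0), levels = c(0, 1))
conf <- table(predicted = pred_spam, actual = email$spam)

TP <- conf["1", "1"]; FN <- conf["0", "1"]
FP <- conf["1", "0"]; TN <- conf["0", "0"]

TP / (TP + FN)  # sensitivity = P(labelled spam | spam)
TN / (FP + TN)  # specificity = P(labelled not spam | not spam)
```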
---

## Using logistic regression to build a spam filter

- We have a set of emails in which we are interested in identifying the spam messages.
- Using logistic regression, we can predict the probability that an incoming message is spam.
- Using model selection, we can pick the model with the highest predictive power.

--

- But when designing a spam filter this is only half of the battle: we would also need to design a decision rule about which emails get flagged as spam (e.g. what probability should we use as our cutoff?)

--

- While not the only possible solution, we can consider a simple approach where we choose a single threshold probability, and any email that exceeds that probability is flagged as spam (see the sketch on the next slide).
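---

## Exploring the cutoff

As a sketch of the tradeoff (same assumptions as before: the `spam_model` fit and `spam` coded 0/1), we can recompute sensitivity and specificity for several candidate cutoffs. Lowering the cutoff flags more email, raising sensitivity at the cost of specificity; the cutoffs below are chosen within the range of this model's predicted probabilities.

```r
pred_prob <- predict(spam_model, type = "response")

for (cutoff in c(0.05, 0.10, 0.14)) {
  flagged <- pred_prob > cutoff
  sens <- mean(flagged[email$spam == 1])   # P(flagged | spam)
  spec <- mean(!flagged[email$spam == 0])  # P(not flagged | not spam)
  cat("cutoff =", cutoff, "| sensitivity =", round(sens, 2),
      "| specificity =", round(spec, 2), "\n")
}
```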