class: center, middle, inverse, title-slide

# Logistic regression
✌️

### Dr. Çetinkaya-Rundel

---

layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel - <a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

## Announcements

- Student hours this week: Tuesday 14:30 - 15:30 (JCMB 2257) and Wednesday 14:30 - 15:30 (online: [bit.ly/ids-zoom](http://bit.ly/ids-zoom))
- Developing a narrative for your project... [[Watch video]](https://www.youtube.com/watch?reload=9&v=vGUNqq3jVLg)

---

class: center, middle

# Predicting categorical data

---

# Spam filters

We will examine a data set of emails in which we are interested in identifying the spam messages.

- Data from 3921 emails and 21 variables on them.
- The outcome is whether the email is spam or not.
- Explanatory variables include the number of characters, whether the word "inherit" was in the email, the number of times the word "inherit" shows up in the email, etc.

---

.question[
Would you expect longer or shorter emails to be spam?
]

--

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-3-1.png" width="1500" />

---

.question[
Would you expect emails that have subjects starting with "Re:", "RE:", "re:", or "rE:" to be more likely to be spam or not?
]

--

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-4-1.png" width="1500" />

---

# Modeling spam

- It seems clear that both the number of characters and whether the message has "re:" in the subject are somewhat related to whether the email is spam. How do we come up with a model that will let us explore this relationship?

--

- For simplicity, we'll focus on the number of characters (`num_char`) as the explanatory variable, but the model we describe can be expanded to take multiple explanatory variables as well.

---

# Modeling spam

Even if we set not spam to 0 and spam to 1, this isn't something we can reasonably fit a linear model to - we need something more.

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-5-1.png" width="1500" />

---

# Framing the problem

- We can treat each outcome (spam and not spam) as successes and failures arising from separate Bernoulli trials
  - Bernoulli trial: a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted

--

- Each Bernoulli trial can have a separate probability of success:
$$ y_i \sim \text{Bern}(p_i) $$

--

- We can then use the predictor variables to model that probability of success, `\(p_i\)`

--

- We can't just use a linear model for `\(p_i\)` (since `\(p_i\)` must be between 0 and 1), but we can transform the linear model to have the appropriate range

---

## Generalized linear models

- It turns out that this is a very general way of addressing many problems in regression, and the resulting models are called generalized linear models (GLMs)

--

- Logistic regression is just one example.

---

## Three characteristics of GLMs

All generalized linear models have the following three characteristics:

1. A probability distribution describing a generative model for the outcome variable

--

2. A linear model:
$$ \eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k $$

--

3. A link function that relates the linear model to the parameter of the outcome distribution

---

## Logistic regression

- Logistic regression is a GLM used to model a binary categorical outcome using numerical and categorical predictors.

--

- To finish specifying the logistic model we just need to define a reasonable link function that connects `\(\eta_i\)` to `\(p_i\)`.
There are a variety of options, but the most commonly used is the **logit** function.

--

- **Logit function:**
$$ \text{logit}(p) = \log\left(\frac{p}{1-p}\right), \text{ for } 0 < p < 1 $$

---

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-6-1.png" width="1500" />

---

## Properties of the logit

- The logit function takes a value between 0 and 1 and maps it to a value between `\(-\infty\)` and `\(\infty\)`.

--

- Inverse logit (logistic) function:
`$$g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}$$`

--

- The inverse logit function takes a value between `\(-\infty\)` and `\(\infty\)` and maps it to a value between 0 and 1.

--

- This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success - more on this later.

---

## The logistic regression model

- The three GLM criteria give us:
  - `\(y_i \sim \text{Bern}(p_i)\)`
  - `\(\eta_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}\)`
  - `\(\text{logit}(p_i) = \eta_i\)`

--

- From which we get

`$$p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}$$`

---

## Modeling spam

- In R we fit a GLM in the same way as a linear model, except we use `glm()` instead of `lm()`.

--

- We specify the type of GLM to fit using the `family` argument.

```r
spam_model <- glm(spam ~ num_char, data = email, family = "binomial")
tidy(spam_model)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80     0.0716     -25.1  2.04e-139
## 2 num_char     -0.0621   0.00801     -7.75 9.50e- 15
```

---

## Spam model

```r
tidy(spam_model)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80     0.0716     -25.1  2.04e-139
## 2 num_char     -0.0621   0.00801     -7.75 9.50e- 15
```

--

Model:

`$$\log\left(\frac{p}{1-p}\right) = -1.80 - 0.0621 \times \text{num\_char}$$`

---

## P(spam) for an email with 2000 characters

Note that `num_char` is recorded in thousands of characters, so an email with 2000 characters has `num_char = 2`.

`$$\log\left(\frac{p}{1-p}\right) = -1.80 - 0.0621 \times 2 = -1.9242$$`

--

`$$\frac{p}{1-p} = \exp(-1.9242) = 0.15 \rightarrow p = 0.15 \times (1 - p)$$`

--

`$$p = 0.15 - 0.15p \rightarrow 1.15p = 0.15$$`

--

`$$p = 0.15 / 1.15 = 0.13$$`

---

.question[
What is the probability that an email with 15000 characters is spam? What about an email with 40000 characters?
]
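---

## P(spam) in R

Rather than solving the logit equation by hand each time, we can let R compute predicted probabilities for us. Below is a minimal sketch using the `spam_model` fit earlier; recall that `num_char` is recorded in thousands of characters, so 15000 and 40000 characters correspond to `num_char` values of 15 and 40.

```r
# predicted probabilities of spam for emails with
# 15,000 and 40,000 characters (num_char is in thousands)
new_emails <- data.frame(num_char = c(15, 40))
predict(spam_model, newdata = new_emails, type = "response")
```

Setting `type = "response"` asks `predict()` to apply the inverse logit and return probabilities rather than values on the log odds scale.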
---

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-10-1.png" width="1500" />

---

.question[
Would you prefer an email with 2000 characters to be labeled as spam or not? How about 40,000 characters?
]

<img src="w10_d1-logistic-regression_files/figure-html/unnamed-chunk-11-1.png" width="1500" />

---

class: center, middle

# Sensitivity and specificity

---

|                         | Email is spam                 | Email is not spam             |
|-------------------------|-------------------------------|-------------------------------|
| Email labelled spam     | True positive                 | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative                 |

--

- False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)
- False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)

--

- Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)
  - Sensitivity = 1 − False negative rate
- Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)
  - Specificity = 1 − False positive rate

---

.question[
If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the tradeoffs associated with each decision?
]
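---

## Computing sensitivity and specificity in R

As a minimal sketch (assuming the `spam_model` fit earlier and `spam` coded 0/1 in the `email` data), we can tabulate predicted labels against the truth and read off sensitivity and specificity. With this single-predictor model no email reaches a predicted probability of 0.5 (the largest possible value is about 0.14, at `num_char = 0`), so a lower cutoff is used here for illustration.

```r
# predicted probabilities from the fitted model
pred_prob <- predict(spam_model, type = "response")

# label an email spam if its predicted probability exceeds the cutoff;
# keeping both factor levels guarantees a 2x2 confusion matrix
pred_spam <- factor(ifelse(pred_prob > 0.1, 1, 0), levels = c(0, 1))
conf <- table(predicted = pred_spam, actual = email$spam)

TP <- conf["1", "1"]; FN <- conf["0", "1"]
FP <- conf["1", "0"]; TN <- conf["0", "0"]

TP / (TP + FN)  # sensitivity = P(labelled spam | spam)
TN / (FP + TN)  # specificity = P(labelled not spam | not spam)
```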
---

## Using logistic regression to build a spam filter

- We have a set of emails in which we are interested in identifying the spam messages.
- Using logistic regression, we can predict the probability that an incoming message is spam.
- Using model selection, we can pick the model with the highest predictive power.

--

- But when designing a spam filter this is only half of the battle: we would also need to design a decision rule about which emails get flagged as spam (e.g. what probability should we use as our cutoff?)

--

- While not the only possible solution, we can consider a simple approach where we choose a single threshold probability, and any email that exceeds that probability is flagged as spam (see the sketch on the next slide).
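---

## Exploring the cutoff

As a sketch of the tradeoff (same assumptions as before: the `spam_model` fit and `spam` coded 0/1), we can recompute sensitivity and specificity for several candidate cutoffs. Lowering the cutoff flags more email, raising sensitivity at the cost of specificity; the cutoffs below are chosen within the range of this model's predicted probabilities.

```r
pred_prob <- predict(spam_model, type = "response")

for (cutoff in c(0.05, 0.10, 0.14)) {
  flagged <- pred_prob > cutoff
  sens <- mean(flagged[email$spam == 1])   # P(flagged | spam)
  spec <- mean(!flagged[email$spam == 0])  # P(not flagged | not spam)
  cat("cutoff =", cutoff, "| sensitivity =", round(sens, 2),
      "| specificity =", round(spec, 2), "\n")
}
```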