class: center, middle, inverse, title-slide

# Model validation and logistic regression

### Dr. Çetinkaya-Rundel

---

layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel - <a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

## Announcements

- Office hours later today online
- Peer evaluations -> team meetings

---

class: center, middle

# Model validation

---

## Overfitting

- The data we are using to construct our models come from a larger population.
- Ultimately we want our model to tell us how the population works, not just the sample we have.
- If the model we fit is too tailored to our sample, it might not perform as well with the remaining population. This means the model is "overfitting" our data.
- We measure this using **model validation** techniques.
- Note: Overfitting is not a huge concern with linear models with low-level interactions; however, it can be with more complex and flexible models. The following is just an example of model validation, even though we're using it in a scenario where the concern for overfitting is not high.

---

## Model validation

- One commonly used model validation technique is to partition your data into training and testing sets
- That is, fit the model on the training data
- And test it on the testing data
- Evaluate model performance using `\(RMSE\)`, root mean squared error

$$ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} $$

.question[
Do you think we should prefer low or high RMSE?
]

---

## Random sampling and reproducibility

Gotta set a seed!
```r
set.seed(3518)
```

- We will be taking random samples, but we want the analysis to be reproducible (across different people running the same analysis as well as for the same person running the analysis at different times)
- So we need to tell R where to start the (pseudo) random number generation

---

## Cross validation

More specifically, **k-fold cross validation**:

- Split your data into k folds
- Use 1 fold for testing and the remaining (k - 1) folds for training
- Repeat k times

---

## Aside -- the modulo operator

```r
9 %% 5
```

```
## [1] 4
```

--

.pull-left[
```
## # A tibble: 8 x 2
##     obs  fold
##   <int> <int>
## 1     1     1
## 2     2     2
## 3     3     3
## 4     4     4
## 5     5     5
## 6     6     1
## 7     7     2
## 8     8     3
```
]

--

.pull-right[
.midi[
```r
(1:8) %% 5
```

```
## [1] 1 2 3 4 0 1 2 3
```

```r
((1:8) - 1) %% 5
```

```
## [1] 0 1 2 3 4 0 1 2
```

```r
(((1:8) - 1) %% 5) + 1
```

```
## [1] 1 2 3 4 5 1 2 3
```
]
]

---

## Prepping your data for 5-fold CV

```r
evals_cv <- evals %>%
  sample_n(nrow(evals)) %>%              # scramble rows
  rowid_to_column() %>%                  # add id column
  mutate(fold = ((rowid - 1) %% 5) + 1)  # add fold column

evals_cv %>%
  count(fold)
```

```
## # A tibble: 5 x 2
##    fold     n
##   <dbl> <int>
## 1     1    93
## 2     2    93
## 3     3    93
## 4     4    92
## 5     5    92
```

---

## CV 1

```r
test_fold <- 1
test <- evals_cv %>% filter(fold == test_fold)
train <- evals_cv %>% filter(fold != test_fold)
mod <- lm(score ~ ethnicity + gender + language + age +
            cls_perc_eval + cls_credits + bty_avg, data = train)
(rmse_test1 <- rmse(mod, test))
```

```
## [1] 0.5145405
```

---

## RMSE on training vs. testing

.question[
Would you expect the RMSE to be higher for your training data or your testing data? Why?
]

---

## RMSE on training vs. testing

RMSE for testing:
.small[
```r
(rmse_test1 <- rmse(mod, test))
```

```
## [1] 0.5145405
```
]

RMSE for training:
.small[
```r
(rmse_train1 <- rmse(mod, train))
```

```
## [1] 0.4955728
```
]

---

## CV 2

```r
test_fold <- 2
test <- evals_cv %>% filter(fold == test_fold)
train <- evals_cv %>% filter(fold != test_fold)
mod <- lm(score ~ ethnicity + gender + language + age +
            cls_perc_eval + cls_credits + bty_avg, data = train)
```

```r
(rmse_test2 <- rmse(mod, test))
```

```
## [1] 0.5436487
```

```r
(rmse_train2 <- rmse(mod, train))
```

```
## [1] 0.488579
```

---

## CV 3

```r
test_fold <- 3
test <- evals_cv %>% filter(fold == test_fold)
train <- evals_cv %>% filter(fold != test_fold)
mod <- lm(score ~ ethnicity + gender + language + age +
            cls_perc_eval + cls_credits + bty_avg, data = train)
```

```r
(rmse_test3 <- rmse(mod, test))
```

```
## [1] 0.4950412
```

```r
(rmse_train3 <- rmse(mod, train))
```

```
## [1] 0.5016374
```

---

## CV 4

```r
test_fold <- 4
test <- evals_cv %>% filter(fold == test_fold)
train <- evals_cv %>% filter(fold != test_fold)
mod <- lm(score ~ ethnicity + gender + language + age +
            cls_perc_eval + cls_credits + bty_avg, data = train)
```

```r
(rmse_test4 <- rmse(mod, test))
```

```
## [1] 0.5164744
```

```r
(rmse_train4 <- rmse(mod, train))
```

```
## [1] 0.4956887
```

---

## CV 5

```r
test_fold <- 5
test <- evals_cv %>% filter(fold == test_fold)
train <- evals_cv %>% filter(fold != test_fold)
mod <- lm(score ~ ethnicity + gender + language + age +
            cls_perc_eval + cls_credits + bty_avg, data = train)
```

```r
(rmse_test5 <- rmse(mod, test))
```

```
## [1] 0.47848
```

```r
(rmse_train5 <- rmse(mod, train))
```

```
## [1] 0.5052493
```

---

## Putting it all together

<img src="w9_d2-model-validation_files/figure-html/unnamed-chunk-19-1.png" width="1500" />

---

## How does RMSE compare to y?
- `score` summary stats:

```
## # A tibble: 1 x 6
##     min   max  mean   med    sd   IQR
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   2.3     5  4.17   4.3 0.544 0.800
```

- `rmse_test` summary stats:

```
## # A tibble: 1 x 6
##     min   max  mean   med     sd    IQR
##   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1 0.478 0.544 0.510 0.515 0.0246 0.0214
```

---

class: center, middle

# Prediction

---

## New observation

To make a prediction for a new observation we need to create a data frame with that observation.

.question[
Suppose we want to make a prediction for a 35-year-old white woman professor who received her education in an English-speaking country and who teaches a multi-credit course. 80% of her classes tend to fill out evaluations, and she's average-looking (beauty score = 2.5). The following won't work. Why? How would you correct it?
]

```r
new_prof <- data_frame(ethnicity = "white",
                       sex = "woman",
                       language = "English",
                       age = 35,
                       cls_perc_eval = 0.80,
                       cls_credits = "multi-credit",
                       bty_avg = 2.5)
```

```
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
```

---

## New observation, corrected

```r
new_prof <- data_frame(ethnicity = "not minority",
                       gender = "female",
                       language = "english",
                       age = 35,
                       cls_perc_eval = 0.80,
                       cls_credits = "multi credit",
                       bty_avg = 2.5)
```

---

## Prediction

```r
predict(selected_model, newdata = new_prof)
```

```
##        1 
## 3.642981
```

---

## Uncertainty around prediction

Prediction interval around `\(\hat{y}\)` for new data (score for a prof with given characteristics):

```r
predict(selected_model, newdata = new_prof, interval = "prediction")
```

```
##        fit      lwr      upr
## 1 3.642981 2.626502 4.659461
```
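
---

## Aside: looping over the folds

The five CV slides repeat the same fit-and-score step by hand, changing only `test_fold`. A minimal sketch of automating that repetition with a loop, assuming the `evals_cv` data frame and the `rmse()` helper from the earlier slides are available:

```r
# For each fold k: fit on the other 4 folds, score on the held-out fold
rmse_test <- numeric(5)
for (k in 1:5) {
  test  <- evals_cv %>% filter(fold == k)
  train <- evals_cv %>% filter(fold != k)
  mod <- lm(score ~ ethnicity + gender + language + age +
              cls_perc_eval + cls_credits + bty_avg, data = train)
  rmse_test[k] <- rmse(mod, test)  # test RMSE for fold k
}
mean(rmse_test)  # average test RMSE across the 5 folds
```

Since the folds are fixed once the seed is set, the entries of `rmse_test` should reproduce the `rmse_test1` through `rmse_test5` values computed one slide at a time.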