class: center, middle, inverse, title-slide

# Linear model with a single predictor
🍡
### Dr. Çetinkaya-Rundel

---

layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

class: center, middle

# Interpreting models

---

## Height & width

.small[

```r
m_ht_wt <- lm(Height_in ~ Width_in, data = pp)
tidy(m_ht_wt)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    3.62    0.254        14.3 8.82e-45
## 2 Width_in       0.781   0.00950      82.1 0.
```
]

--

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`

--

- **Slope:** For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.

--

- **Intercept:** Paintings that are 0 inches wide are expected to be 3.62 inches high, on average. The intercept is meaningless in the context of these data; it only serves to adjust the height of the line.

---

## Linear model with a single predictor

- We're interested in `\(\beta_0\)` (population parameter for the intercept) and `\(\beta_1\)` (population parameter for the slope) in the following model:

$$ \hat{y} = \beta_0 + \beta_1~x $$

--

- But we don't have access to the population, so we use sample statistics to estimate them:

$$ \hat{y} = b_0 + b_1~x $$

--

- Another way of describing this model is

$$ y = b_0 + b_1~x + e $$

---

## Least squares regression

- The regression line minimizes the sum of squared residuals.

--

- If `\(e_i = y_i - \hat{y}_i\)`, then the regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`.

---

## Properties of the least squares regression line

- The regression line goes through the center of mass point, the coordinates corresponding to average `\(x\)` and average `\(y\)`, `\((\bar{x}, \bar{y})\)`:

`$$\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}$$`

--

- The slope has the same sign as the correlation coefficient: `\(b_1 = r \frac{s_y}{s_x}\)`

--

- The sum of the residuals is zero: `\(\sum_{i = 1}^n e_i = 0\)`

--

- The residuals and `\(x\)` values are uncorrelated.

---

## Height & landscape features

```r
m_ht_lands <- lm(Height_in ~ factor(landsALL), data = pp)
tidy(m_ht_lands)
```

```
## # A tibble: 2 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          22.7      0.328      69.1 0.
## 2 factor(landsALL)1    -5.65     0.532     -10.6 7.97e-26
```

--

`$$\widehat{Height_{in}} = 22.68 - 5.65~landsALL$$`

---

## Height & landscape features (cont.)

- **Slope:** Paintings with landscape features are expected, on average, to be 5.65 inches shorter than paintings without landscape features.
  - Compares the baseline level (`landsALL = 0`) to the other level (`landsALL = 1`).
- **Intercept:** Paintings that don't have landscape features are expected, on average, to be 22.68 inches tall.
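---

## Slope as a difference in means

With a single two-level predictor, the intercept is the average height of the baseline group and the slope is the difference between the two group means. As a sanity check, we can compute the group means directly — a sketch, assuming `pp` and the tidyverse are loaded:

```r
# Mean height for each level of landsALL; the difference between
# the two means should match the slope estimate (about -5.65),
# up to the observations lm() drops for missing heights
pp %>%
  group_by(landsALL) %>%
  summarise(mean_height = mean(Height_in, na.rm = TRUE))
```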
---

## Categorical predictor with 2 levels

```
## # A tibble: 8 x 3
##   name     price landsALL
##   <chr>    <dbl>    <dbl>
## 1 L1764-2    360        0
## 2 L1764-3      6        0
## 3 L1764-4     12        1
## 4 L1764-5a     6        1
## 5 L1764-5b     6        1
## 6 L1764-6      9        0
## 7 L1764-7a    12        0
## 8 L1764-7b    12        0
```

---

## Relationship between height and school

```r
m_ht_sch <- lm(Height_in ~ school_pntg, data = pp)
tidy(m_ht_sch)
```

```
## # A tibble: 7 x 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.       10.0      1.40  0.162
## 2 school_pntgD/FL     2.33     10.0      0.232 0.816
## 3 school_pntgF       10.2      10.0      1.02  0.309
## 4 school_pntgG        1.65     11.9      0.139 0.889
## 5 school_pntgI       10.3      10.0      1.02  0.306
## 6 school_pntgS       30.4      11.4      2.68  0.00744
## 7 school_pntgX        2.87     10.3      0.279 0.780
```

--

.question[
Where did all these new predictors come from?
]

---

## Categorical explanatory variables with 3+ levels

- When the categorical explanatory variable has many levels, its levels are encoded as **dummy variables**.
- Each coefficient describes the expected difference between heights in that particular school compared to the baseline level.

.midi[
```
## # A tibble: 7 x 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.       10.0      1.40  0.162
## 2 school_pntgD/FL     2.33     10.0      0.232 0.816
## 3 school_pntgF       10.2      10.0      1.02  0.309
## 4 school_pntgG        1.65     11.9      0.139 0.889
## 5 school_pntgI       10.3      10.0      1.02  0.306
## 6 school_pntgS       30.4      11.4      2.68  0.00744
## 7 school_pntgX        2.87     10.3      0.279 0.780
```
]

--

.question[
What is the baseline level?
]

---

.question[
`landsALL` had two levels: 0 and 1. What was the baseline level for the model with `landsALL` as predictor?
]

```
## # A tibble: 2 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          22.7      0.328      69.1 0.
## 2 factor(landsALL)1    -5.65     0.532     -10.6 7.97e-26
```

---

.question[
`school_pntg` has 7 levels: A = Austrian, D/FL = Dutch/Flemish, F = French, G = German, I = Italian, S = Spanish, X = Unknown. What is the baseline level for the model with `school_pntg` as predictor?
]

```
## # A tibble: 7 x 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.       10.0      1.40  0.162
## 2 school_pntgD/FL     2.33     10.0      0.232 0.816
## 3 school_pntgF       10.2      10.0      1.02  0.309
## 4 school_pntgG        1.65     11.9      0.139 0.889
## 5 school_pntgI       10.3      10.0      1.02  0.306
## 6 school_pntgS       30.4      11.4      2.68  0.00744
## 7 school_pntgX        2.87     10.3      0.279 0.780
```

---

## Dummy coding

```
## # A tibble: 7 x 7
## # Groups:   school_pntg [7]
##   school_pntg  D_FL     F     G     I     S     X
##   <chr>       <int> <int> <int> <int> <int> <int>
## 1 A               0     0     0     0     0     0
## 2 D/FL            1     0     0     0     0     0
## 3 F               0     1     0     0     0     0
## 4 G               0     0     1     0     0     0
## 5 I               0     0     0     1     0     0
## 6 S               0     0     0     0     1     0
## 7 X               0     0     0     0     0     1
```
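---

## Dummy coding in R

The dummy variables above are exactly what R builds internally when fitting the model. To inspect them, we can look at the model matrix — a sketch, assuming `pp` is loaded; `model.matrix()` is the base R function that `lm()` uses behind the scenes:

```r
# One row per painting: an intercept column plus a 0/1 indicator
# column for each school other than the baseline (A)
head(model.matrix(~ school_pntg, data = pp))
```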
---

## Interpretation of coefficients

.question[
Interpret the intercept, the slope for Dutch/Flemish, and the slope for Spanish paintings.
]

```
## # A tibble: 7 x 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.       10.0      1.40  0.162
## 2 school_pntgD/FL     2.33     10.0      0.232 0.816
## 3 school_pntgF       10.2      10.0      1.02  0.309
## 4 school_pntgG        1.65     11.9      0.139 0.889
## 5 school_pntgI       10.3      10.0      1.02  0.306
## 6 school_pntgS       30.4      11.4      2.68  0.00744
## 7 school_pntgX        2.87     10.3      0.279 0.780
```

---

## The linear model with multiple predictors

- Population model:

$$ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k $$

--

- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

---

## Correlation does not imply causation!

Remember this when interpreting model coefficients.

<br>

<img src="img/cell_phones.png" width="100%" />

.footnote[
Source: XKCD, [Cell phones](https://xkcd.com/925/)
]

---

class: center, middle

# Prediction with models

---

## Predict height from width

.question[
On average, how tall are paintings that are 60 inches wide?
`\(\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}\)`
]

--

```r
3.62 + 0.78 * 60
```

```
## [1] 50.42
```

--

"On average, we expect paintings that are 60 inches wide to be 50.42 inches high."

--

**Warning:** We "expect" this to happen, but there will be some variability. (We'll learn about measuring the variability around the prediction later.)

---

## Prediction in R

Use the `predict()` function to make a prediction:

- First argument: the model
- Second argument: a data frame representing the new observation(s) for which you want to make a prediction, with variables that have the same names as those used to build the model, except for the response variable

---

## Prediction in R (cont.)

- Making a prediction for a single new observation:

```r
newpainting <- tibble(Width_in = 60)
predict(m_ht_wt, newdata = newpainting)
```

```
##        1 
## 50.46919
```

- Making a prediction for multiple new observations:

```r
newpaintings <- tibble(Width_in = c(50, 60, 70))
predict(m_ht_wt, newdata = newpaintings)
```

```
##        1        2        3 
## 42.66123 50.46919 58.27715
```

---

## Prediction vs. extrapolation

.question[
On average, how tall are paintings that are 400 inches wide?
`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`
]

<img src="w7_d2-linear-model-single-predictor_files/figure-html/extrapolate-1.png" width="1500" />

---

## Watch out for extrapolation!

> "When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on."<sup>1</sup>
<br>
Stephen Colbert, April 6th, 2010

.footnote[
[1] OpenIntro Statistics, "Extrapolation is treacherous."
]

---

class: center, middle

# Measuring model fit

---

## Measuring the strength of the fit

- The strength of the fit of a linear model is most commonly evaluated using `\(R^2\)`.
- It tells us what percent of the variability in the response variable is explained by the model.
- The remainder of the variability is explained by variables not included in the model or by inherent randomness in the data.

---

## Height vs. landscape features

.question[
Which of the following is the correct interpretation of the `\(R^2\)` of the model below?
]

.small[

```r
m_ht_lands <- lm(Height_in ~ factor(landsALL), data = pp)
glance(m_ht_lands)$r.squared
```

```
## [1] 0.03456724
```
]

**(a)** 3.5% of the variability in heights of paintings can be explained by whether or not the painting has landscape features.

**(b)** 3.5% of the variability in whether a painting has landscape features can be explained by heights of paintings.

**(c)** This model predicts 3.5% of heights of paintings accurately.

**(d)** This model predicts 3.5% of whether a painting has landscape features accurately.

**(e)** 3.5% of the time there is an association between whether a painting has landscape features and its height.

---

## Height vs. width

.small[

```r
glance(m_ht_wt)
```

```
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
## 1     0.683         0.683  8.30     6749.       0     2 -11083. 22173.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
```

```r
glance(m_ht_wt)$r.squared  # extract R-squared
```

```
## [1] 0.6829468
```
]

Roughly 68% of the variability in heights of paintings can be explained by their widths.
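---

## `\(R^2\)` as a squared correlation

For a model with a single predictor, `\(R^2\)` is simply the square of the correlation coefficient between the two variables. As a sanity check — a sketch, assuming `pp` is loaded:

```r
# Squared correlation between height and width, using only
# complete pairs; should match glance(m_ht_wt)$r.squared
cor(pp$Height_in, pp$Width_in, use = "complete.obs")^2
```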
---

class: center, middle

# Tidy regression output

---

## Not-so-tidy regression output

- You might come across these in your googling adventures, but we'll try to stay away from them.
- Not because they are wrong, but because they don't return their results as tidy data frames.

---

## Not-so-tidy regression output (1)

#### Option 1:

```r
m_ht_wt
```

```
## 
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
## 
## Coefficients:
## (Intercept)     Width_in  
##      3.6214       0.7808
```

---

## Not-so-tidy regression output (2)

#### Option 2:

```r
summary(m_ht_wt)
```

```
## 
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.714  -4.384  -2.422   3.169  85.084 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.621406   0.253860   14.27   <2e-16
## Width_in    0.780796   0.009505   82.15   <2e-16
## 
## Residual standard error: 8.304 on 3133 degrees of freedom
##   (258 observations deleted due to missingness)
## Multiple R-squared:  0.6829, Adjusted R-squared:  0.6828 
## F-statistic:  6749 on 1 and 3133 DF,  p-value: < 2.2e-16
```

---

## Review

.question[
What makes a data frame tidy?
]

--

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

---

## Tidy regression output

Achieved with functions from the broom package:

- `tidy`: Constructs a data frame that summarizes the model's statistical findings: coefficient estimates, *standard errors, test statistics, p-values*.
- `glance`: Constructs a concise one-row summary of the model. This typically contains values such as `\(R^2\)`, adjusted `\(R^2\)`, *and residual standard error that are computed once for the entire model*.
- `augment`: Adds columns to the original data that was modeled. This includes predictions and residuals.

---

## We've already seen...

- Tidy your model's statistical findings

.small[

```r
tidy(m_ht_wt)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    3.62    0.254        14.3 8.82e-45
## 2 Width_in       0.781   0.00950      82.1 0.
```
]

- Glance to assess model fit

.small[

```r
glance(m_ht_wt)
```

```
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
## 1     0.683         0.683  8.30     6749.       0     2 -11083. 22173.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
```
]
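---

## Why tidy output pays off

Because `tidy()` returns a regular data frame, the usual wrangling and plotting tools apply directly — something the `summary()` output above can't offer. A sketch, assuming broom, dplyr, and ggplot2 are loaded:

```r
# Plot the school model's coefficient estimates
tidy(m_ht_sch) %>%
  filter(term != "(Intercept)") %>%  # keep only the school effects
  ggplot(aes(x = estimate, y = term)) +
  geom_point() +
  labs(x = "Estimated difference from baseline height (in)", y = NULL)
```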
---

## Augment data with model results

New variables of note (for now):

- `.fitted`: Predicted value of the response variable
- `.resid`: Residuals

.small[

```r
augment(m_ht_wt) %>%
  slice(1:5)
```

```
## # A tibble: 5 x 10
##   .rownames Height_in Width_in .fitted .se.fit .resid    .hat .sigma
##   <chr>         <dbl>    <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
## 1 1                37     29.5    26.7   0.166  10.3  3.99e-4   8.30
## 2 2                18     14      14.6   0.165   3.45 3.96e-4   8.31
## 3 3                13     16      16.1   0.158  -3.11 3.61e-4   8.31
## 4 4                14     18      17.7   0.152  -3.68 3.37e-4   8.31
## 5 5                14     18      17.7   0.152  -3.68 3.37e-4   8.31
## # … with 2 more variables: .cooksd <dbl>, .std.resid <dbl>
```
]

--

.question[
Why might we be interested in these new variables?
]

---

## Residuals plot

.small[

```r
m_ht_wt_aug <- augment(m_ht_wt)

ggplot(m_ht_wt_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "gray", lty = 2) +
  labs(x = "Predicted height", y = "Residuals")
```

<img src="w7_d2-linear-model-single-predictor_files/figure-html/unnamed-chunk-20-1.png" width="1500" />
]

---

class: center, middle

# Model checking

---

## "Linear" models

- We're fitting a "linear" model, which assumes a linear relationship between our explanatory and response variables.
- But how do we assess this?

---

## Looking for...

- Residuals distributed randomly around 0
- With no visible pattern along the x or y axes

<img src="w7_d2-linear-model-single-predictor_files/figure-html/unnamed-chunk-21-1.png" width="1500" />

---

## Not looking for...

### Fan shapes

<img src="w7_d2-linear-model-single-predictor_files/figure-html/unnamed-chunk-22-1.png" width="1500" />

---

## Not looking for...

### Residuals correlated with predicted values

<img src="w7_d2-linear-model-single-predictor_files/figure-html/unnamed-chunk-23-1.png" width="1500" />

---

## Not looking for...

### Groups of patterns

<img src="w7_d2-linear-model-single-predictor_files/figure-html/unnamed-chunk-24-1.png" width="1500" />

---

## Not looking for...

### Any patterns!

<img src="w7_d2-linear-model-single-predictor_files/figure-html/unnamed-chunk-25-1.png" width="1500" />

---

.question[
What patterns does the residuals plot reveal that should make us question whether a linear model is a good fit for modeling the relationship between height and width of paintings?
]

<img src="w7_d2-linear-model-single-predictor_files/figure-html/unnamed-chunk-26-1.png" width="1500" />
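---

## Making patterns easier to spot

Overlaying a smoother on the residuals plot can make subtle trends stand out — if the linear fit is adequate, the smooth curve should hover near zero. A sketch, reusing `m_ht_wt_aug` from the residuals plot slide:

```r
ggplot(m_ht_wt_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +  # flexible trend line through the residuals
  geom_hline(yintercept = 0, color = "gray", lty = 2) +
  labs(x = "Predicted height", y = "Residuals")
```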