class: center, middle, inverse, title-slide

# Linear model with multiple predictors
🤹‍♀
### Dr. Çetinkaya-Rundel

---

layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel - <a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

## Announcements

- Student hours later today: Wed, 14:30 - 15:30 @ [bit.ly/ids-zoom-week-08](http://bit.ly/ids-zoom-week-08) (Online!)

---

class: center, middle

# The linear model with multiple predictors

---

## Multiple predictors

- Response variable: log(price)
- Explanatory variables: Width and height

```r
m_wi_hgt <- lm(log(price) ~ Width_in + Height_in, data = pp)
tidy(m_wi_hgt)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   4.77     0.0579      82.4  0.
## 2 Width_in      0.0269   0.00373      7.22 6.58e-13
## 3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4
```

---

## Linear model with multiple predictors

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   4.77     0.0579      82.4  0.
## 2 Width_in      0.0269   0.00373      7.22 6.58e-13
## 3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4
```

<br>

`$$\widehat{log(price)} = 4.77 + 0.0269~width - 0.0133~height$$`

---

## Visualizing models with multiple predictors
---

class: center, middle

# Exploration: Price, surface area, and living artist

---

## Price, surface area, and living artist

- Explore the relationship between price of paintings and surface area, conditioned on whether or not the artist is still living
- First visualize and explore, then model
- But first, prep the data

.midi[
```r
pp <- pp %>%
  mutate(artistliving = if_else(artistliving == 0, "Deceased", "Living"))

pp %>%
  count(artistliving)
```

```
## # A tibble: 2 x 2
##   artistliving     n
##   <chr>        <int>
## 1 Deceased      2937
## 2 Living         456
```
]

---

## Typical surface area

.question[
What is the typical surface area for paintings?
]

<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/viz-surf-artistliving-1.png" width="1500" />

--

Less than 1000 square inches (~ 80cm x 80cm). There are very few paintings that have surface area above 5000 square inches.

---

## Narrowing the scope

For simplicity let's focus on the paintings with `Surface < 5000`:

```r
pp_Surf_lt_5000 <- pp %>%
  filter(Surface < 5000)
```

<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/viz-surf-lt-5000-artistliving-1.png" width="1500" />

---

## Facet to get a better look

<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/viz-surf-lt-5000-artistliving-facet-1.png" width="1500" />

---

## Two ways to model

- **Main effects:** Assuming relationship between surface and logged price **does not vary** by whether or not the artist is living.
- **Interaction effects:** Assuming relationship between surface and logged price **varies** by whether or not the artist is living.

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
- This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
- This can be accomplished by adding an interaction variable: the product of two explanatory variables.
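
To make the product idea concrete, here is a minimal sketch on simulated data. The `toy` data frame and its columns `surface`, `living`, and `log_price` are hypothetical and are not the paintings data. It builds the interaction variable by hand and compares the result with writing the interaction with `*` in the formula.

```r
# Simulated stand-in data (hypothetical names and values)
set.seed(1)
n <- 200
toy <- data.frame(
  surface = runif(n, 0, 5000),
  living  = rbinom(n, 1, 0.3)   # 0/1 indicator
)
toy$log_price <- 5 + 0.0002 * toy$surface - 0.1 * toy$living +
  0.0005 * toy$surface * toy$living + rnorm(n, sd = 0.5)

# Interaction variable built by hand: the product of the two predictors
toy$surface_x_living <- toy$surface * toy$living

# Same model, specified two ways
fit_manual  <- lm(log_price ~ surface + living + surface_x_living, data = toy)
fit_formula <- lm(log_price ~ surface * living, data = toy)

# Slope of surface for each group under the interaction model
b <- coef(fit_formula)
slope_when_0 <- b[["surface"]]
slope_when_1 <- b[["surface"]] + b[["surface:living"]]
```

With a 0/1 indicator, the slope for the reference group is the `surface` coefficient, and the slope for the other group adds the interaction coefficient to it.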
---

class: center, middle

# Side-step: Weights of books

---

.question[
Suppose we want to predict weight of books from their volume and cover type (hardback vs. paperback). Do you think a model with main effects or interaction effects is more appropriate? Explain your reasoning.

**Hint:** Main effects would mean the rate at which weight changes as volume increases is the same for hardback and paperback books; interaction effects would mean that rate differs between hardback and paperback books.
]
---

The `allbacks` data frame gives measurements on the volume and weight of 15 books, some of which are softback and some of which are hardback.

.small[
```r
library(DAAG)
as_tibble(allbacks)
```

```
## # A tibble: 15 x 4
##    volume  area weight cover
##     <dbl> <dbl>  <dbl> <fct>
##  1    885   382    800 hb
##  2   1016   468    950 hb
##  3   1125   387   1050 hb
##  4    239   371    350 hb
##  5    701   371    750 hb
##  6    641   367    600 hb
##  7   1228   396   1075 hb
##  8    412     0    250 pb
##  9    953     0    700 pb
## 10    929     0    650 pb
## 11   1492     0    975 pb
## 12    419     0    350 pb
## 13   1010     0    950 pb
## 14    595     0    425 pb
## 15   1034     0    725 pb
```
]

.footnote[
The bookshelf of J. H. Maindonald at Australian National University.
]

---

```r
ggplot(allbacks, aes(x = volume, y = weight, color = cover)) +
  geom_point(alpha = 0.7) +
  theme_minimal()
```

<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/unnamed-chunk-5-1.png" width="1500" />

---

<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/unnamed-chunk-6-1.png" width="1800" />
<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/unnamed-chunk-7-1.png" width="1800" />

---

## In pursuit of Occam's razor

- Occam's Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.
- Model selection follows this principle.
- We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.
- In other words, we prefer the simplest best model, i.e. the **parsimonious** model.

---

.question[
Visually, which of the two models is preferable under Occam's razor?
]

<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/unnamed-chunk-8-1.png" width="1800" />
<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/unnamed-chunk-9-1.png" width="1800" />

---

## R-squared

- `\(R^2\)` is the percentage of variability in the response variable explained by the regression model.
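
The chunk that fits `m_main` and `m_int` for the books data does not appear on these slides. The sketch below assumes they are `weight ~ volume + cover` and `weight ~ volume * cover`, rebuilds the data from the table printed earlier so it runs without `DAAG`, and computes `\(R^2\)` by hand as one minus the ratio of the residual to the total sum of squares. Treat the formulas and the helper `r_squared()` as assumptions, not as the code behind the numbers that follow.

```r
# Books data copied from the allbacks table printed earlier
books <- data.frame(
  volume = c(885, 1016, 1125, 239, 701, 641, 1228,
             412, 953, 929, 1492, 419, 1010, 595, 1034),
  weight = c(800, 950, 1050, 350, 750, 600, 1075,
             250, 700, 650, 975, 350, 950, 425, 725),
  cover  = factor(rep(c("hb", "pb"), times = c(7, 8)))
)

# Assumed model formulas (not shown in the slides)
m_main <- lm(weight ~ volume + cover, data = books)  # parallel lines
m_int  <- lm(weight ~ volume * cover, data = books)  # separate slopes

# Hypothetical helper: R-squared = 1 - residual SS / total SS
r_squared <- function(model, y) {
  1 - sum(residuals(model)^2) / sum((y - mean(y))^2)
}

r2_main <- r_squared(m_main, books$weight)
r2_int  <- r_squared(m_int, books$weight)
```

Because the main-effects model is nested in the interaction model, `r2_int` can never be smaller than `r2_main`.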
```r
glance(m_main)$r.squared
```

```
## [1] 0.9274776
```

```r
glance(m_int)$r.squared
```

```
## [1] 0.9297137
```

--

- Clearly the model with interactions has a higher `\(R^2\)`.

--

- However, using `\(R^2\)` for model selection in models with multiple explanatory variables is not a good idea, as `\(R^2\)` never decreases when **any** variable is added to the model, even an irrelevant one.

---

## Adjusted R-squared

... a (more) objective measure for model selection

- Adjusted `\(R^2\)` doesn't increase if the new variable does not provide any new information or is completely unrelated, as it applies a penalty for the number of variables included in the model.
- This makes adjusted `\(R^2\)` a preferable metric for model selection in multiple regression models.

---

## Comparing models

.pull-left[
```r
glance(m_main)$r.squared
```

```
## [1] 0.9274776
```

```r
glance(m_int)$r.squared
```

```
## [1] 0.9297137
```
]

.pull-right[
```r
glance(m_main)$adj.r.squared
```

```
## [1] 0.9153905
```

```r
glance(m_int)$adj.r.squared
```

```
## [1] 0.9105447
```
]

--

.small[
```r
# Is R-sq higher for int model?
glance(m_int)$r.squared > glance(m_main)$r.squared
```

```
## [1] TRUE
```

```r
# Is R-sq adj. higher for int model?
glance(m_int)$adj.r.squared > glance(m_main)$adj.r.squared
```

```
## [1] FALSE
```
]

---

class: center, middle

# Back to exploration: Price, surface area, and living artist

---

## Two ways to model

- **Main effects:** Assuming relationship between surface and logged price **does not vary** by whether or not the artist is living.
- **Interaction effects:** Assuming relationship between surface and logged price **varies** by whether or not the artist is living.
.pull-left[
<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/viz-main-effects-1.png" width="1500" />
]

.pull-right[
<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/viz-interaction-effects-1.png" width="1500" />
]

---

## Fit model with main effects

- Response variable: log(price)
- Explanatory variables: Surface area and artist living (0/1 variable)

.midi[
```r
m_main <- lm(log(price) ~ Surface + factor(artistliving), data = pp_Surf_lt_5000)
tidy(m_main)
```

```
## # A tibble: 3 x 5
##   term                       estimate std.error statistic  p.value
##   <chr>                         <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                4.88     0.0424       115.   0.
## 2 Surface                    0.000265 0.0000415      6.39 1.85e-10
## 3 factor(artistliving)Living 0.137    0.0970         1.41 1.57e- 1
```
]

--

$$ \widehat{log(price)} = 4.88 + 0.000265~surface + 0.137~artistliving $$

---

## Solving the model

- Non-living artist: Plug in 0 for `artistliving`

`\(\widehat{log(price)} = 4.88 + 0.000265~surface + 0.137 \times 0\)`
`\(= 4.88 + 0.000265~surface\)`

--

- Living artist: Plug in 1 for `artistliving`

`\(\widehat{log(price)} = 4.88 + 0.000265~surface + 0.137 \times 1\)`
`\(= 5.017 + 0.000265~surface\)`

---

## Visualizing main effects

<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/unnamed-chunk-14-1.png" width="1500" />

- **Same slope:** Rate of change in logged price as the surface area increases does not vary between paintings by living and non-living artists.
- **Different intercept:** Paintings by living artists are consistently more expensive than paintings by non-living artists.

---

## Interpreting main effects

.midi[
```r
tidy(m_main) %>%
  mutate(exp_estimate = exp(estimate)) %>%
  select(term, estimate, exp_estimate)
```

```
## # A tibble: 3 x 3
##   term                       estimate exp_estimate
##   <chr>                         <dbl>        <dbl>
## 1 (Intercept)                4.88           132.
## 2 Surface                    0.000265         1.00
## 3 factor(artistliving)Living 0.137            1.15
```
]

- All else held constant, for each additional square inch in a painting's surface area, the price of the painting is predicted, on average, to be higher by a factor of about 1.0003 (`exp(0.000265)`, which the table rounds to 1.00).
- All else held constant, paintings by a living artist are predicted, on average, to be priced higher by a factor of 1.15 compared to paintings by an artist who is no longer alive.
- Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to cost 132 livres.

---

.question[
Why is our linear regression model different from what we got from `geom_smooth(method = "lm")`?
]

.pull-left[
<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/unnamed-chunk-15-1.png" width="1500" />
]

.pull-right[
<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/viz-main-effects3-1.png" width="1500" />
]

---

## What went ~wrong~ differently?

- The way we specified our model only lets `artistliving` affect the intercept.
- Model implicitly assumes that paintings with living and deceased artists have the *same slope* and only allows for *different intercepts*.

.question[
What seems more appropriate in this case?

+ Same slope and same intercept for both colors
+ Same slope and different intercept for both colors
+ Different slope and different intercept for both colors
]

---

## Interaction: surface * artist living

<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/unnamed-chunk-16-1.png" width="1500" />

---

## Fit model with interaction effects

- Response variable: log(price)
- Explanatory variables: Surface area, artist living (0/1 variable), and their interaction

.midi[
```r
m_int <- lm(log(price) ~ Surface + factor(artistliving) +
              Surface * factor(artistliving),
            data = pp_Surf_lt_5000)
tidy(m_int)
```

```
## # A tibble: 4 x 5
##   term                               estimate std.error statistic p.value
##   <chr>                                 <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                        4.91     0.0432       114.   0.
## 2 Surface                            0.000206 0.0000442      4.65 3.37e-6
## 3 factor(artistliving)Living        -0.126    0.119         -1.06 2.89e-1
## 4 Surface:factor(artistliving)Livi…  0.000479 0.000126       3.81 1.39e-4
```
]

---

## Linear model with interaction effects

.midi[
```
## # A tibble: 4 x 5
##   term                               estimate std.error statistic p.value
##   <chr>                                 <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                        4.91     0.0432       114.   0.
## 2 Surface                            0.000206 0.0000442      4.65 3.37e-6
## 3 factor(artistliving)Living        -0.126    0.119         -1.06 2.89e-1
## 4 Surface:factor(artistliving)Livi…  0.000479 0.000126       3.81 1.39e-4
```
]

$$ \widehat{log(price)} = 4.91 + 0.00021~surface - 0.126~artistliving $$
$$+ ~ 0.00048~surface \times artistliving $$

---

## Interpretation of interaction effects

- Rate of change in logged price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes).
- Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (different intercept).

.small[
.pull-left[
- Non-living artist:

`\(\widehat{log(price)} = 4.91 + 0.00021~surface\)`
`\(- 0.126 \times 0 + 0.00048~surface \times 0\)`
`\(= 4.91 + 0.00021~surface\)`

- Living artist:

`\(\widehat{log(price)} = 4.91 + 0.00021~surface\)`
`\(- 0.126 \times 1 + 0.00048~surface \times 1\)`
`\(= 4.91 + 0.00021~surface\)`
`\(- 0.126 + 0.00048~surface\)`
`\(= 4.784 + 0.00069~surface\)`
]

.pull-right[
<img src="w8_d2-linear-model-multiple-predictors_files/figure-html/viz-interaction-effects2-1.png" width="1500" />
]
]

---

## Comparing models

It appears that adding the interaction actually increased adjusted `\(R^2\)`, so we should indeed use the model with the interactions.

```r
glance(m_main)$adj.r.squared
```

```
## [1] 0.01258977
```

```r
glance(m_int)$adj.r.squared
```

```
## [1] 0.01676753
```

---

## Third order interactions

- Can you? Yes
- Should you? Probably not if you want to interpret these interactions in context of the data.
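
For reference, a third order interaction can be written by chaining `*` in the model formula. The sketch below is only an illustration: it uses R's built-in `mtcars` data as a stand-in, which is not part of these slides, and the name `m_three` is hypothetical.

```r
# Stand-in data: mtcars ships with R and is not the paintings data
m_three <- lm(mpg ~ wt * hp * factor(am), data = mtcars)

# `a * b * c` expands to all main effects, all two-way interactions,
# and the three-way interaction
term_names <- names(coef(m_three))
term_names
```

Three predictors already produce eight coefficients (intercept, three main effects, three two-way interactions, one three-way interaction), which is why interpreting such a model in context gets difficult quickly.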