m_ht_wt <- lm(Height_in ~ Width_in, data = pp)
tidy(m_ht_wt)

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    3.62    0.254        14.3 8.82e-45
## 2 Width_in       0.781   0.00950      82.1 0.
$\widehat{Height_{in}} = 3.62 + 0.78 \times Width_{in}$
$\hat{y} = \beta_0 + \beta_1 x$   (population model)

$\hat{y} = b_0 + b_1 x$   (estimated model, fit from the sample)

$y = b_0 + b_1 x + e$   (observed value = fitted value + residual)
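One way to see where the estimates $b_0$ and $b_1$ come from: least squares gives $b_1 = r \, s_y / s_x$ and $b_0 = \bar{y} - b_1 \bar{x}$. A minimal sketch (toy data, not the pp dataset, since only the formulas matter here):

```r
# Least squares estimates computed from summary statistics (toy data)
x <- c(10, 14, 16, 18, 29.5)
y <- c(18, 13, 14, 14, 37)

b1 <- cor(x, y) * sd(y) / sd(x)  # slope: correlation scaled by sd ratio
b0 <- mean(y) - b1 * mean(x)     # intercept: line passes through (x-bar, y-bar)

# These match what lm() computes
coef(lm(y ~ x))
```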
m_ht_lands <- lm(Height_in ~ factor(landsALL), data = pp)
tidy(m_ht_lands)

## # A tibble: 2 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          22.7      0.328      69.1 0.
## 2 factor(landsALL)1    -5.65     0.532     -10.6 7.97e-26
$\widehat{Height_{in}} = 22.68 - 5.65 \times landsALL$
The slope compares the baseline level (landsALL = 0) to the other level (landsALL = 1).

## # A tibble: 8 x 3
##   name     price landsALL
##   <chr>    <dbl>    <dbl>
## 1 L1764-2    360        0
## 2 L1764-3      6        0
## 3 L1764-4     12        1
## 4 L1764-5a     6        1
## 5 L1764-5b     6        1
## 6 L1764-6      9        0
## 7 L1764-7a    12        0
## 8 L1764-7b    12        0
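A quick way to convince yourself of this interpretation: with a single binary predictor, the intercept is the mean of the baseline group and the slope is the difference between the two group means. A sketch with made-up data (not the pp dataset):

```r
# With one binary predictor, lm() reproduces the two group means:
# intercept = mean of baseline group, slope = difference in means
d <- data.frame(
  height = c(20, 24, 22, 15, 17, 19),
  lands  = c(0, 0, 0, 1, 1, 1)
)
m <- lm(height ~ factor(lands), data = d)

tapply(d$height, d$lands, mean)  # group means: 22 and 17
coef(m)                          # intercept 22, slope -5
```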
m_ht_sch <- lm(Height_in ~ school_pntg, data = pp)
tidy(m_ht_sch)

## # A tibble: 7 x 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.0      10.0      1.40  0.162
## 2 school_pntgD/FL     2.33     10.0      0.232 0.816
## 3 school_pntgF       10.2      10.0      1.02  0.309
## 4 school_pntgG        1.65     11.9      0.139 0.889
## 5 school_pntgI       10.3      10.0      1.02  0.306
## 6 school_pntgS       30.4      11.4      2.68  0.00744
## 7 school_pntgX        2.87     10.3      0.279 0.780
Where did all these new predictors come from?
## # A tibble: 7 x 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.0      10.0      1.40  0.162
## 2 school_pntgD/FL     2.33     10.0      0.232 0.816
## 3 school_pntgF       10.2      10.0      1.02  0.309
## 4 school_pntgG        1.65     11.9      0.139 0.889
## 5 school_pntgI       10.3      10.0      1.02  0.306
## 6 school_pntgS       30.4      11.4      2.68  0.00744
## 7 school_pntgX        2.87     10.3      0.279 0.780
What is the baseline level?

landsALL has two levels: 0 and 1. What is the baseline level for the model with landsALL as the predictor?
## # A tibble: 2 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          22.7      0.328      69.1 0.
## 2 factor(landsALL)1    -5.65     0.532     -10.6 7.97e-26
school_pntg has 7 levels: A = Austrian, D/FL = Dutch/Flemish, F = French, G = German, I = Italian, S = Spanish, X = Unknown. What is the baseline level for the model with school_pntg as the predictor?
## # A tibble: 7 x 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.0      10.0      1.40  0.162
## 2 school_pntgD/FL     2.33     10.0      0.232 0.816
## 3 school_pntgF       10.2      10.0      1.02  0.309
## 4 school_pntgG        1.65     11.9      0.139 0.889
## 5 school_pntgI       10.3      10.0      1.02  0.306
## 6 school_pntgS       30.4      11.4      2.68  0.00744
## 7 school_pntgX        2.87     10.3      0.279 0.780
## # A tibble: 7 x 7
## # Groups:   school_pntg [7]
##   school_pntg  D_FL     F     G     I     S     X
##   <chr>       <int> <int> <int> <int> <int> <int>
## 1 A               0     0     0     0     0     0
## 2 D/FL            1     0     0     0     0     0
## 3 F               0     1     0     0     0     0
## 4 G               0     0     1     0     0     0
## 5 I               0     0     0     1     0     0
## 6 S               0     0     0     0     1     0
## 7 X               0     0     0     0     0     1
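The indicator table above is essentially what R builds internally when a factor appears in a model formula; model.matrix() shows the expansion. A sketch with a toy factor:

```r
# R expands a factor into 0/1 indicator (dummy) columns; the first
# level is absorbed into the intercept and serves as the baseline
school <- factor(c("A", "D/FL", "F", "G", "I", "S", "X"))
mm <- model.matrix(~ school)
colnames(mm)  # "(Intercept)" plus one indicator per non-baseline level
```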
Interpret the intercept, the slope for Dutch/Flemish paintings, and the slope for Spanish paintings.
## # A tibble: 7 x 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.0      10.0      1.40  0.162
## 2 school_pntgD/FL     2.33     10.0      0.232 0.816
## 3 school_pntgF       10.2      10.0      1.02  0.309
## 4 school_pntgG        1.65     11.9      0.139 0.889
## 5 school_pntgI       10.3      10.0      1.02  0.306
## 6 school_pntgS       30.4      11.4      2.68  0.00744
## 7 school_pntgX        2.87     10.3      0.279 0.780
$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$
Remember this when interpreting model coefficients.
Source: XKCD, Cell phones
On average, how tall are paintings that are 60 inches wide?

$\widehat{Height_{in}} = 3.62 + 0.78 \times Width_{in}$

3.62 + 0.78 * 60

## [1] 50.42

"On average, we expect paintings that are 60 inches wide to be 50.42 inches high."

Warning: We "expect" this to happen, but there will be some variability. (We'll learn about measuring the variability around the prediction later.)
Use the predict() function to make a prediction:
newpainting <- tibble(Width_in = 60)
predict(m_ht_wt, newdata = newpainting)

##        1
## 50.46919
newpaintings <- tibble(Width_in = c(50, 60, 70))
predict(m_ht_wt, newdata = newpaintings)

##        1        2        3
## 42.66123 50.46919 58.27715
On average, how tall are paintings that are 400 inches wide?

$\widehat{Height_{in}} = 3.62 + 0.78 \times Width_{in}$
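Plugging 400 inches into the fitted line gives a mechanical answer, but notice how far outside the observed widths it falls:

```r
# The fitted line happily returns a number...
3.62 + 0.78 * 400
## [1] 315.62
# ...but a 400-inch-wide (~33 ft) painting is far beyond the widths in
# the data, so this is extrapolation and should not be trusted
```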
"When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on." [1]
Stephen Colbert, April 6th, 2010
[1] "Extrapolation is treacherous." OpenIntro Statistics.
Which of the following is the correct interpretation of the R² of the model below?
m_ht_lands <- lm(Height_in ~ factor(landsALL), data = pp)
glance(m_ht_lands)$r.squared
## [1] 0.03456724
(a) 3.5% of the variability in heights of paintings can be explained by whether or not the painting has landscape features.
(b) 3.5% of the variability in whether a painting has landscape features can be explained by heights of paintings.
(c) This model predicts 3.5% of heights of paintings accurately.
(d) This model predicts 3.5% of whether a painting has landscape features accurately.
(e) 3.5% of the time there is an association between whether a painting has landscape features and its height.
glance(m_ht_wt)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
## 1     0.683         0.683  8.30     6749.       0     2 -11083. 22173.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
glance(m_ht_wt)$r.squared # extract R-squared
## [1] 0.6829468
Roughly 68% of the variability in heights of paintings can be explained by their widths.
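For a simple linear regression with one predictor, R² is just the squared correlation between the two variables. A sketch with toy data (assumed here, not the pp dataset):

```r
# In simple linear regression, R-squared equals cor(x, y)^2
x <- c(10, 14, 16, 18, 29.5)
y <- c(18, 13, 14, 14, 37)
fit <- lm(y ~ x)
summary(fit)$r.squared  # same value as cor(x, y)^2
```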
m_ht_wt
##
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
##
## Coefficients:
## (Intercept)     Width_in
##      3.6214       0.7808
summary(m_ht_wt)
##
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -86.714  -4.384  -2.422   3.169  85.084
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.621406   0.253860   14.27   <2e-16
## Width_in    0.780796   0.009505   82.15   <2e-16
##
## Residual standard error: 8.304 on 3133 degrees of freedom
##   (258 observations deleted due to missingness)
## Multiple R-squared:  0.6829, Adjusted R-squared:  0.6828
## F-statistic:  6749 on 1 and 3133 DF,  p-value: < 2.2e-16
What makes a data frame tidy?
Achieved with functions from the broom package:

tidy: Constructs a data frame that summarizes the model's statistical findings: coefficient estimates, standard errors, test statistics, p-values.

glance: Constructs a concise one-row summary of the model. This typically contains values such as R², adjusted R², and residual standard error that are computed once for the entire model.

augment: Adds columns to the original data that was modeled. This includes predictions and residuals.

tidy(m_ht_wt)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    3.62    0.254        14.3 8.82e-45
## 2 Width_in       0.781   0.00950      82.1 0.
glance(m_ht_wt)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
## 1     0.683         0.683  8.30     6749.       0     2 -11083. 22173.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
New variables of note (for now):

.fitted: Predicted value of the response variable

.resid: Residuals

augment(m_ht_wt) %>% slice(1:5)

## # A tibble: 5 x 10
##   .rownames Height_in Width_in .fitted .se.fit .resid    .hat .sigma
##   <chr>         <dbl>    <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
## 1 1                37     29.5    26.7   0.166  10.3  3.99e-4   8.30
## 2 2                18     14      14.6   0.165   3.45 3.96e-4   8.31
## 3 3                13     16      16.1   0.158  -3.11 3.61e-4   8.31
## 4 4                14     18      17.7   0.152  -3.68 3.37e-4   8.31
## 5 5                14     18      17.7   0.152  -3.68 3.37e-4   8.31
## # … with 2 more variables: .cooksd <dbl>, .std.resid <dbl>
Why might we be interested in these new variables?
m_ht_wt_aug <- augment(m_ht_wt)
ggplot(m_ht_wt_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "gray", lty = 2) +
  labs(x = "Predicted height", y = "Residuals")
What patterns does the residuals plot reveal that should make us question whether a linear model is a good fit for modeling the relationship between height and width of paintings?