So far we have done lots of estimation (mean, median, slope, etc.).
If you want to catch a fish, do you prefer a spear or a net?
If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?
Source: General election poll tracker: How do the parties compare?, 18 Nov 2019.
A plausible range of values for the population parameter is a confidence interval.
Suppose we split the class in half down the middle of the classroom and ask each student their height. Then, we calculate the mean height of students on each side of the classroom. Would you expect these two means to be exactly equal, close but not equal, or wildly different?
Suppose you randomly sample 50 students and 5 of them are left-handed. If you were to take another random sample of 50 students, how many would you expect to be left-handed? Would you be surprised if only 3 of them were left-handed? Would you be surprised if 40 of them were left-handed?
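One way to build intuition for this question is to simulate it. The sketch below uses base R and assumes, purely for illustration, that 10% of all students are left-handed (the proportion seen in the first sample, not a known population value). The names `p_assumed` and `counts` are introduced here and are not part of the original material.

```r
# ASSUMPTION: the true proportion of left-handed students is 10%.
# This value is only a stand-in taken from the first sample (5 out of 50).
p_assumed <- 0.10

set.seed(1)

# Draw 1000 random samples of 50 students each and record
# how many left-handed students appear in each sample.
counts <- rbinom(n = 1000, size = 50, prob = p_assumed)

# How the counts vary from sample to sample
table(counts)

# Share of simulated samples with 3 or fewer left-handed students
mean(counts <= 3)

# Share of simulated samples with 40 or more left-handed students
mean(counts >= 40)
```

Under this assumption, a count of 3 turns up regularly in the simulated samples, while a count of 40 does not occur at all.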
We can quantify the variability of sample statistics using
or
Take a guess! How much does a typical 3 BR flat in Edinburgh rent for?
Fifteen 3 BR flats in Edinburgh were randomly selected on rightmove.co.uk.
library(tidyverse)

edi_3br <- read_csv2("data/edi-3br.csv") # ; separated

## # A tibble: 15 x 4
##    flat_id  rent title                      address
##    <chr>   <dbl> <chr>                      <chr>
##  1 flat_01   825 3 bedroom apartment to re… Burnhead Grove, Edinburgh, Mid…
##  2 flat_02  2400 3 bedroom flat to rent     Simpson Loan, Quartermile, Edi…
##  3 flat_03  1900 3 bedroom flat to rent     FETTES ROW, NEW TOWN, EH3 6SE
##  4 flat_04  1500 3 bedroom apartment to re… Eyre Crescent, Edinburgh, Midl…
##  5 flat_05  3250 3 bedroom flat to rent     Walker Street, Edinburgh
##  6 flat_06  2145 3 bedroom flat to rent     George Street, City Centre, Ed…
##  7 flat_07  1500 3 bedroom flat to rent     Waverley Place , Edinburgh EH7…
##  8 flat_08  1950 3 bedroom flat to rent     Drumsheugh Place, Edinburgh
##  9 flat_09  1725 3 bedroom flat to rent     Crighton Place, Leith, Edinbur…
## 10 flat_10  2995 3 bedroom flat to rent     Simpson Loan, Meadows, Edinbur…
## 11 flat_11  1400 3 bedroom flat to rent     42, Learmonth Court, Edinburgh…
## 12 flat_12  1995 3 bedroom apartment to re… Chester Street, Edinburgh, Mid…
## 13 flat_13  1250 3 bedroom duplex to rent   Elmwood Terrace, Lochend, Edin…
## 14 flat_14  1995 3 bedroom apartment to re… Great King Street, Edinburgh, …
## 15 flat_15  1600 3 bedroom ground floor fl… Roseneath Terrace,Edinburgh,EH9
Generated assuming there are more flats like the ones in the observed sample... Population mean = ❓
for loops
Work in groups to explain what is happening in each line of the code below.
set.seed(9014)

boot_df <- tibble(
  replicate = 1:100,
  stat      = rep(NA, 100)
)

for (i in 1:100){
  boot_df$stat[i] <- edi_3br %>%
    sample_n(15, replace = TRUE) %>%
    summarise(stat = mean(rent)) %>%
    pull()
}
ggplot(boot_df, aes(x = stat)) + geom_histogram(binwidth = 100)
boot_df %>%
  summarise(
    lower = quantile(stat, 0.025),
    upper = quantile(stat, 0.975)
  )

## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 1594. 2292.
The objective of infer is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.
library(infer)
edi_3br %>%
  # specify the variable of interest
  specify(response = rent)

edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap")

edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>%
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")
# save resulting bootstrap distribution
boot_df <- edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>%
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")
How many observations are there in boot_df? What does each observation represent?
boot_df
## # A tibble: 15,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1 1793.
##  2         2 1938.
##  3         3 2175
##  4         4 2159.
##  5         5 2084
##  6         6 1761
##  7         7 1787.
##  8         8 1817.
##  9         9 1963.
## 10        10 1800.
## # … with 14,990 more rows
ggplot(data = boot_df, mapping = aes(x = stat)) + geom_histogram(binwidth = 100) + labs(title = "Bootstrap distribution of means")
A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.
boot_df %>% summarize(lower = quantile(stat, 0.025), upper = quantile(stat, 0.975))
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 1603. 2213.
The 95% confidence interval for the mean rent of three bedroom flats in Edinburgh was calculated as (1603, 2213). Which of the following is the correct interpretation of this interval?
(a) 95% of the time the mean rent of three bedroom flats in this sample is between £1603 and £2213.
(b) 95% of all three bedroom flats in Edinburgh have rents between £1603 and £2213.
(c) We are 95% confident that the mean rent of all three bedroom flats is between £1603 and £2213.
(d) We are 95% confident that the mean rent of three bedroom flats in this sample is between £1603 and £2213.
We are 95% confident that ...
Which line (orange dash/dot, blue dash, green dot) represents which confidence level?
If we want to be very certain that we capture the population parameter, should we use a wider interval or a narrower interval? What drawbacks are associated with using a wider interval?
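To see the trade-off between confidence level and interval width in numbers, here is a minimal base R sketch. It re-enters the 15 rents from the sample shown earlier rather than reading the CSV, and uses 5000 bootstrap resamples to keep it quick. The helper `percentile_ci()` and the names `boot_means`, `conf_levels`, and `widths` are illustrative and not part of the original material.

```r
# Rents of the 15 sampled flats, copied from the data shown earlier
rent <- c(825, 2400, 1900, 1500, 3250, 2145, 1500, 1950,
          1725, 2995, 1400, 1995, 1250, 1995, 1600)

set.seed(9014)

# Bootstrap distribution of the sample mean:
# resample 15 rents with replacement, take the mean, repeat 5000 times
boot_means <- replicate(
  5000,
  mean(sample(rent, size = length(rent), replace = TRUE))
)

# Hypothetical helper: percentile interval for a given confidence level,
# i.e. the middle `level` share of the bootstrap distribution
percentile_ci <- function(stats, level) {
  alpha <- 1 - level
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

conf_levels <- c(0.90, 0.95, 0.99)

# Width (upper bound minus lower bound) of the interval at each level
widths <- sapply(conf_levels, function(l) diff(percentile_ci(boot_means, l)))

data.frame(level = conf_levels, width = widths)
```

The widths can only grow as the confidence level rises, because a higher level keeps a larger middle share of the same bootstrap distribution.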
How can we get the best of both worlds -- high precision and high accuracy?
How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval?
edi_3br %>%
  specify(response = rent) %>%
  generate(reps = 15000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
calculate(stat = "mean")
calculate(stat = "median")
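Swapping the statistic changes what the bootstrap distribution describes, but not the procedure. As a rough base R illustration of the same idea (not the infer pipeline itself), the sketch below bootstraps the median rent using the 15 values from the sample shown earlier. The name `boot_medians` and the choice of 5000 resamples are assumptions made here.

```r
# Rents of the 15 sampled flats, copied from the data shown earlier
rent <- c(825, 2400, 1900, 1500, 3250, 2145, 1500, 1950,
          1725, 2995, 1400, 1995, 1250, 1995, 1600)

set.seed(9014)

# Bootstrap distribution of the sample median:
# resample 15 rents with replacement, take the median, repeat 5000 times
boot_medians <- replicate(
  5000,
  median(sample(rent, size = length(rent), replace = TRUE))
)

# 95% percentile interval: the middle 95% of the bootstrap medians
quantile(boot_medians, probs = c(0.025, 0.975))
```

Because each resample has 15 values (an odd number), every bootstrap median is one of the observed rents, so this bootstrap distribution is much more discrete than the one for the mean.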