
Quantifying uncertainty
👢

Dr. Çetinkaya-Rundel

1 / 41

Announcements

  • Check presentation schedule and let me know if there are any concerns.
  • We'll start presentations at 12pm (not 12:10pm)!
2 / 41

Inference

3 / 41
  • Statistical inference is the process of using sample data to make conclusions about the underlying population the sample came from.
  • Similar to tasting a spoonful of soup while cooking to make an inference about the entire pot.

4 / 41

Estimation

So far we have done lots of estimation (mean, median, slope, etc.), i.e.

  • used data from samples to calculate sample statistics
  • which can then be used as estimates for population parameters
5 / 41

If you want to catch a fish, do you prefer a spear or a net?


6 / 41

If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?


  • If we report a point estimate, we probably won’t hit the exact population parameter.
  • If we report a range of plausible values we have a good shot at capturing the parameter.
7 / 41

8 / 41

Confidence intervals

9 / 41

Confidence intervals

A plausible range of values for the population parameter is a confidence interval.

  • In order to construct a confidence interval we need to quantify the variability of our sample statistic.
  • For example, if we want to construct a confidence interval for a population mean, we need to come up with a plausible range of values around our observed sample mean.
  • This range will depend on how precise and how accurate our sample mean is as an estimate of the population mean.
  • Quantifying this requires a measurement of how much we would expect the sample mean to vary from sample to sample.
10 / 41

Suppose we split the class in half down the middle of the classroom and ask each student their heights. Then, we calculate the mean height of students on each side of the classroom. Would you expect these two means to be exactly equal, close but not equal, or wildly different?



Suppose you randomly sample 50 students and 5 of them are left handed. If you were to take another random sample of 50 students, how many would you expect to be left handed? Would you be surprised if only 3 of them were left handed? Would you be surprised if 40 of them were left handed?
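A quick simulation can make the left-handedness question concrete. The sketch below assumes, purely for illustration, that the true proportion of left-handed students is 10% (matching the 5 out of 50 observed); the names `p_left` and `left_counts` are hypothetical.

```r
set.seed(1234)

# Assumption for illustration only: the true proportion of left-handed students is 10%
p_left <- 0.10

# Simulate the number of left-handed students in 10,000 random samples of 50
left_counts <- rbinom(n = 10000, size = 50, prob = p_left)

# Average count across samples (should be close to 50 * 0.10 = 5)
mean(left_counts)

# How often do we see 3 or fewer left-handed students? How often 40 or more?
mean(left_counts <= 3)
mean(left_counts >= 40)
```

Under this assumption, a count as low as 3 turns up fairly often, while a count of 40 is essentially never seen.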

11 / 41

Quantifying the variability of a sample statistic

We can quantify the variability of sample statistics using

  • simulation: via bootstrapping (today)

or

  • theory: via the Central Limit Theorem (in Stat 2!)
12 / 41

Bootstrapping

13 / 41

Bootstrapping

  • The term bootstrapping comes from the phrase "pulling oneself up by one’s bootstraps", which is a metaphor for accomplishing an impossible task without any outside help.
  • In this case the impossible task is estimating a population parameter, and we’ll accomplish it using data from only the given sample.
  • Note that saying something about a population parameter using only information from an observed sample is the crux of statistical inference; it is not limited to bootstrapping.
14 / 41

Rent in Edinburgh

Take a guess! How much does a typical 3 BR flat in Edinburgh rent for?

15 / 41

Sample

Fifteen 3 BR flats in Edinburgh were randomly selected on rightmove.co.uk.

library(tidyverse)
edi_3br <- read_csv2("data/edi-3br.csv") # ; separated
## # A tibble: 15 x 4
##    flat_id  rent title                      address
##    <chr>   <dbl> <chr>                      <chr>
##  1 flat_01   825 3 bedroom apartment to re… Burnhead Grove, Edinburgh, Mid…
##  2 flat_02  2400 3 bedroom flat to rent     Simpson Loan, Quartermile, Edi…
##  3 flat_03  1900 3 bedroom flat to rent     FETTES ROW, NEW TOWN, EH3 6SE
##  4 flat_04  1500 3 bedroom apartment to re… Eyre Crescent, Edinburgh, Midl…
##  5 flat_05  3250 3 bedroom flat to rent     Walker Street, Edinburgh
##  6 flat_06  2145 3 bedroom flat to rent     George Street, City Centre, Ed…
##  7 flat_07  1500 3 bedroom flat to rent     Waverley Place , Edinburgh EH7…
##  8 flat_08  1950 3 bedroom flat to rent     Drumsheugh Place, Edinburgh
##  9 flat_09  1725 3 bedroom flat to rent     Crighton Place, Leith, Edinbur…
## 10 flat_10  2995 3 bedroom flat to rent     Simpson Loan, Meadows, Edinbur…
## 11 flat_11  1400 3 bedroom flat to rent     42, Learmonth Court, Edinburgh…
## 12 flat_12  1995 3 bedroom apartment to re… Chester Street, Edinburgh, Mid…
## 13 flat_13  1250 3 bedroom duplex to rent   Elmwood Terrace, Lochend, Edin…
## 14 flat_14  1995 3 bedroom apartment to re… Great King Street, Edinburgh, …
## 15 flat_15  1600 3 bedroom ground floor fl… Roseneath Terrace,Edinburgh,EH9
16 / 41

Observed sample

17 / 41

Observed sample

Sample mean ≈ £1895 😱


18 / 41

Bootstrap population

Generated assuming there are more flats like the ones in the observed sample... Population mean = ❓


19 / 41

Bootstrapping scheme

  1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.
  2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap sample.
  3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics.
  4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution.
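The four steps above can be sketched in a few lines of base R. This is a minimal illustration, not the approach used later in these slides: the `rent` vector is typed in from the sample shown earlier, and `n_reps` and `boot_means` are hypothetical names.

```r
set.seed(9014)

# Rents (in £) of the 15 sampled flats, copied from the sample shown earlier
rent <- c(825, 2400, 1900, 1500, 3250, 2145, 1500, 1950,
          1725, 2995, 1400, 1995, 1250, 1995, 1600)

n_reps <- 1000  # hypothetical choice for the number of bootstrap samples

# Steps 1-3: resample with replacement (same size as the original sample),
# compute the mean of each resample, and repeat many times
boot_means <- replicate(
  n_reps,
  mean(sample(rent, size = length(rent), replace = TRUE))
)

# Step 4: the middle 95% of the bootstrap distribution
quantile(boot_means, probs = c(0.025, 0.975))
```

Because each bootstrap sample is drawn with replacement, some flats appear more than once and others not at all, which is what makes the bootstrap means vary.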
20 / 41

Let's bootstrap

21 / 41

Bootstrapping in R

22 / 41

Two ways

  1. Using for loops
  2. Using the infer package
23 / 41

Bootstrapping with for loops

Work in groups to explain what is happening in each line of the code below.

set.seed(9014)
boot_df <- tibble(
  replicate = 1:100,
  stat = rep(NA, 100)
)
for (i in 1:100){
  boot_df$stat[i] <- edi_3br %>%
    sample_n(15, replace = TRUE) %>%
    summarise(stat = mean(rent)) %>%
    pull()
}
24 / 41

Bootstrap results

ggplot(boot_df, aes(x = stat)) +
  geom_histogram(binwidth = 100)

boot_df %>%
  summarise(
    lower = quantile(stat, 0.025),
    upper = quantile(stat, 0.975)
  )
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 1594. 2292.
25 / 41

infer (part of tidymodels)

The objective of infer is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.

library(infer)
26 / 41

Generate bootstrap means

edi_3br %>%
  # specify the variable of interest
  specify(response = rent)
27 / 41

Generate bootstrap means

edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap")
28 / 41

Generate bootstrap means

edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>%
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")
29 / 41

Generate bootstrap means

# save resulting bootstrap distribution
boot_df <- edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>%
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")
30 / 41

The bootstrap sample

How many observations are there in boot_df? What does each observation represent?

boot_df
## # A tibble: 15,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1 1793.
##  2         2 1938.
##  3         3 2175
##  4         4 2159.
##  5         5 2084
##  6         6 1761
##  7         7 1787.
##  8         8 1817.
##  9         9 1963.
## 10        10 1800.
## # … with 14,990 more rows
31 / 41

Visualize the bootstrap distribution

ggplot(data = boot_df, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 100) +
  labs(title = "Bootstrap distribution of means")

32 / 41

Calculate the confidence interval

A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.

boot_df %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 1603. 2213.
33 / 41

Visualize the confidence interval

34 / 41

Interpret the confidence interval

The 95% confidence interval for the mean rent of three bedroom flats in Edinburgh was calculated as (1603, 2213). Which of the following is the correct interpretation of this interval?


(a) 95% of the time the mean rent of three bedroom flats in this sample is between £1603 and £2213.

(b) 95% of all three bedroom flats in Edinburgh have rents between £1603 and £2213.

(c) We are 95% confident that the mean rent of all three bedroom flats is between £1603 and £2213.

(d) We are 95% confident that the mean rent of three bedroom flats in this sample is between £1603 and £2213.

35 / 41

Accuracy vs. precision

36 / 41

Confidence level

We are 95% confident that ...

  • Suppose we took many samples from the original population and built a 95% confidence interval based on each sample.
  • Then about 95% of those intervals would contain the true population parameter.
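The repeated-sampling idea can be checked with a small simulation. The sketch below is built on assumptions that are not in these slides: a made-up population of 10,000 rents, samples of size 50, and a hypothetical helper `boot_ci()` that computes a percentile bootstrap interval.

```r
set.seed(2024)

# Hypothetical population: 10,000 rents with a made-up mean and spread
population <- rnorm(10000, mean = 1900, sd = 600)
true_mean <- mean(population)

# Hypothetical helper: percentile bootstrap interval for the mean of one sample
boot_ci <- function(x, reps = 500, level = 0.95) {
  boot_means <- replicate(reps, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Take many samples of size 50 and record whether each interval
# captures the true population mean
n_samples <- 200
captured <- replicate(n_samples, {
  ci <- boot_ci(sample(population, size = 50))
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals that contain the true population mean
mean(captured)
```

The proportion should land near 0.95, although percentile bootstrap intervals built from modest samples can fall a little short of the nominal level.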
37 / 41

Commonly used confidence levels

Which line (orange dash/dot, blue dash, green dot) represents which confidence level?

38 / 41

Precision vs. accuracy

If we want to be very certain that we capture the population parameter, should we use a wider interval or a narrower interval? What drawbacks are associated with using a wider interval?

[Image: Garfield]

How can we get the best of both worlds: high precision and high accuracy?

39 / 41

Changing confidence level

How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval?

edi_3br %>%
  specify(response = rent) %>%
  generate(reps = 15000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
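One way to reason about the change: a confidence level of `level` leaves `1 - level` outside the interval, split equally between the two tails. The hypothetical helper `ci_bounds()` below makes that arithmetic explicit; it is a sketch for illustration and not part of infer.

```r
# Hypothetical helper: map a confidence level to the two quantile cutoffs
ci_bounds <- function(level) {
  tail_area <- (1 - level) / 2  # area left out in each tail
  c(lower = tail_area, upper = 1 - tail_area)
}

ci_bounds(0.95)  # the cutoffs used in the code above
ci_bounds(0.90)
ci_bounds(0.99)
```

The resulting pair of values can be passed to the two `quantile()` calls in place of 0.025 and 0.975.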
40 / 41

Recap

  • Sample statistic ≠ population parameter, but if the sample is good, it can be a good estimate.
  • We report the estimate with a confidence interval, and the width of this interval depends on the variability of sample statistics from different samples from the population.
  • Since we can't continue sampling from the population, we bootstrap from the one sample we have to estimate sampling variability.
  • We can do this for any sample statistic:
    • For a mean: calculate(stat = "mean")
    • For a median: calculate(stat = "median")
    • Learn about calculating bootstrap intervals for other statistics in your homework.
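To see that the same recipe carries over to other statistics, here is a base R sketch in which the statistic is passed in as a function. The helper `boot_interval()` and its arguments are hypothetical and only mirror the idea behind switching `stat = "mean"` to `stat = "median"`; the `rent` values are typed in from the sample shown earlier.

```r
set.seed(9014)

# Rents (in £) of the 15 sampled flats, copied from the sample shown earlier
rent <- c(825, 2400, 1900, 1500, 3250, 2145, 1500, 1950,
          1725, 2995, 1400, 1995, 1250, 1995, 1600)

# Hypothetical helper: percentile bootstrap interval for any statistic
boot_interval <- function(x, stat_fun, reps = 5000, level = 0.95) {
  # apply stat_fun to each resample drawn with replacement
  boot_stats <- replicate(reps, stat_fun(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2))
}

boot_interval(rent, median)  # interval for the median rent
boot_interval(rent, mean)    # interval for the mean rent, for comparison
```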
41 / 41
