Scientific studies and confounding 😖

# Scientific studies and confounding <br> 😖
### Dr. Çetinkaya-Rundel

---

layout: true
  
<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="https://introds.org" target="_blank">introds.org
</a>
</span>
</div>

---

## Week 5

- Scientific studies, confounding, planning, and effective communication
- Project proposals
- Course and team evals
- Team meetings
- Regrade requests

---

# Scientific studies

---

## Scientific studies

.pull-left[
**Observational**  
- Collect data in a way that does not interfere with how the data arise ("observe")
- Can only establish an association
]
.pull-right[
**Experimental**  
- Randomly assign subjects to treatments
- Can establish causal connections
]

<br>

.question[
👥 Design a study comparing average energy levels of people who do and do not exercise -- both as an observational study and as an experiment.
]
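
---

## Random assignment: a sketch

The experimental version of the exercise study hinges on random assignment. Below is a minimal sketch of that step, assuming 20 hypothetical volunteers and two hypothetical treatments, `exercise` and `no exercise`. The seed, the sample size, and all names are illustrative assumptions, not part of any real study.

```r
# Hypothetical roster of 20 volunteers (IDs are made up for illustration)
set.seed(1234)
subjects <- data.frame(id = 1:20)

# Shuffle a balanced vector of treatment labels so that chance alone,
# not the subjects' own habits, decides who exercises
subjects$treatment <- sample(rep(c("exercise", "no exercise"), each = 10))

# Each treatment ends up with 10 subjects
table(subjects$treatment)
```

In the observational version there is no `sample()` step: subjects report whether they already exercise, so anything else that differs between the two groups (age, sleep, health) may confound the comparison.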

---

## Study: Breakfast cereal keeps girls slim

.midi[
*Girls who ate breakfast of any type had a lower average body mass index, a common obesity gauge, than those who said they didn't. The index was even lower for girls who said they ate cereal for breakfast, according to findings of the study conducted by the Maryland Medical Research Institute with funding from the National Institutes of Health (NIH) and cereal-maker General Mills.*
 
[...]
 
*The results were gleaned from a larger NIH survey of 2,379 girls in California, Ohio, and Maryland who were tracked between the ages of 9 and 19.*
 
[...]
 
*As part of the survey, the girls were asked once a year what they had eaten during the previous three days.*
 
[...]
]

.footnote[
Source: [Study: Cereal Keeps Girls Slim](https://www.cbsnews.com/news/study-cereal-keeps-girls-slim/), Retrieved Sep 13, 2018.
]

---

### 3 possible explanations

- Eating breakfast causes girls to be slimmer

--
- Being slim causes girls to eat breakfast

--
- A third variable is responsible for both -- a **confounding** variable: an extraneous variable that affects both the explanatory and the response variable, and that makes it seem like there is a relationship between them
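
---

## Confounding: a simulated sketch

The third explanation can be illustrated with simulated data. This is a sketch under stated assumptions, not an analysis of the cereal study: `z` is a hypothetical confounder, and `x` and `y` are made-up variables that each depend on `z` but not on each other.

```r
set.seed(2018)
n <- 5000

# z is a hypothetical confounder (say, overall household health habits)
z <- rnorm(n)

# x and y each depend on z but NOT on each other
x <- z + rnorm(n)
y <- z + rnorm(n)

# Marginal correlation: clearly nonzero, even though x does not cause y
cor_xy <- cor(x, y)

# Correlation after removing the part of x and of y that z explains
cor_given_z <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

round(c(marginal = cor_xy, adjusted_for_z = cor_given_z), 2)
```

The marginal correlation is sizeable, while the correlation adjusted for `z` is close to zero: the apparent relationship between `x` and `y` is entirely due to the confounder.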

---

## Correlation != causation

---

## Studies and conclusions

---

# Conditional probability

---

## Conditional probability

**Notation**: `\(P(A | B)\)`: Probability of event A given event B

- What is the probability that it will be unseasonably warm tomorrow?
- What is the probability that it will be unseasonably warm tomorrow, given that it was unseasonably warm today?

---

.midi[
A July 2019 YouGov survey asked 1633 randomly selected adults in GB and 1333 in the US 
which of the following statements about the global environment best describes 
their view:
.small[
- The climate is changing and human activity is mainly responsible  
- The climate is changing and human activity is partly responsible, together with other factors  
- The climate is changing but human activity is not responsible at all  
- The climate is not changing  
]
The distribution of the responses by country of respondent is shown below.
]

<br>

.small[
<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> The climate is changing and human activity is mainly responsible </th>
   <th style="text-align:right;"> The climate is changing and human activity is partly responsible, together with other factors </th>
   <th style="text-align:right;"> The climate is changing but human activity is not responsible at all </th>
   <th style="text-align:right;"> The climate is not changing </th>
   <th style="text-align:right;"> Don't know </th>
   <th style="text-align:right;"> Sum </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> GB </td>
   <td style="text-align:right;width: 0.5 in; "> 833 </td>
   <td style="text-align:right;width: 0.5 in; "> 604 </td>
   <td style="text-align:right;width: 0.5 in; "> 49 </td>
   <td style="text-align:right;width: 0.5 in; "> 33 </td>
   <td style="text-align:right;"> 114 </td>
   <td style="text-align:right;"> 1633 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> US </td>
   <td style="text-align:right;width: 0.5 in; "> 507 </td>
   <td style="text-align:right;width: 0.5 in; "> 493 </td>
   <td style="text-align:right;width: 0.5 in; "> 120 </td>
   <td style="text-align:right;width: 0.5 in; "> 80 </td>
   <td style="text-align:right;"> 133 </td>
   <td style="text-align:right;"> 1333 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sum </td>
   <td style="text-align:right;width: 0.5 in; "> 1340 </td>
   <td style="text-align:right;width: 0.5 in; "> 1097 </td>
   <td style="text-align:right;width: 0.5 in; "> 169 </td>
   <td style="text-align:right;width: 0.5 in; "> 113 </td>
   <td style="text-align:right;"> 247 </td>
   <td style="text-align:right;"> 2966 </td>
  </tr>
</tbody>
</table>
]

.footnote[
Source: [YouGov - International Climate Change Survey](https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/epjj0nusce/YouGov%20-%20International%20climate%20change%20survey.pdf)
]

---

.question[
👥 
- What percent of (1) all respondents, (2) GB respondents,  
(3) US respondents think the climate is changing and  
human activity is mainly responsible?  
- Based on the percentages you calculate, does there appear to be a relationship 
between country and beliefs about climate change? Explain your reasoning.  
- If yes, could there be another variable that explains this relationship?
]
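
---

## Computing the percentages: a sketch

The three percentages in the first question are one marginal and two conditional probabilities. A minimal sketch in base R, with the counts copied from the survey table; the object names and the shortened column labels are assumptions made for illustration.

```r
# Counts copied from the survey table (rows: country, columns: view)
climate <- matrix(
  c(833, 604,  49, 33, 114,
    507, 493, 120, 80, 133),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    country = c("GB", "US"),
    view = c("mainly", "partly", "not responsible", "not changing", "don't know")
  )
)

# (1) P(mainly): share of all respondents
p_all <- sum(climate[, "mainly"]) / sum(climate)

# (2) and (3) P(mainly | country): share within each country's row
p_by_country <- climate[, "mainly"] / rowSums(climate)

round(100 * c(all = p_all, p_by_country), 1)
```

The results are roughly 45% of all respondents, 51% of GB respondents, and 38% of US respondents.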

---

## Independence

.question[
👥 Inspired by the previous example and how we used the conditional probabilities to make conclusions, come up with a definition of independent events. If easier, you can keep the context limited to the example (independence/dependence of beliefs about climate change and country), but try to push yourself to make a more general statement.
]
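
---

## Checking independence: a sketch

One common formalization, offered here as a sketch rather than as the answer the question is looking for: events A and B are independent when `\(P(A | B) = P(A)\)`, or equivalently when `\(P(A \text{ and } B) = P(A) \times P(B)\)`. The code below checks both forms with A = "mainly responsible" and B = "respondent is from GB", using counts from the survey table. Object names are illustrative assumptions.

```r
# Counts from the survey table, collapsed to "mainly responsible" vs. any other answer
counts <- matrix(
  c(833, 1633 - 833,
    507, 1333 - 507),
  nrow = 2, byrow = TRUE,
  dimnames = list(country = c("GB", "US"), view = c("mainly", "other"))
)
n <- sum(counts)

p_mainly          <- sum(counts[, "mainly"]) / n   # P(A)
p_gb              <- sum(counts["GB", ]) / n       # P(B)
p_mainly_and_gb   <- counts["GB", "mainly"] / n    # P(A and B)
p_mainly_given_gb <- p_mainly_and_gb / p_gb        # P(A | B)

# Under independence these two would match, up to sampling variability
c(p_mainly_given_gb = p_mainly_given_gb, p_mainly = p_mainly)

# Equivalent check: P(A and B) versus P(A) * P(B)
c(joint = p_mainly_and_gb, product = p_mainly * p_gb)
```

In both comparisons the first number is larger than the second, which suggests that, in this sample, beliefs about climate change and country are not independent.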

---

# Simpson's paradox

---

## Relationships between variables

- Relationship between two variables: Fitness `\(\rightarrow\)` Heart health
- Relationship between multiple variables: Calories + Age + Fitness `\(\rightarrow\)` Heart health

---

## Relationship between two variables

<img src="w5_d1-studies-confounding_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
---

## Relationship between two variables

<img src="w5_d1-studies-confounding_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
---

## Considering a third variable

---

## Relationship between three variables

---

## Simpson's paradox

- Not considering an important variable when studying a relationship can result 
in **Simpson's paradox**.
- Simpson's paradox illustrates the effect the omission of an explanatory 
variable can have on the measure of association between another explanatory 
variable and a response variable. 
- In other words, the inclusion of a third variable in the analysis can change 
the apparent relationship between the other two variables.
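
---

## Simpson's paradox: a toy example

The reversal described above can be reproduced with a tiny made-up data set. This is a sketch: the eight points and the grouping variable are invented for illustration and do not come from any study.

```r
# Toy data (made up): two groups of four points each
toy <- data.frame(
  group = rep(c("a", "b"), each = 4),
  x = c(1, 2, 3, 4, 6, 7, 8, 9),
  y = c(6, 7, 8, 9, 1, 2, 3, 4)
)

# Ignoring group: x and y appear negatively related
cor_overall <- cor(toy$x, toy$y)

# Within each group: x and y are positively related
cor_by_group <- sapply(split(toy, toy$group), function(d) cor(d$x, d$y))

cor_overall
cor_by_group
```

The overall correlation is negative while the correlation within each group is positive, so including `group` in the analysis flips the apparent direction of the relationship between `x` and `y`.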

---

## Berkeley admission data

- Study carried out by the Graduate Division of the University of California, Berkeley in the early 1970s to evaluate whether there was a sex bias in graduate admissions.
- The data come from six departments. For confidentiality we'll call them A-F. 
- We have information on whether the applicant was male or female and whether they were admitted or rejected. 
- First, we will evaluate whether the percentage of males admitted is indeed higher than the percentage of females admitted, overall. Next, we will calculate the same percentages for each department.

---

## Data

```
## # A tibble: 4,526 x 3
##    admit    sex   dept 
##    <chr>    <chr> <chr>
##  1 Admitted Male  A    
##  2 Admitted Male  A    
##  3 Admitted Male  A    
##  4 Admitted Male  A    
##  5 Admitted Male  A    
##  6 Admitted Male  A    
##  7 Admitted Male  A    
##  8 Admitted Male  A    
##  9 Admitted Male  A    
## 10 Admitted Male  A    
## # … with 4,516 more rows
```
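
---

## Data: a possible reconstruction

The slides do not show how `ucb_admit` was created. A data frame with the same shape can be rebuilt from the `UCBAdmissions` table that ships with base R, assuming that table is the underlying source (its total of 4,526 applicants matches the tibble above). This is a sketch in base R, not the original loading code.

```r
# UCBAdmissions (package "datasets") is a 2 x 2 x 6 table of counts
ucb_counts <- as.data.frame(UCBAdmissions)  # columns: Admit, Gender, Dept, Freq

# Repeat each row Freq times to get one row per applicant
ucb_admit <- ucb_counts[rep(seq_len(nrow(ucb_counts)), ucb_counts$Freq),
                        c("Admit", "Gender", "Dept")]

# Match the column names and character types shown on the slide
names(ucb_admit) <- c("admit", "sex", "dept")
ucb_admit[] <- lapply(ucb_admit, as.character)
rownames(ucb_admit) <- NULL

dim(ucb_admit)
head(ucb_admit, 3)
```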

---

## Skim the data

```r
library(skimr)
skim(ucb_admit)
```

```
## Skim summary statistics
##  n obs: 4526 
##  n variables: 3 
## 
## ── Variable type:character ───────────────────────────────────────────────────────────
##  variable missing complete    n min max empty n_unique
##     admit       0     4526 4526   8   8     0        2
##      dept       0     4526 4526   1   1     0        6
##       sex       0     4526 4526   4   6     0        2
```

---

## Overall sex distribution

.question[
What can you say about the overall sex distribution? Hint: Calculate the following probabilities: `\(P(Admit | Male)\)` and `\(P(Admit | Female)\)`.
]

```r
ucb_admit %>%
  count(sex, admit)
```

```
## # A tibble: 4 x 3
##   sex    admit        n
##   <chr>  <chr>    <int>
## 1 Female Admitted   557
## 2 Female Rejected  1278
## 3 Male   Admitted  1198
## 4 Male   Rejected  1493
```

---

## Overall sex distribution

```r
ucb_admit %>%
  count(sex, admit) %>%
  group_by(sex) %>%
  mutate(prop_admit = n / sum(n))
```

```
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex    admit        n prop_admit
##   <chr>  <chr>    <int>      <dbl>
## 1 Female Admitted   557      0.304
## 2 Female Rejected  1278      0.696
## 3 Male   Admitted  1198      0.445
## 4 Male   Rejected  1493      0.555
```
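
---

## Overall sex distribution: a base R cross-check

The proportions above can be cross-checked without dplyr. The sketch below assumes that the built-in `UCBAdmissions` table holds the same counts as `ucb_admit` (its totals by sex and admission status match the `count()` output shown earlier). Object names are illustrative.

```r
# Collapse the built-in table over department: rows = sex, columns = admit
sex_by_admit <- apply(UCBAdmissions, c(2, 1), sum)

# Row proportions: each row sums to 1, so the "Admitted" column is P(Admit | sex)
p_admit_given_sex <- prop.table(sex_by_admit, margin = 1)

round(p_admit_given_sex, 3)
```

The `Admitted` column holds `\(P(Admit | Male)\)` and `\(P(Admit | Female)\)`, the two conditional probabilities asked for in the hint.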

---

## Overall sex distribution

```r
ggplot(ucb_admit, mapping = aes(x = sex, fill = admit)) +
  geom_bar(position = "fill") + 
  labs(y = "", title = "Admit by sex")
```

![](w5_d1-studies-confounding_files/figure-html/unnamed-chunk-21-1.png)

---