# Tips for effective data visualization <br> 💅
### Dr. Çetinkaya-Rundel

---

layout: true
  
<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="https://introds.org" target="_blank">introds.org
</a>
</span>
</div>

---

## Week 4

- Preparing for tomorrow's workshop: Make sure you complete the required reading for the week, and check your email before the workshop to find out whether there are any changes to your team.
- Pull requests for feedback on code style.

---

# Wrapping up last week's material...

---

## Vectors vs. lists

.pull-left[
.small[

```r
x <- c(8, 4, 7)
```
]
.small[

```r
x[1]
```

```
## [1] 8
```
]
.small[

```r
x[[1]]
```

```
## [1] 8
```
]
]
--
.pull-right[
.small[

```r
y <- list(8, 4, 7)
```
]
.small[

```r
y[2]
```

```
## [[1]]
## [1] 4
```
]
.small[

```r
y[[2]]
```

```
## [1] 4
```
]
]

<br>

**Note:** When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online.
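To make the difference concrete, here is a minimal sketch in base R. The names `v` and `l` are illustrative stand-ins for `x` and `y` above. It compares what single and double brackets return for a vector and for a list.

```r
# Illustrative objects (hypothetical names), mirroring x and y above
v <- c(8, 4, 7)      # atomic (double) vector
l <- list(8, 4, 7)   # list

# Single brackets keep the container type
class(v[1])    # "numeric"
class(l[2])    # "list"

# Double brackets extract the element itself
class(v[[1]])  # "numeric"
class(l[[2]])  # "numeric"

# Consequence: arithmetic works on l[[2]], but l[2] + 1 would be an error
l[[2]] + 1     # 5
```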

---

# Data "set"

---

## Data "sets" in R

- "set" is in quotation marks because it is not a formal data class
- A tidy data "set" can be one of the following types:
    + `tibble`
    + `data.frame`
- We'll often work with `tibble`s:
    + `readr` package (e.g. `read_csv` function) loads data as a `tibble` by default
    + `tibble`s are part of the tidyverse, so they work well with other packages we are using
    + they make minimal assumptions about your data, so they are less likely to cause hard-to-track bugs in your code
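
As a rough illustration of "minimal assumptions", the sketch below stores the same made-up column in a `data.frame` and in a `tibble`, then compares two behaviors: dropping to a vector when subsetting, and partial name matching with `$`. The object names are hypothetical.

```r
library(tibble)

# The same (hypothetical) data stored two ways
df_base <- data.frame(value = 1:3)
df_tbl  <- tibble(value = 1:3)

# Selecting one column with [ , ] silently drops a data.frame to a vector ...
class(df_base[, "value"])  # "integer"

# ... whereas a tibble stays a tibble
class(df_tbl[, "value"])   # "tbl_df" "tbl" "data.frame"

# $ does partial name matching on a data.frame ...
df_base$val                # 1 2 3

# ... while a tibble returns NULL (with a warning) for an unknown column
is.null(suppressWarnings(df_tbl$val))  # TRUE
```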

---

## Data frames

- A data frame is the most commonly used data structure in R. It is a list of equal-length vectors (usually atomic vectors, though list columns are also possible). Each vector is treated as a column, and the elements of the vectors form the rows.
- A tibble is a type of data frame that ... makes your life (i.e. data analysis) easier.
- Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.
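
The "list of equal-length vectors" idea can be checked directly. This is a small base R sketch with made-up data; the `people` object is purely illustrative.

```r
# A hypothetical data frame built from scratch with base R
people <- data.frame(
  name = c("Ada", "Grace", "Alan"),
  age  = c(36, 45, 41)
)

is.list(people)         # TRUE: a data frame is a list under the hood
length(people)          # 2: one list element per column
sapply(people, length)  # each column holds 3 elements, one per row
people[["age"]]         # list-style extraction returns the column as a vector
```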

---

## Data frames (cont.)

```r
df <- tibble(
  x = 1:3, 
  y = c("a", "b", "c")
  )
class(df)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

```r
glimpse(df)
```

```
## Observations: 3
## Variables: 2
## $ x <int> 1, 2, 3
## $ y <chr> "a", "b", "c"
```

---

## Data frames (cont.)

```r
attributes(df)
```

```
## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
```
.pull-left[

```r
class(df$x)
```

```
## [1] "integer"
```
]
.pull-right[

```r
class(df$y)
```

```
## [1] "character"
```
]

---

## Working with tibbles in pipelines

```r
# Calculate mean number of cats and store it
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))

# Filter for where number_of_cats is less than mean_cats
cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
```

```
## [1] 60
```

---

## The result of `summarise()` is a tibble

```r
mean_cats
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817
```

```r
class(mean_cats)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

---

## `pull()` can be your new best friend

But use it sparingly!

```r
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats)) %>%
  pull()

mean_cats
```

```
## [1] 0.8166667
```

```r
cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
```

```
## [1] 33
```

.pull-left[

```r
mean_cats
```

```
## [1] 0.8166667
```
]
.pull-right[

```r
class(mean_cats)
```

```
## [1] "numeric"
```
]

---

# Factors

---

## Factors

Factor objects are how R stores data for categorical variables (variables that take a fixed number of discrete values).

```r
(x <- factor(c("BS", "MS", "PhD", "MS")))
```

```
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
```

```r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

```r
typeof(x)
```

```
## [1] "integer"
```
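
`typeof()` reports `"integer"` because a factor is stored as integer codes plus a vector of level labels. A minimal base R sketch of that relationship, using the hypothetical name `degrees` for the same values:

```r
degrees <- factor(c("BS", "MS", "PhD", "MS"))

levels(degrees)      # "BS" "MS" "PhD": the label lookup table
as.integer(degrees)  # 1 2 3 2: the stored codes

# Indexing the levels by the codes recovers the original labels
levels(degrees)[as.integer(degrees)]  # "BS" "MS" "PhD" "MS"
```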

---

## Read data in as character strings

```r
glimpse(cat_lovers)
```

```
## Observations: 60
## Variables: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <dbl> 0, 0, 1, 3, 3, 2, 1, 1, 0, 0, 0, 0, 1, 3, 3, 2, 1…
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "…
```

---

## But coerce to factor when plotting

```r
ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
```

---

## Use forcats to manipulate factors

```r
cat_lovers <- cat_lovers %>%
  mutate(handedness = fct_relevel(
    handedness,
    "right", "left", "ambidextrous"
  ))
```

---

## Come for the functionality

.pull-left[
... stay for the logo
]
.pull-right[
<img src="img/forcats-part-of-tidyverse.png" width="60%" />
]

- R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors.
- Factors are useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display. The **forcats** package provides a suite of useful tools that solve common problems with factors.

---

## Recap

- Always best to think of data as part of a tibble
    + This plays nicely with the `tidyverse` as well
    + Rows are observations, columns are variables
--
- Be careful about data types / classes
    + Sometimes `R` makes silly assumptions about your data class 
        + Using `tibble`s helps, but it might not solve all issues
        + Think about your data in context, e.g. a 0/1 variable is most likely a `factor`
    + If a plot/output is not behaving the way you expect, first investigate the data class
    + If you are absolutely sure of a data class, overwrite it in your tibble so that you don't have to keep track of it
        + `mutate` the variable with the correct class
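
The last point might look like the sketch below. The `survey` tibble and its `smoker` column are made up for illustration; the idea is to overwrite the 0/1 column once so that every later step sees a factor.

```r
library(dplyr)
library(tibble)

# Hypothetical data: `smoker` is coded 0/1 and is therefore stored as numeric
survey <- tibble(
  id     = 1:4,
  smoker = c(0, 1, 1, 0)
)
class(survey$smoker)   # "numeric"

# Overwrite the column once, with explicit levels and readable labels
survey <- survey %>%
  mutate(smoker = factor(smoker, levels = c(0, 1), labels = c("no", "yes")))

class(survey$smoker)   # "factor"
levels(survey$smoker)  # "no" "yes"
```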

---

# Designing effective visualizations

---

## Keep it simple

---

## Use color to draw attention

---

## Tell a story

---

# Principles for effective visualizations

---

## Principles for effective visualizations

- Order matters
- Put long categories on the y-axis
- Keep scales consistent
- Select meaningful colors
- Use meaningful and nonredundant labels

---

## Data

In September 2019, a YouGov survey asked 1,639 GB adults the following question:

> In hindsight, do you think Britain was right/wrong to vote to leave EU?
>
>- Right to leave  
>- Wrong to leave  
>- Don't know

.footnote[ 
Source: [YouGov Survey Results](https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x0msmggx08/YouGov%20-%20Brexit%20and%202019%20election.pdf), retrieved Oct 7, 2019
]

---

# Order matters

---

## Alphabetical order is rarely ideal

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar()
```

---

## Order by frequency

`fct_infreq`: Reorder factor levels by frequency

```r
ggplot(data = brexit, aes(x = fct_infreq(opinion))) +
  geom_bar()
```
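
To see what `fct_infreq()` does without drawing a plot, compare the level order before and after. The responses below are made up for illustration and are not the survey data.

```r
library(forcats)

# Hypothetical responses, not the actual survey data
opinion <- factor(c("Right", "Wrong", "Wrong", "Don't know", "Wrong", "Right"))

levels(opinion)              # alphabetical by default
levels(fct_infreq(opinion))  # "Wrong" "Right" "Don't know" (most frequent first)
```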

---

## Clean up labels

```r
ggplot(data = brexit, aes(x = fct_infreq(opinion))) +
  geom_bar() +
  labs(x = "Opinion", y = "Count")
```

---

## Alphabetical order is rarely ideal

```r
ggplot(data = brexit, aes(x = region)) +
  geom_bar()
```

---

## Use inherent level order

`fct_relevel`: Reorder factor levels using a custom order

```r
brexit <- brexit %>%
  mutate(
    region = fct_relevel(
      region,
      "london", "rest_of_south", "midlands_wales", "north", "scot"
    )
  )
```

---

## Clean up labels

```r
brexit <- brexit %>%
  mutate(
    region = fct_recode(
      region,
      London = "london", 
      `Rest of South` = "rest_of_south", 
      `Midlands / Wales` = "midlands_wales", 
      North = "north", 
      Scotland = "scot"
    )
  )
```
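
A quick way to check that relabelling keeps the custom order is to inspect the levels on a small stand-alone factor. The `region` vector below is a toy example with one value per region, not the `brexit` data.

```r
library(forcats)

# Toy factor with one value per region (not the brexit data)
region <- factor(c("scot", "london", "rest_of_south", "north", "midlands_wales"))
levels(region)
# "london" "midlands_wales" "north" "rest_of_south" "scot" (alphabetical)

# Step 1: impose the custom order
region <- fct_relevel(
  region,
  "london", "rest_of_south", "midlands_wales", "north", "scot"
)

# Step 2: relabel; fct_recode() changes the labels but keeps the level order
region <- fct_recode(
  region,
  London = "london",
  `Rest of South` = "rest_of_south",
  `Midlands / Wales` = "midlands_wales",
  North = "north",
  Scotland = "scot"
)
levels(region)
# "London" "Rest of South" "Midlands / Wales" "North" "Scotland"
```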

---

## Clean up labels (cont.)

```r
ggplot(data = brexit, aes(x = region)) +
  geom_bar()
```

---

# Put long categories on <br> the y-axis

---

## Long categories can be hard to read

```r
ggplot(data = brexit, aes(x = region)) +
  geom_bar()
```

---

## Move them to the y-axis

```r
ggplot(data = brexit, aes(x = region)) +
  geom_bar() +
  coord_flip()
```

---

## Move them to the y-axis

```r
*ggplot(data = brexit, aes(x = fct_rev(region))) +
  geom_bar() +
  coord_flip()
```
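
Why `fct_rev()`? After flipping, the first level is drawn at the bottom of the axis, so reversing the levels puts the first region back on top. The sketch below uses a made-up `toy` data frame. It also shows an alternative that assumes ggplot2 3.3.0 or later, where `geom_bar()` accepts a `y` aesthetic and `coord_flip()` is not needed.

```r
library(forcats)
library(ggplot2)

# Made-up data with an explicit level order
toy <- data.frame(
  region = factor(
    c("London", "North", "North", "Scotland"),
    levels = c("London", "North", "Scotland")
  )
)

levels(toy$region)           # "London" "North" "Scotland"
levels(fct_rev(toy$region))  # "Scotland" "North" "London"

# Flipped bar plot: reversed levels put London at the top
p_flip <- ggplot(toy, aes(x = fct_rev(region))) +
  geom_bar() +
  coord_flip()

# Assumption: ggplot2 >= 3.3.0, mapping the variable to y directly
p_y <- ggplot(toy, aes(y = fct_rev(region))) +
  geom_bar()
```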

---

## Clean up labels

```r
ggplot(data = brexit, aes(x = fct_rev(region))) +
  geom_bar() +
* labs(x = "Region", y = "") +
  coord_flip()
```

---

# Pick a purpose

---

## Segmented bar plots can be hard to read

```r
ggplot(data = brexit, aes(x = region, fill = opinion)) +
  geom_bar() +
  coord_flip()
```

---

## Use facets

```r
ggplot(data = brexit, aes(x = opinion, fill = region)) +
  geom_bar() +
  coord_flip() +
  facet_grid(. ~ region)
```

---

## Avoid redundancy

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  coord_flip() +
  facet_grid(. ~ region)
```

---

## Informative labels

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  coord_flip() +
  facet_grid(. ~ region) +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
*   x = "",
*   y = ""
  ) 
```

---

## A bit more info

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  coord_flip() +
  facet_grid(. ~ region) +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
*   subtitle = "YouGov Survey Results, 2-3 September 2019",
*   caption = "Source: https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x0msmggx08/YouGov%20-%20Brexit%20and%202019%20election.pdf",
    x = "", 
    y = "")
```

---

![](w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-49-1.png)

---

## Let's do better

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  coord_flip() +
  facet_grid(. ~ region) +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
*   caption = "Source: bit.ly/2lCJZVg",
    x = "", 
    y = ""
  )
```

---

![](w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-51-1.png)

---

## Fix up facet labels

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  coord_flip() +
* facet_grid(. ~ region, labeller = label_wrap_gen(width = 12)) +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
    caption = "Source: bit.ly/2lCJZVg",
    x = "", 
    y = ""
  )
```
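
To get a feel for what `width = 12` does to the facet labels, the sketch below approximates the wrapping with base R's `strwrap()`. It illustrates the effect on the label text only and is not meant as a description of ggplot2's internals.

```r
# The five region labels used in the facets
labels <- c("London", "Rest of South", "Midlands / Wales", "North", "Scotland")

# Wrap each label at a target width of 12, joining the pieces with a newline
wrapped <- vapply(
  strwrap(labels, width = 12, simplify = FALSE),
  paste, character(1), collapse = "\n"
)

# Short labels are unchanged; long ones are split over two lines
cat(wrapped, sep = "\n---\n")
```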

---

![](w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-53-1.png)

---

# Select meaningful colors

---

## Rainbow colors are not always the right choice

```r
ggplot(data = brexit, aes(x = region, fill = opinion)) +
  geom_bar(position = "fill") +
  coord_flip()
```

---

## Viridis scale works well with ordinal data

```r
ggplot(data = brexit, aes(x = region, fill = opinion)) +
  geom_bar(position = "fill") +
  coord_flip() +
  scale_fill_viridis_d()
```
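
Discrete scales assign colors in level order, so the viridis gradient only reads as ordinal if the factor levels are in a meaningful order. A minimal sketch, assuming a hypothetical ordering of the three responses and using `scales::viridis_pal()` to preview the colors (the scales package is installed alongside ggplot2):

```r
library(forcats)

# Hypothetical ordering of the responses, from "Right" to "Wrong"
opinion <- factor(c("Wrong", "Right", "Don't know", "Wrong"))
opinion <- fct_relevel(opinion, "Right", "Don't know", "Wrong")

# viridis_pal() returns a palette function; ask it for one color per level
pal <- scales::viridis_pal()(nlevels(opinion))

# Colors paired with levels in the order a discrete scale would assign them
setNames(pal, levels(opinion))
```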

---

## Clean up labels