class: center, middle, inverse, title-slide

# Tips for effective data visualization 💅
### Dr. Çetinkaya-Rundel

---
layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel - <a href="https://introds.org" target="_blank">introds.org </a>
</span>
</div>

---

## Week 4

- Preparing for tomorrow's workshop: Make sure you complete the required reading for the week, and check your email before the workshop to find out if there were any changes to your team.
- Pull requests for feedback on code style.

.question[
.large[
Any questions?
]
]

---
class: center, middle

# Wrapping up last week's material...

---

## Vectors vs. lists

.pull-left[
.small[
```r
x <- c(8,4,7)
```
]
.small[
```r
x[1]
```

```
## [1] 8
```
]
.small[
```r
x[[1]]
```

```
## [1] 8
```
]
]

--

.pull-right[
.small[
```r
y <- list(8,4,7)
```
]
.small[
```r
y[2]
```

```
## [[1]]
## [1] 4
```
]
.small[
```r
y[[2]]
```

```
## [1] 4
```
]
]

--

<br>

**Note:** When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online.

---

<img src="img/hadley-salt-pepper.png" width="90%" />

---
class: center, middle

# Data "set"

---

## Data "sets" in R

- "set" is in quotation marks because it is not a formal data class
- A tidy data "set" can be one of the following types:
    + `tibble`
    + `data.frame`
- We'll often work with `tibble`s:
    + `readr` package (e.g. `read_csv` function) loads data as a `tibble` by default
    + `tibble`s are part of the tidyverse, so they work well with other packages we are using
    + they make minimal assumptions about your data, so are less likely to cause hard-to-track bugs in your code

---

## Data frames

- A data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but generic vectors, i.e. lists, work as well). Each vector is treated as a column and elements of the vectors as rows.
- A tibble is a type of data frame that ... makes your life (i.e. data analysis) easier.
- Most often a data frame will be constructed by reading in data from a file, but we can also create one from scratch.

---

## Data frames (cont.)

```r
df <- tibble(
  x = 1:3,
  y = c("a", "b", "c")
)
class(df)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

```r
glimpse(df)
```

```
## Observations: 3
## Variables: 2
## $ x <int> 1, 2, 3
## $ y <chr> "a", "b", "c"
```

---

## Data frames (cont.)

```r
attributes(df)
```

```
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
```

.pull-left[
```r
class(df$x)
```

```
## [1] "integer"
```
]
.pull-right[
```r
class(df$y)
```

```
## [1] "character"
```
]

---

## Working with tibbles in pipelines

.question[
How many of the 60 respondents have a below-average number of cats?
]

```r
# Calculate mean number of cats and store it
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))

# Filter for where number_of_cats is less than mean_cats
cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
```

```
## [1] 60
```

--

.question[
Do you believe this number? Why or why not?
]

---

## A result of a pipeline is always a tibble

```r
mean_cats
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817
```

```r
class(mean_cats)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

---

## `pull()` can be your new best friend

But use it sparingly!

```r
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats)) %>%
  pull()

mean_cats
```

```
## [1] 0.8166667
```

```r
cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
```

```
## [1] 33
```

--

.pull-left[
```r
mean_cats
```

```
## [1] 0.8166667
```
]
.pull-right[
```r
class(mean_cats)
```

```
## [1] "numeric"
```
]

---
class: center, middle

# Factors

---

## Factors

Factor objects are how R stores data for categorical variables (fixed numbers of discrete values).
```r
(x = factor(c("BS", "MS", "PhD", "MS")))
```

```
## [1] BS  MS  PhD MS
## Levels: BS MS PhD
```

```r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

```r
typeof(x)
```

```
## [1] "integer"
```

---

## Read data in as character strings

```r
glimpse(cat_lovers)
```

```
## Observations: 60
## Variables: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <dbl> 0, 0, 1, 3, 3, 2, 1, 1, 0, 0, 0, 0, 1, 3, 3, 2, 1…
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "…
```

---

## But coerce to factor when plotting

```r
ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-23-1.png" width="80%" />

---

## Use forcats to manipulate factors

```r
cat_lovers <- cat_lovers %>%
  mutate(handedness = fct_relevel(
    handedness,
    "right", "left", "ambidextrous"
  ))
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-25-1.png" width="70%" />

---

## Come for the functionality

.pull-left[
... stay for the logo
]
.pull-right[
<img src="img/forcats-part-of-tidyverse.png" width="60%" />
]

- R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors.
- Factors are useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display. The **forcats** package provides a suite of useful tools that solve common problems with factors.
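
A minimal sketch of the ordering point above, assuming a made-up `degree` vector (hypothetical, not part of the course data): `factor()` orders levels alphabetically by default, and `fct_relevel()` overrides that order.

```r
library(forcats)

# Hypothetical data, for illustration only
degree <- c("MS", "PhD", "BS", "MS")

# factor() sorts the levels alphabetically by default
default_levels <- levels(factor(degree))
default_levels   # "BS" "MS" "PhD"

# fct_relevel() moves the named levels to the front, in the order given
degree_fct <- fct_relevel(factor(degree), "PhD", "MS", "BS")
levels(degree_fct)   # "PhD" "MS" "BS"
```

The values are unchanged; only the level order, and therefore the display order in tables and plots, is different.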
.footnote[
Source: [forcats.tidyverse.org](http://forcats.tidyverse.org/)
]

---

## Recap

- Always best to think of data as part of a tibble
    + This plays nicely with the `tidyverse` as well
    + Rows are observations, columns are variables

--

- Be careful about data types / classes
    + Sometimes `R` makes silly assumptions about your data class
    + Using `tibble`s helps, but it might not solve all issues
    + Think about your data in context, e.g. a 0/1 variable is most likely a `factor`
    + If a plot/output is not behaving the way you expect, first investigate the data class
    + If you are absolutely sure of a data class, overwrite it in your tibble so that you don't have to keep track of it
    + `mutate` the variable with the correct class

---
class: center, middle

# Designing effective visualizations

---

## Keep it simple

<img src="img/pie-3d.jpg" width="300" style="display: block; margin: auto;" />

<img src="w4_d1-effective-dataviz_files/figure-html/pie-to-bar-1.png" width="600" style="display: block; margin: auto;" />

---

## Use color to draw attention

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-27-1.png" width="500" style="display: block; margin: auto;" />

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-28-1.png" width="600" style="display: block; margin: auto;" />

---

## Tell a story

<img src="img/time-series.story.png" width="800" style="display: block; margin: auto;" />

.footnote[
Credit: Angela Zoss and Eric Monson, Duke DVS
]

---
class: center, middle

# Principles for effective visualizations

---

## Principles for effective visualizations

- Order matters
- Put long categories on the y-axis
- Keep scales consistent
- Select meaningful colors
- Use meaningful and nonredundant labels

---

## Data

In September 2019, a YouGov survey asked 1,639 GB adults the following question:

> In hindsight, do you think Britain was right/wrong to vote to leave EU?
>
>- Right to leave
>- Wrong to leave
>- Don't know

.footnote[
Source: [YouGov Survey Results](https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x0msmggx08/YouGov%20-%20Brexit%20and%202019%20election.pdf), retrieved Oct 7, 2019
]

---
class: center, middle

# Order matters

---

## Alphabetical order is rarely ideal

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-31-1.png" width="85%" />

---

## Order by frequency

`fct_infreq`: Reorder factor levels by frequency

```r
ggplot(data = brexit, aes(x = fct_infreq(opinion))) +
  geom_bar()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-32-1.png" width="75%" />

---

## Clean up labels

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  labs(x = "Opinion", y = "Count")
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-33-1.png" width="80%" />

---

## Alphabetical order is rarely ideal

```r
ggplot(data = brexit, aes(x = region)) +
  geom_bar()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-34-1.png" width="85%" />

---

## Use inherent level order

`fct_relevel`: Reorder factor levels using a custom order

.midi[
```r
brexit <- brexit %>%
  mutate(
    region = fct_relevel(
      region,
      "london", "rest_of_south", "midlands_wales", "north", "scot"
    )
  )
```
]

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-36-1.png" width="55%" />

---

## Clean up labels

```r
brexit <- brexit %>%
  mutate(
    region = fct_recode(
      region,
      London = "london",
      `Rest of South` = "rest_of_south",
      `Midlands / Wales` = "midlands_wales",
      North = "north",
      Scotland = "scot"
    )
  )
```

---

## Clean up labels (cont.)
```r
ggplot(data = brexit, aes(x = region)) +
  geom_bar()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-38-1.png" width="85%" />

---
class: center, middle

# Put long categories on the y-axis

---

## Long categories can be hard to read

```r
ggplot(data = brexit, aes(x = region)) +
  geom_bar()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-39-1.png" width="85%" />

---

## Move them to the y-axis

```r
ggplot(data = brexit, aes(x = region)) +
  geom_bar() +
  coord_flip()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-40-1.png" width="80%" />

---

## Move them to the y-axis

```r
*ggplot(data = brexit, aes(x = fct_rev(region))) +
  geom_bar() +
  coord_flip()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-41-1.png" width="80%" />

---

## Clean up labels

```r
ggplot(data = brexit, aes(x = fct_rev(region))) +
  geom_bar() +
*  labs(x = "Region", y = "") +
  coord_flip()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-42-1.png" width="80%" />

---
class: center, middle

# Pick a purpose

---

## Segmented bar plots can be hard to read

```r
ggplot(data = brexit, aes(x = region, fill = opinion)) +
  geom_bar() +
  coord_flip()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-43-1.png" width="80%" />

---

## Use facets

```r
ggplot(data = brexit, aes(x = opinion, fill = region)) +
  geom_bar() +
  coord_flip() +
  facet_grid(. ~ region)
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-44-1.png" width="80%" />

---

## Avoid redundancy

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  coord_flip() +
  facet_grid(. ~ region)
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-45-1.png" width="80%" />

---

## Informative labels

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  coord_flip() +
  facet_grid(. ~ region) +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
*    x = "",
*    y = ""
  )
```

---

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-47-1.png" width="90%" />

---

## A bit more info

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  coord_flip() +
  facet_grid(. ~ region) +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
*    subtitle = "YouGov Survey Results, 2-3 September 2019",
*    caption = "Source: https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x0msmggx08/YouGov%20-%20Brexit%20and%202019%20election.pdf",
    x = "",
    y = ""
  )
```

---

![](w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-49-1.png)<!-- -->

---

## Let's do better

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  coord_flip() +
  facet_grid(. ~ region) +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
*    caption = "Source: bit.ly/2lCJZVg",
    x = "",
    y = ""
  )
```

---

![](w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-51-1.png)<!-- -->

---

## Fix up facet labels

```r
ggplot(data = brexit, aes(x = opinion)) +
  geom_bar() +
  coord_flip() +
*  facet_grid(. ~ region, labeller = label_wrap_gen(width = 12)) +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
    caption = "Source: bit.ly/2lCJZVg",
    x = "",
    y = ""
  )
```

---

![](w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-53-1.png)<!-- -->

---
class: center, middle

# Select meaningful colors

---

## Rainbow colors are not always the right choice

```r
ggplot(data = brexit, aes(x = region, fill = opinion)) +
  geom_bar(position = "fill") +
  coord_flip()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-54-1.png" width="85%" />

---

## Viridis scale works well with ordinal data

```r
ggplot(data = brexit, aes(x = region, fill = opinion)) +
  geom_bar(position = "fill") +
  coord_flip() +
  scale_fill_viridis_d()
```

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-55-1.png" width="85%" />

---

## Clean up labels

<img src="w4_d1-effective-dataviz_files/figure-html/unnamed-chunk-56-1.png" width="85%" />
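
---

## Clean up labels (cont.)

The code behind the plot on the previous slide is not shown. One way to get similar labels might look like the sketch below. `brexit_toy` is a hypothetical stand-in with made-up rows (not the YouGov results), and the legend title is an assumption.

```r
library(tibble)
library(ggplot2)

# Hypothetical stand-in for the brexit data: made-up rows, illustration only
brexit_toy <- tibble(
  region  = factor(c("London", "London", "North", "North", "Scotland", "Scotland")),
  opinion = factor(c("Right", "Wrong", "Right", "Don't know", "Wrong", "Wrong"))
)

# Same layers as the viridis plot, plus labs() for the title and legend title
p <- ggplot(data = brexit_toy, aes(x = region, fill = opinion)) +
  geom_bar(position = "fill") +
  coord_flip() +
  scale_fill_viridis_d() +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    x = "",
    y = "",
    fill = "Opinion"
  )

p
```

`fill = "Opinion"` replaces the default legend title, which would otherwise be the raw variable name.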