class: center, middle, inverse, title-slide

# Tidy data and data wrangling 🔧
### Dr. Çetinkaya-Rundel

---
layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="https://introds.org" target="_blank">introds.org
</a>
</span>
</div>

---

## Week 3

- Preparing for tomorrow's workshop: Check your email tomorrow morning for your lab team assignment, find your teammates when you get in the workshop classroom, and sit together.
- Student hours back to usual time: Tuesdays 14:30 - 16:30
- Piazza: Code as reproducible example + Verbatim error + Informative title
- New section on website with links to cheat sheets + more...

.question[
.large[
Any questions?
]
]

---
class: center, middle

# Coding style

---

## Style guide

>"Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread."
>
>Hadley Wickham

- Style guide for this course is based on the Tidyverse style guide: http://style.tidyverse.org/
- There's more to it than what we'll cover today, but we'll mention more as we introduce more functionality, and do a recap later in the semester

---

## File names and code chunk labels

- Do not use spaces in file names and code chunk labels, use `-` or `_` to separate words
- Use all lowercase letters

```r
# Good
ucb-admit.csv

# Bad
UCB Admit.csv
```

---

## Object names

- Use `_` to separate words in object names
- Use informative but short object names
- Do not reuse object names within an analysis

```r
# Good
acs_employed

# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males
```

---

## Spacing

- Put a space before and after all infix operators (=, +, -, <-, etc.), and when naming arguments in function calls
- Always put a space after a comma, and never before (just like in regular English)

```r
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
```

---

## ggplot2

- Always end a line with `+`
- Always indent the next line

```r
# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()

# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram() ``` --- class: center, middle # Tidy data --- ## Tidy data >Happy families are all alike; every unhappy family is unhappy in its own way. > >Leo Tolstoy -- .pull-left[ **Characteristics of tidy data:** - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. ] -- .pull-right[ **Characteristics of untidy data:** !@#$%^&*() ] --- ## .question[ What makes this data not tidy? ] <img src="img/untidy-data/hyperwar-airplanes-on-hand.png" width="90%" style="display: block; margin: auto;" /> .footnote[ Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html) ] --- .question[ What makes this data not tidy? ] <br> <img src="img/untidy-data/hiv-est-prevalence-15-49.png" width="95%" style="display: block; margin: auto;" /> .footnote[ Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data) ] --- .question[ What makes this data not tidy? ] <br> <img src="img/untidy-data/us-general-economic-characteristic-acs-2017.png" width="95%" style="display: block; margin: auto;" /> .footnote[ Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt) ] --- ## Summary tables .question[ Is each of the following a dataset or a summary table? ] .midi[ .pull-left[ ``` ## # A tibble: 87 x 3 ## name height mass ## <chr> <int> <dbl> ## 1 Luke Skywalker 172 77 ## 2 C-3PO 167 75 ## 3 R2-D2 96 32 ## 4 Darth Vader 202 136 ## 5 Leia Organa 150 49 ## 6 Owen Lars 178 120 ## 7 Beru Whitesun lars 165 75 ## 8 R5-D4 97 32 ## 9 Biggs Darklighter 183 84 ## 10 Obi-Wan Kenobi 182 77 ## # … with 77 more rows ``` ] .pull-right[ ``` ## # A tibble: 5 x 2 ## gender avg_height ## <chr> <dbl> ## 1 female 165. ## 2 hermaphrodite 175 ## 3 male 179. 
## 4 none 200 ## 5 <NA> 120 ``` ] ] --- ## Displaying data ```r starwars %>% select(name, height, mass) ``` ## Summarizing data ```r starwars %>% group_by(gender) %>% summarize( avg_height = mean(height, na.rm = TRUE) %>% round(2) ) ``` --- class: center, middle # Data wrangling and summarizing # with **dplyr** --- ## A grammar of data wrangling... ... based on the concepts of functions as verbs that manipulate data frames .pull-left[ <img src="img/dplyr-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ .midi[ - `select`: pick columns by name - `arrange`: reorder rows - `slice`: pick rows using index(es) - `sample_n` / `sample_frac`: randomly sample rows - `filter`: pick rows matching criteria - `distinct`: filter for unique rows - `mutate`: add new variables - `summarise`: reduce variables to values - `pull`: grab a column as a vector - ... (many more) ] ] --- ## Rules of **dplyr** functions - First argument is *always* a data frame - Subsequent arguments say what to do with that data frame - Always return a data frame - Don't modify in place --- ## Bike crashes in NC 2007 - 2014 ```r ncbikecrash <- read_csv("data/ncbikecrash.csv") ``` ```r glimpse(ncbikecrash) ``` ``` ## Observations: 7,467 ## Variables: 53 ## $ object_id <dbl> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1… ## $ city <chr> "None - Rural Crash", "Henderson", "None - … ## $ county <chr> "Wayne", "Vance", "Lincoln", "Columbus", "N… ## $ region <chr> "Coastal", "Piedmont", "Piedmont", "Coastal… ## $ development <chr> "Farms, Woods, Pastures", "Residential", "F… ## $ locality <chr> "Rural (<30% Developed)", "Mixed (30% To 70… ## $ on_road <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURK… ## $ rural_urban <chr> "Rural", "Urban", "Rural", "Urban", "Urban"… ## $ speed_limit <chr> "50 - 55 MPH", "30 - 35 MPH", "50 - 55 M… ## $ traffic_control <chr> "No Control Present", "Stop Sign", "Double … ## $ weather <chr> "Clear", "Clear", "Clear", "Rain", "Clear",… ## $ 
workzone <chr> "No", "No", "No", "No", "No", "No", "No", "… ## $ bike_age <chr> "52", "66", "33", "52", "22", "15", "41", "… ## $ bike_age_group <chr> "50-59", "60-69", "30-39", "50-59", "20-24"… ## $ bike_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", … ## $ bike_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ bike_direction <chr> "With Traffic", "With Traffic", "With Traff… ## $ bike_injury <chr> "B: Evident Injury", "C: Possible Injury", … ## $ bike_position <chr> "Bike Lane / Paved Shoulder", "Travel Lane"… ## $ bike_race <chr> "Black", "Black", "White", "Black", "White"… ## $ bike_sex <chr> "Male", "Male", "Male", "Male", "Female", "… ## $ driver_age <chr> "34", NA, "37", "55", "25", "17", NA, "50",… ## $ driver_age_group <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-… ## $ driver_alcohol <chr> "No", "Missing", "No", "No", "No", "No", "M… ## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ driver_est_speed <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-1… ## $ driver_injury <chr> "O: No Injury", "Unknown Injury", "O: No In… ## $ driver_race <chr> "White", "Unknown/Missing", "Hispanic", "Bl… ## $ driver_sex <chr> "Male", NA, "Female", "Male", "Male", "Fema… ## $ driver_vehicle_type <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "… ## $ crash_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", … ## $ crash_date <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14D… ## $ crash_day <chr> "Wednesday", "Wednesday", "Sunday", "Saturd… ## $ crash_group <chr> "Motorist Overtaking Bicyclist", "Bicyclist… ## $ crash_hour <dbl> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22… ## $ crash_location <chr> "Non-Intersection", "Intersection", "Non-In… ## $ crash_month <chr> "December", "November", "November", "Decemb… ## $ crash_severity <chr> "B: Evident Injury", "C: Possible Injury", … ## $ crash_time <time> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13… ## $ crash_type <chr> "Motorist Overtaking - Undetected 
Bicyclist… ## $ crash_year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2… ## $ ambulance_req <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Y… ## $ hit_run <chr> "No", "Yes", "No", "No", "No", "No", "Yes",… ## $ light_condition <chr> "Dark - Roadway Not Lighted", NA, "Dark - R… ## $ road_character <chr> "Straight - Level", "Straight - Level", "St… ## $ road_class <chr> "State Secondary Route", "Local Street", "U… ## $ road_condition <chr> "Dry", "Dry", "Dry", "Water (Standing, Movi… ## $ road_configuration <chr> "Two-Way, Not Divided", "Two-Way, Divided, … ## $ road_defects <chr> "None", NA, "None", "None", "None", "None",… ## $ road_feature <chr> "No Special Feature", "T-Intersection", "No… ## $ road_surface <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth… ## $ num_lanes <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", … ## $ geo_point <chr> "35.3336070056, -77.9955023901", "36.315187… ``` --- ## Variables View the names of variables via ```r names(ncbikecrash) ``` ``` ## [1] "object_id" "city" "county" ## [4] "region" "development" "locality" ## [7] "on_road" "rural_urban" "speed_limit" ## [10] "traffic_control" "weather" "workzone" ## [13] "bike_age" "bike_age_group" "bike_alcohol" ## [16] "bike_alcohol_drugs" "bike_direction" "bike_injury" ## [19] "bike_position" "bike_race" "bike_sex" ## [22] "driver_age" "driver_age_group" "driver_alcohol" ## [25] "driver_alcohol_drugs" "driver_est_speed" "driver_injury" ## [28] "driver_race" "driver_sex" "driver_vehicle_type" ## [31] "crash_alcohol" "crash_date" "crash_day" ## [34] "crash_group" "crash_hour" "crash_location" ## [37] "crash_month" "crash_severity" "crash_time" ## [40] "crash_type" "crash_year" "ambulance_req" ## [43] "hit_run" "light_condition" "road_character" ## [46] "road_class" "road_condition" "road_configuration" ## [49] "road_defects" "road_feature" "road_surface" ## [52] "num_lanes" "geo_point" ``` and see detailed descriptions [here](https://introds.org/hw/hw-02/hw-02-bike-crash.html). 
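---

## Rules of **dplyr** functions, in practice

A minimal sketch of the rules stated earlier, run on a small made-up tibble (`crashes` and its values are hypothetical, not taken from `ncbikecrash`): each verb takes a data frame as its first argument, returns a data frame, and leaves its input unchanged.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
crashes <- tibble(
  county = c("Wake", "Durham", "Wake"),
  crash_year = c(2013, 2012, 2014)
)

# The data frame goes in first (here via the pipe);
# the result is a new data frame
crashes_sorted <- crashes %>%
  select(county, crash_year) %>%
  arrange(crash_year)

is.data.frame(crashes_sorted)

# The original object is not modified in place
crashes$crash_year
```

To keep a result, assign it to a name with `<-`; otherwise it is only printed.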
--- ## Select columns .midi[ ```r select(ncbikecrash, county, bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 Wayne 52 ## 2 Vance 66 ## 3 Lincoln 33 ## 4 Columbus 52 ## 5 New Hanover 22 ## 6 Robeson 15 ## 7 Richmond 41 ## 8 Wake 14 ## 9 Columbus 16 ## 10 Craven 54 ## # … with 7,457 more rows ``` ] -- .question[ What if we wanted to select these columns, and then arrange the data in ascending order of biker age? ] --- ## Data wrangling, step-by-step .midi[ .pull-left[ Select: ```r ncbikecrash %>% select(county, bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 Wayne 52 ## 2 Vance 66 ## 3 Lincoln 33 ## 4 Columbus 52 ## 5 New Hanover 22 ## 6 Robeson 15 ## 7 Richmond 41 ## 8 Wake 14 ## 9 Columbus 16 ## 10 Craven 54 ## # … with 7,457 more rows ``` ] .pull-right[ Select, then arrange: ```r ncbikecrash %>% select(county, bike_age) %>% arrange(bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 New Hanover 0 ## 2 Carteret 1 ## 3 Guilford 1 ## 4 Pitt 10 ## 5 Cumberland 10 ## 6 Carteret 10 ## 7 Hoke 10 ## 8 Martin 10 ## 9 New Hanover 10 ## 10 Onslow 10 ## # … with 7,457 more rows ``` ] ] --- class: center, middle # Pipes --- ## What is a pipe? In programming, a pipe is a technique for passing information from one process to another. -- .pull-left[ - Start with the data frame `ncbikecrash`, and pass it to the `select()` function, ] .pull-right[ ```r *ncbikecrash %>% select(county, bike_age) %>% arrange(bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 New Hanover 0 ## 2 Carteret 1 ## 3 Guilford 1 ## 4 Pitt 10 ## 5 Cumberland 10 ## 6 Carteret 10 ## 7 Hoke 10 ## 8 Martin 10 ## 9 New Hanover 10 ## 10 Onslow 10 ## # … with 7,457 more rows ``` ] --- ## What is a pipe? In programming, a pipe is a technique for passing information from one process to another. 
.pull-left[ - Start with the data frame `ncbikecrash`, and pass it to the `select()` function, - then we select the variables `county` and `bike_age`, ] .pull-right[ ```r ncbikecrash %>% * select(county, bike_age) %>% arrange(bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 New Hanover 0 ## 2 Carteret 1 ## 3 Guilford 1 ## 4 Pitt 10 ## 5 Cumberland 10 ## 6 Carteret 10 ## 7 Hoke 10 ## 8 Martin 10 ## 9 New Hanover 10 ## 10 Onslow 10 ## # … with 7,457 more rows ``` ] --- ## What is a pipe? In programming, a pipe is a technique for passing information from one process to another. .pull-left[ - Start with the data frame `ncbikecrash`, and pass it to the `select()` function, - then we select the variables `county` and `bike_age`, - and then we arrange the data frame by `bike_age` in ascending order. ] .pull-right[ ```r ncbikecrash %>% select(county, bike_age) %>% * arrange(bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 New Hanover 0 ## 2 Carteret 1 ## 3 Guilford 1 ## 4 Pitt 10 ## 5 Cumberland 10 ## 6 Carteret 10 ## 7 Hoke 10 ## 8 Martin 10 ## 9 New Hanover 10 ## 10 Onslow 10 ## # … with 7,457 more rows ``` ] --- ## Aside The pipe operator is implemented in the package **magrittr**, though we don't need to load this package explicitly since **tidyverse** does this for us. <br> -- .question[ Any guesses as to why the package is called magrittr? ] -- .pull-left[ <img src="img/magritte.jpg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="img/magrittr.jpg" width="100%" style="display: block; margin: auto;" /> ] --- ## How does a pipe work? - You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park. 
- Expressed as a set of nested functions in R pseudocode, this would look like:

```r
park(drive(start_car(find("keys")), to = "campus"))
```

- Writing it out using pipes gives it a more natural (and easier to read) structure:

```r
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
```

---

## What about other arguments?

Use the dot to

- send results to a function argument other than the first one, or
- use the previous result for multiple arguments

```r
starwars %>%
  filter(species == "Human") %>%
* lm(mass ~ height, data = .)
```

```
##
## Call:
## lm(formula = mass ~ height, data = .)
##
## Coefficients:
## (Intercept)       height
##     -116.58         1.11
```

---

## A note on piping and layering

- The `%>%` operator in **dplyr** functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
- The `+` operator in **ggplot2** functions is used for "layering". This means you create the plot in layers, separated by `+`.
- Many of the styling principles are consistent across `%>%` and `+`: - always a space before - always a line break after (for pipelines with more than 2 lines) --- class: center, middle # Data wrangling with dplyr --- ## `select` to keep variables ```r ncbikecrash %>% * select(locality, speed_limit) ``` ``` ## # A tibble: 7,467 x 2 ## locality speed_limit ## <chr> <chr> ## 1 Rural (<30% Developed) 50 - 55 MPH ## 2 Mixed (30% To 70% Developed) 30 - 35 MPH ## 3 Rural (<30% Developed) 50 - 55 MPH ## 4 Urban (>70% Developed) 30 - 35 MPH ## 5 Urban (>70% Developed) <NA> ## 6 Rural (<30% Developed) 50 - 55 MPH ## 7 Mixed (30% To 70% Developed) 30 - 35 MPH ## 8 Urban (>70% Developed) 30 - 35 MPH ## 9 Rural (<30% Developed) 30 - 35 MPH ## 10 Urban (>70% Developed) 20 - 25 MPH ## # … with 7,457 more rows ``` --- ## `select` to exclude variables ```r ncbikecrash %>% * select(-object_id) ``` ``` ## # A tibble: 7,467 x 52 ## city county region development locality on_road rural_urban speed_limit ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural 50 - 55 M… ## 2 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban 30 - 35 M… ## 3 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural 50 - 55 M… ## 4 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban 30 - 35 M… ## 5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban <NA> ## 6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural 50 - 55 M… ## 7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural 30 - 35 M… ## 8 Rale… Wake Piedm… Commercial Urban (… PERSON… Urban 30 - 35 M… ## 9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban 30 - 35 M… ## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban 20 - 25 M… ## # … with 7,457 more rows, and 44 more variables: traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, 
bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <dbl>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <dbl>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_lanes <chr>, ## # geo_point <chr> ``` --- ## `select` a range of variables ```r ncbikecrash %>% * select(city:locality) ``` ``` ## # A tibble: 7,467 x 5 ## city county region development locality ## <chr> <chr> <chr> <chr> <chr> ## 1 None - Rural … Wayne Coastal Farms, Woods, Pa… Rural (<30% Develop… ## 2 Henderson Vance Piedmo… Residential Mixed (30% To 70% D… ## 3 None - Rural … Lincoln Piedmo… Farms, Woods, Pa… Rural (<30% Develop… ## 4 Whiteville Columbus Coastal Commercial Urban (>70% Develop… ## 5 Wilmington New Hanov… Coastal Residential Urban (>70% Develop… ## 6 None - Rural … Robeson Coastal Farms, Woods, Pa… Rural (<30% Develop… ## 7 None - Rural … Richmond Piedmo… Residential Mixed (30% To 70% D… ## 8 Raleigh Wake Piedmo… Commercial Urban (>70% Develop… ## 9 Whiteville Columbus Coastal Residential Rural (<30% Develop… ## 10 New Bern Craven Coastal Residential Urban (>70% Develop… ## # … with 7,457 more rows ``` --- ## `select` variables with certain characteristics ```r ncbikecrash %>% * select(starts_with("bike_")) ``` ``` ## # A tibble: 7,467 x 9 ## bike_age bike_age_group bike_alcohol bike_alcohol_dr… bike_direction ## <chr> <chr> <chr> <chr> <chr> ## 1 52 50-59 No <NA> With Traffic ## 2 66 60-69 No <NA> With Traffic ## 3 33 30-39 No 
<NA> With Traffic ## 4 52 50-59 Yes <NA> <NA> ## 5 22 20-24 No <NA> Facing Traffic ## 6 15 11-15 No <NA> With Traffic ## 7 41 40-49 No <NA> Facing Traffic ## 8 14 11-15 No <NA> <NA> ## 9 16 16-19 No <NA> Facing Traffic ## 10 54 50-59 No <NA> With Traffic ## # … with 7,457 more rows, and 4 more variables: bike_injury <chr>, ## # bike_position <chr>, bike_race <chr>, bike_sex <chr> ``` --- ## `select` variables with certain characteristics ```r ncbikecrash %>% * select(ends_with("age")) ``` ``` ## # A tibble: 7,467 x 2 ## bike_age driver_age ## <chr> <chr> ## 1 52 34 ## 2 66 <NA> ## 3 33 37 ## 4 52 55 ## 5 22 25 ## 6 15 17 ## 7 41 <NA> ## 8 14 50 ## 9 16 32 ## 10 54 69 ## # … with 7,457 more rows ``` --- ## Select helpers - `starts_with()`: Starts with a prefix - `ends_with()`: Ends with a suffix - `contains()`: Contains a literal string - `num_range()`: Matches a numerical range like x01, x02, x03 - `one_of()`: Matches variable names in a character vector - `everything()`: Matches all variables - `last_col()`: Select last variable, possibly with an offset - `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text) .footnote[ See help for any of these functions for more info, e.g. `?everything`. 
] --- ## `arrange` in ascending / descending order .pull-left[ ```r ncbikecrash %>% select(ends_with("age")) %>% * arrange(bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## bike_age driver_age ## <chr> <chr> ## 1 0 47 ## 2 1 70+ ## 3 1 61 ## 4 10 30 ## 5 10 19 ## 6 10 22 ## 7 10 18 ## 8 10 27 ## 9 10 53 ## 10 10 <NA> ## # … with 7,457 more rows ``` ] .pull-right[ ```r ncbikecrash %>% select(ends_with("age")) %>% * arrange(desc(bike_age)) ``` ``` ## # A tibble: 7,467 x 2 ## bike_age driver_age ## <chr> <chr> ## 1 9 23 ## 2 9 35 ## 3 9 70+ ## 4 9 41 ## 5 9 53 ## 6 9 18 ## 7 9 45 ## 8 9 19 ## 9 9 70+ ## 10 9 59 ## # … with 7,457 more rows ``` ] --- ## `slice` for certain row numbers First five ```r ncbikecrash %>% * slice(1:5) ``` ``` ## # A tibble: 5 x 53 ## object_id city county region development locality on_road rural_urban ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1686 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural ## 2 1674 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban ## 3 1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural ## 4 1687 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban ## 5 1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban ## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <dbl>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <dbl>, ambulance_req <chr>, hit_run <chr>, ## # 
light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_lanes <chr>, ## # geo_point <chr> ``` --- ## `slice` for certain row numbers Last five ```r last_row <- nrow(ncbikecrash) ncbikecrash %>% * slice((last_row - 4):last_row) ``` ``` ## # A tibble: 5 x 53 ## object_id city county region development locality on_road rural_urban ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 6989 High… Guilf… Piedm… Residential Urban (… <NA> Urban ## 2 6991 Wilm… New H… Coast… Residential Urban (… <NA> Urban ## 3 6995 Kins… Lenoir Coast… Commercial Urban (… <NA> Urban ## 4 6998 Faye… Cumbe… Coast… Residential Urban (… <NA> Urban ## 5 7000 None… Onslow Coast… Farms, Woo… Rural (… <NA> Rural ## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <dbl>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <dbl>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_lanes <chr>, ## # geo_point <chr> ``` --- ## `sample_n` / `sample_frac` for a random sample - `sample_n`: randomly sample 5 observations ```r ncbikecrash_n5 <- ncbikecrash %>% * sample_n(5, replace = 
FALSE)
dim(ncbikecrash_n5)
```

```
## [1]  5 53
```

--

- `sample_frac`: randomly sample 20% of observations

```r
ncbikecrash_perc20 <- ncbikecrash %>%
* sample_frac(0.2, replace = FALSE)
dim(ncbikecrash_perc20)
```

```
## [1] 1493   53
```

---

## `filter` to select a subset of rows

Crashes in Durham County

```r
ncbikecrash %>%
* filter(county == "Durham")
```

```
## # A tibble: 340 x 53
##    object_id city  county region development locality on_road rural_urban
##        <dbl> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>
##  1      2452 Durh… Durham Piedm… Residential Urban (… <NA>    Urban
##  2      2441 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban
##  3      2466 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban
##  4       549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban
##  5       598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban
##  6       603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban
##  7      3974 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban
##  8      7134 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban
##  9      1670 Durh… Durham Piedm… Commercial  Urban (… INFINI… Urban
## 10      1773 Durh… Durham Piedm… Residential Urban (… <NA>    Urban
## # … with 330 more rows, and 45 more variables: speed_limit <chr>,
## #   traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>,
## #   bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>,
## #   bike_direction <chr>, bike_injury <chr>, bike_position <chr>,
## #   bike_race <chr>, bike_sex <chr>, driver_age <chr>,
## #   driver_age_group <chr>, driver_alcohol <chr>,
## #   driver_alcohol_drugs <chr>, driver_est_speed <chr>,
## #   driver_injury <chr>, driver_race <chr>, driver_sex <chr>,
## #   driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>,
## #   crash_day <chr>, crash_group <chr>, crash_hour <dbl>,
## #   crash_location <chr>, crash_month <chr>, crash_severity <chr>,
## #   crash_time <time>, crash_type <chr>, crash_year <dbl>,
## #   ambulance_req <chr>, hit_run <chr>, light_condition <chr>,
## #   road_character <chr>, road_class <chr>, road_condition <chr>,
## #   road_configuration <chr>, road_defects <chr>, road_feature <chr>,
## #   road_surface <chr>, num_lanes <chr>, geo_point <chr>
```

---

## `filter` for many conditions at once

Crashes in Durham County where biker is 0-5 years old

```r
ncbikecrash %>%
* filter(
*   county == "Durham",
*   bike_age_group == "0-5"
* )
```

```
## # A tibble: 4 x 53
##   object_id city  county region development locality on_road rural_urban
##       <dbl> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>
## 1      4062 Durh… Durham Piedm… Residential Urban (… <NA>    Urban
## 2       414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban
## 3      3016 Durh… Durham Piedm… Residential Urban (… <NA>    Urban
## 4      1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban
## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>,
## #   geo_point <chr>
```

---

## Logical operators in R

operator    | definition                   || operator       | definition
------------|------------------------------||----------------|----------------
`<`         | less than                    ||`x` &#124; `y`  | `x` OR `y`
`<=`        | less than or equal to        ||`is.na(x)`      | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`     | test if `x` is not `NA`
`>=`        | greater than or equal to     ||`x %in% y`      | test if `x` is in `y`
`==`        | exactly equal to             ||`!(x %in% y)`   | test if `x` is not in `y`
`!=`        | not equal to                 ||`!x`            | not `x`
`x & y`     | `x` AND `y`                  ||                |

---

.question[
Fill in the blanks for filtering for crashes **not** in Durham County where crash year is after 2014 and `bike_position` is not `NA`.
]

```r
ncbikecrash %>%
  filter(
    county ____ "Durham",
    crash_year ____ 2014,
    ____
  )
```

---

.question[
Fill in the blanks for filtering for crashes **not** in Durham County where crash year is after 2014 and `bike_position` is not `NA`.
]

```r
ncbikecrash %>%
  filter(
*   county != "Durham",
*   crash_year > 2014,
*   !is.na(bike_position)
  )
```

```
## # A tibble: 0 x 53
## # … with 53 variables: object_id <dbl>, city <chr>, county <chr>,
## #   region <chr>, development <chr>, locality <chr>, on_road <chr>,
## #   rural_urban <chr>, speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>,
## #   geo_point <chr>
```

---

## `distinct` to filter for unique rows
... and `arrange` to order alphabetically

.pull-left[
```r
ncbikecrash %>%
* distinct(county) %>%
  arrange(county)
```

```
## # A tibble: 101 x 1
##    county
##    <chr>
##  1 Alamance
##  2 Alexander
##  3 Alleghany
##  4 Anson
##  5 Ashe
##  6 Avery
##  7 Beaufort
##  8 Bertie
##  9 Bladen
## 10 Brunswick
## # … with 91 more rows
```
]
.pull-right[
```r
ncbikecrash %>%
  select(county, city) %>%
* distinct() %>%
  arrange(county, city)
```

```
## # A tibble: 391 x 2
##    county    city
##    <chr>     <chr>
##  1 Alamance  Alamance
##  2 Alamance  Burlington
##  3 Alamance  Elon
##  4 Alamance  Elon College
##  5 Alamance  Gibsonville
##  6 Alamance  Graham
##  7 Alamance  Green Level
##  8 Alamance  Mebane
##  9 Alamance  None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows
```
]
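---

## `distinct` on a toy example

A small sketch of how `distinct` behaves, using a made-up tibble (`toy_crashes` and its values are hypothetical, not taken from `ncbikecrash`): naming columns keeps only their unique combinations, while `.keep_all = TRUE` keeps the first row found for each unique value along with the other columns.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy_crashes <- tibble(
  county = c("Wake", "Durham", "Wake", "Durham"),
  city = c("Raleigh", "Durham", "Cary", "Durham")
)

# Unique counties, ordered alphabetically
unique_counties <- toy_crashes %>%
  distinct(county) %>%
  arrange(county)

# Unique county / city combinations
unique_pairs <- toy_crashes %>%
  distinct(county, city) %>%
  arrange(county, city)

# First row per county, keeping the remaining columns
first_per_county <- toy_crashes %>%
  distinct(county, .keep_all = TRUE)
```

Without `.keep_all = TRUE`, `distinct(county)` drops every column except `county`, which is why the second pipeline on the previous slide selects both columns before calling `distinct()`.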