Tidy data and data wrangling 🔧

# Tidy data and data wrangling <br> 🔧
### Dr. Çetinkaya-Rundel

---

layout: true
  
<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="https://introds.org" target="_blank">introds.org
</a>
</span>
</div>

---

## Week 3

- Preparing for tomorrow's workshop: Check your email tomorrow morning for your lab team assignment, find your teammates when you get in the workshop classroom, and sit together.
- Student hours back to usual time: Tuesdays 14:30 - 16:30
- Piazza: Code as reproducible example + Verbatim error + Informative title 
- New section on website with links to cheat sheets + more...

---

# Coding style

---

## Style guide

>"Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread."
>
>Hadley Wickham

- Style guide for this course is based on the Tidyverse style guide: http://style.tidyverse.org/
- There's more to it than what we'll cover today, but we'll mention more as we 
introduce more functionality, and do a recap later in the semester

---

## File names and code chunk labels

- Do not use spaces in file names or code chunk labels; use `-` or `_` to 
separate words
- Use all lowercase letters

```r
# Good
ucb-admit.csv

# Bad
UCB Admit.csv
```

---

## Object names

- Use `_` to separate words in object names
- Use informative but short object names
- Do not reuse object names within an analysis

```r
# Good
acs_employed

# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males
```

---

## Spacing

- Put a space before and after all infix operators (=, +, -, <-, etc.), and when 
naming arguments in function calls
- Always put a space after a comma, and never before (just like in regular English)

```r
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
```

---

## ggplot2

- Always end a line with `+`
- Always indent the next line

```r
# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()

# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()
```

---

# Tidy data

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way. 
>
>Leo Tolstoy

.pull-left[
**Characteristics of tidy data:**

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
]
--
.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]
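
As a rough sketch of what tidying can look like, the made-up table below stores the variable `year` in its column names, and `pivot_longer()` from **tidyr** reshapes it so that each variable forms a column. The values and the names `untidy`, `tidy`, and `prevalence` are illustrative assumptions, not real data.

```r
library(dplyr)
library(tidyr)

# Made-up values: one column per year, so "year" is hidden in the column names
untidy <- tibble(
  country = c("A", "B"),
  `2010` = c(0.1, 0.3),
  `2011` = c(0.2, 0.4)
)

# One row per country-year observation, one column per variable
tidy <- untidy %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "prevalence")

tidy
```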

---

.footnote[
Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)
]

---

<br>

.footnote[
Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data)
]

---

<br>

.footnote[
Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)
]

---

## Summary tables

.pull-left[

```
## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # … with 77 more rows
```
]
.pull-right[

```
## # A tibble: 5 x 2
##   gender        avg_height
##   <chr>              <dbl>
## 1 female              165.
## 2 hermaphrodite       175 
## 3 male                179.
## 4 none                200 
## 5 <NA>                120
```
]

---

## Displaying data

```r
starwars %>%
  select(name, height, mass)
```

## Summarizing data

```r
starwars %>%
  group_by(gender) %>%
  summarize(
    avg_height = mean(height, na.rm = TRUE) %>% round(2)
  )
```

---

# Data wrangling and summarizing <br> with **dplyr**

---

## A grammar of data wrangling...

... based on the concepts of functions as verbs that manipulate data frames

.pull-left[
<img src="img/dplyr-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
.midi[
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows using index(es)
- `sample_n` / `sample_frac`: randomly sample rows
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarise`: reduce variables to values
- `pull`: grab a column as a vector
- ... (many more)
]
]

---

## Rules of **dplyr** functions

- First argument is *always* a data frame
- Subsequent arguments say what to do with that data frame
- Always return a data frame
- Don't modify in place
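
A minimal sketch of these rules, using a small made-up data frame (`heights` and `sorted` are hypothetical names): the result has to be assigned to a name if we want to keep it, because the input is left unchanged.

```r
library(dplyr)

# Made-up data for illustration
heights <- tibble(name = c("a", "b", "c"), height = c(170, 160, 180))

# First argument is the data frame; the rest say what to do with it
sorted <- arrange(heights, height)

heights$height  # original order is untouched: 170 160 180
sorted$height   # the result is a new data frame: 160 170 180
```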

---

## Bike crashes in NC 2007 - 2014

```r
ncbikecrash <- read_csv("data/ncbikecrash.csv")
```

```r
glimpse(ncbikecrash)
```

```
## Observations: 7,467
## Variables: 53
## $ object_id            <dbl> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1…
## $ city                 <chr> "None - Rural Crash", "Henderson", "None - …
## $ county               <chr> "Wayne", "Vance", "Lincoln", "Columbus", "N…
## $ region               <chr> "Coastal", "Piedmont", "Piedmont", "Coastal…
## $ development          <chr> "Farms, Woods, Pastures", "Residential", "F…
## $ locality             <chr> "Rural (<30% Developed)", "Mixed (30% To 70…
## $ on_road              <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURK…
## $ rural_urban          <chr> "Rural", "Urban", "Rural", "Urban", "Urban"…
## $ speed_limit          <chr> "50 - 55  MPH", "30 - 35  MPH", "50 - 55  M…
## $ traffic_control      <chr> "No Control Present", "Stop Sign", "Double …
## $ weather              <chr> "Clear", "Clear", "Clear", "Rain", "Clear",…
## $ workzone             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ bike_age             <chr> "52", "66", "33", "52", "22", "15", "41", "…
## $ bike_age_group       <chr> "50-59", "60-69", "30-39", "50-59", "20-24"…
## $ bike_alcohol         <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ bike_alcohol_drugs   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bike_direction       <chr> "With Traffic", "With Traffic", "With Traff…
## $ bike_injury          <chr> "B: Evident Injury", "C: Possible Injury", …
## $ bike_position        <chr> "Bike Lane / Paved Shoulder", "Travel Lane"…
## $ bike_race            <chr> "Black", "Black", "White", "Black", "White"…
## $ bike_sex             <chr> "Male", "Male", "Male", "Male", "Female", "…
## $ driver_age           <chr> "34", NA, "37", "55", "25", "17", NA, "50",…
## $ driver_age_group     <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-…
## $ driver_alcohol       <chr> "No", "Missing", "No", "No", "No", "No", "M…
## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ driver_est_speed     <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-1…
## $ driver_injury        <chr> "O: No Injury", "Unknown Injury", "O: No In…
## $ driver_race          <chr> "White", "Unknown/Missing", "Hispanic", "Bl…
## $ driver_sex           <chr> "Male", NA, "Female", "Male", "Male", "Fema…
## $ driver_vehicle_type  <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "…
## $ crash_alcohol        <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ crash_date           <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14D…
## $ crash_day            <chr> "Wednesday", "Wednesday", "Sunday", "Saturd…
## $ crash_group          <chr> "Motorist Overtaking Bicyclist", "Bicyclist…
## $ crash_hour           <dbl> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22…
## $ crash_location       <chr> "Non-Intersection", "Intersection", "Non-In…
## $ crash_month          <chr> "December", "November", "November", "Decemb…
## $ crash_severity       <chr> "B: Evident Injury", "C: Possible Injury", …
## $ crash_time           <time> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13…
## $ crash_type           <chr> "Motorist Overtaking - Undetected Bicyclist…
## $ crash_year           <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ ambulance_req        <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Y…
## $ hit_run              <chr> "No", "Yes", "No", "No", "No", "No", "Yes",…
## $ light_condition      <chr> "Dark - Roadway Not Lighted", NA, "Dark - R…
## $ road_character       <chr> "Straight - Level", "Straight - Level", "St…
## $ road_class           <chr> "State Secondary Route", "Local Street", "U…
## $ road_condition       <chr> "Dry", "Dry", "Dry", "Water (Standing, Movi…
## $ road_configuration   <chr> "Two-Way, Not Divided", "Two-Way, Divided, …
## $ road_defects         <chr> "None", NA, "None", "None", "None", "None",…
## $ road_feature         <chr> "No Special Feature", "T-Intersection", "No…
## $ road_surface         <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth…
## $ num_lanes            <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", …
## $ geo_point            <chr> "35.3336070056, -77.9955023901", "36.315187…
```

---

## Variables

View the names of variables via

```r
names(ncbikecrash)
```

```
##  [1] "object_id"            "city"                 "county"              
##  [4] "region"               "development"          "locality"            
##  [7] "on_road"              "rural_urban"          "speed_limit"         
## [10] "traffic_control"      "weather"              "workzone"            
## [13] "bike_age"             "bike_age_group"       "bike_alcohol"        
## [16] "bike_alcohol_drugs"   "bike_direction"       "bike_injury"         
## [19] "bike_position"        "bike_race"            "bike_sex"            
## [22] "driver_age"           "driver_age_group"     "driver_alcohol"      
## [25] "driver_alcohol_drugs" "driver_est_speed"     "driver_injury"       
## [28] "driver_race"          "driver_sex"           "driver_vehicle_type" 
## [31] "crash_alcohol"        "crash_date"           "crash_day"           
## [34] "crash_group"          "crash_hour"           "crash_location"      
## [37] "crash_month"          "crash_severity"       "crash_time"          
## [40] "crash_type"           "crash_year"           "ambulance_req"       
## [43] "hit_run"              "light_condition"      "road_character"      
## [46] "road_class"           "road_condition"       "road_configuration"  
## [49] "road_defects"         "road_feature"         "road_surface"        
## [52] "num_lanes"            "geo_point"
```

and see detailed descriptions [here](https://introds.org/hw/hw-02/hw-02-bike-crash.html).

---

## Select columns

```r
select(ncbikecrash, county, bike_age)
```

```
## # A tibble: 7,467 x 2
##    county      bike_age
##    <chr>       <chr>   
##  1 Wayne       52      
##  2 Vance       66      
##  3 Lincoln     33      
##  4 Columbus    52      
##  5 New Hanover 22      
##  6 Robeson     15      
##  7 Richmond    41      
##  8 Wake        14      
##  9 Columbus    16      
## 10 Craven      54      
## # … with 7,457 more rows
```

.question[
What if we wanted to select these columns, and then arrange the data in ascending order of biker age?
]

---

## Data wrangling, step-by-step

```r
ncbikecrash %>%
  select(county, bike_age)
```

```r
ncbikecrash %>%
  select(county, bike_age) %>%
  arrange(bike_age)
```

```
## # A tibble: 7,467 x 2
##    county      bike_age
##    <chr>       <chr>   
##  1 New Hanover 0       
##  2 Carteret    1       
##  3 Guilford    1       
##  4 Pitt        10      
##  5 Cumberland  10      
##  6 Carteret    10      
##  7 Hoke        10      
##  8 Martin      10      
##  9 New Hanover 10      
## 10 Onslow      10      
## # … with 7,457 more rows
```

---

# Pipes

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `ncbikecrash`, and pass it to the `select()` function,
]
.pull-right[

```r
*ncbikecrash %>%
  select(county, bike_age) %>%
  arrange(bike_age)
```
]

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `ncbikecrash`, and pass it to the `select()` function,
- then we select the variables `county` and `bike_age`,
]
.pull-right[

```r
ncbikecrash %>%
* select(county, bike_age) %>%
  arrange(bike_age)
```
]

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `ncbikecrash`, and pass it to the `select()` function,
- then we select the variables `county` and `bike_age`,
- and then we arrange the data frame by `bike_age` in ascending order.
]
.pull-right[

```r
ncbikecrash %>%
  select(county, bike_age) %>%
* arrange(bike_age)
```
]

---

## Aside

The pipe operator is implemented in the package **magrittr**, though we don't need to load this package explicitly since **tidyverse** does this for us.

<br>

.pull-left[
<img src="img/magritte.jpg" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/magrittr.jpg" width="100%" style="display: block; margin: auto;" />
]

---

## How does a pipe work?

- Think of the following sequence of actions: find keys, start car, drive to 
campus, park.
- Expressed as a set of nested functions in R pseudocode this would look like:

```r
park(drive(start_car(find("keys")), to = "campus"))
```
- Writing it out using pipes gives it a more natural (and easier to read) 
structure:

```r
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
```
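
To make the pseudocode above runnable, here is a toy sketch in which each step is a hypothetical function that just appends to a text description (note that defining `find()` this way masks `utils::find()` for the session). It suggests that the nested and the piped versions compute the same thing.

```r
library(magrittr)

# Hypothetical stand-ins for the steps; each one adds to a text log
find <- function(item) paste("found", item)
start_car <- function(state) paste(state, "-> started car")
drive <- function(state, to) paste(state, "-> drove to", to)
park <- function(state) paste(state, "-> parked")

# Nested: read from the inside out
nested <- park(drive(start_car(find("keys")), to = "campus"))

# Piped: read from top to bottom
piped <- find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()

identical(nested, piped)
```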

---

## What about other arguments?

Use the dot to

- send results to a function argument other than the first one, or 
- use the previous result for multiple arguments

```r
starwars %>%
  filter(species == "Human") %>%
* lm(mass ~ height, data = .)
```

```
## 
## Call:
## lm(formula = mass ~ height, data = .)
## 
## Coefficients:
## (Intercept)       height  
##     -116.58         1.11
```
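
The `lm()` call above covers the first use. A small sketch of both uses with toy values (the names `odds` and `twice` are made up for illustration):

```r
library(magrittr)

# Send the left-hand side to an argument other than the first:
# equivalent to seq(1, 10, by = 2)
odds <- 2 %>%
  seq(1, 10, by = .)

# Use the left-hand side for several arguments at once:
# equivalent to paste("tidy", "tidy")
twice <- "tidy" %>%
  paste(., .)

odds
twice
```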

---

## A note on piping and layering

- The `%>%` operator used with **dplyr** functions is called the pipe operator: you "pipe" the output of the previous line of code in as the first argument of the next line of code.
- The `+` operator in **ggplot2** functions is used for "layering". This means you create the plot in layers, separated by `+`.
- Many of the styling principles are consistent across `%>%` and `+`:
  - always a space before
  - always a line break after (for pipelines with more than 2 lines)
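
A short sketch that uses both operators in one pipeline, assuming the `starwars` data from **dplyr**; the choice of variables and the name `p` are only for illustration. The wrangling steps are joined with `%>%`, and once the data reach `ggplot()` the layers are joined with `+`.

```r
library(dplyr)
library(ggplot2)

p <- starwars %>%
  # dplyr steps are joined with %>%
  filter(species == "Human") %>%
  # from ggplot() onwards, layers are joined with +
  ggplot(mapping = aes(x = height, y = mass)) +
  geom_point()
```

Printing `p` draws the plot; switching to `%>%` after `ggplot()` is a common slip and results in an error.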

---

# Data wrangling with dplyr

---

## `select` to keep variables

```r
ncbikecrash %>%
* select(locality, speed_limit)
```

```
## # A tibble: 7,467 x 2
##    locality                     speed_limit 
##    <chr>                        <chr>       
##  1 Rural (<30% Developed)       50 - 55  MPH
##  2 Mixed (30% To 70% Developed) 30 - 35  MPH
##  3 Rural (<30% Developed)       50 - 55  MPH
##  4 Urban (>70% Developed)       30 - 35  MPH
##  5 Urban (>70% Developed)       <NA>        
##  6 Rural (<30% Developed)       50 - 55  MPH
##  7 Mixed (30% To 70% Developed) 30 - 35  MPH
##  8 Urban (>70% Developed)       30 - 35  MPH
##  9 Rural (<30% Developed)       30 - 35  MPH
## 10 Urban (>70% Developed)       20 - 25  MPH
## # … with 7,457 more rows
```

---

## `select` to exclude variables

```r
ncbikecrash %>%
* select(-object_id)
```

```
## # A tibble: 7,467 x 52
##    city  county region development locality on_road rural_urban speed_limit
##    <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>       <chr>      
##  1 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural       50 - 55  M…
##  2 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban       30 - 35  M…
##  3 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural       50 - 55  M…
##  4 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban       30 - 35  M…
##  5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban       <NA>       
##  6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural       50 - 55  M…
##  7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural       30 - 35  M…
##  8 Rale… Wake   Piedm… Commercial  Urban (… PERSON… Urban       30 - 35  M…
##  9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban       30 - 35  M…
## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban       20 - 25  M…
## # … with 7,457 more rows, and 44 more variables: traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>,
## #   geo_point <chr>
```

---

## `select` a range of variables

```r
ncbikecrash %>%
* select(city:locality)
```

```
## # A tibble: 7,467 x 5
##    city           county     region  development       locality            
##    <chr>          <chr>      <chr>   <chr>             <chr>               
##  1 None - Rural … Wayne      Coastal Farms, Woods, Pa… Rural (<30% Develop…
##  2 Henderson      Vance      Piedmo… Residential       Mixed (30% To 70% D…
##  3 None - Rural … Lincoln    Piedmo… Farms, Woods, Pa… Rural (<30% Develop…
##  4 Whiteville     Columbus   Coastal Commercial        Urban (>70% Develop…
##  5 Wilmington     New Hanov… Coastal Residential       Urban (>70% Develop…
##  6 None - Rural … Robeson    Coastal Farms, Woods, Pa… Rural (<30% Develop…
##  7 None - Rural … Richmond   Piedmo… Residential       Mixed (30% To 70% D…
##  8 Raleigh        Wake       Piedmo… Commercial        Urban (>70% Develop…
##  9 Whiteville     Columbus   Coastal Residential       Rural (<30% Develop…
## 10 New Bern       Craven     Coastal Residential       Urban (>70% Develop…
## # … with 7,457 more rows
```

---

## `select` variables with certain characteristics

```r
ncbikecrash %>%
* select(starts_with("bike_"))
```

```
## # A tibble: 7,467 x 9
##    bike_age bike_age_group bike_alcohol bike_alcohol_dr… bike_direction
##    <chr>    <chr>          <chr>        <chr>            <chr>         
##  1 52       50-59          No           <NA>             With Traffic  
##  2 66       60-69          No           <NA>             With Traffic  
##  3 33       30-39          No           <NA>             With Traffic  
##  4 52       50-59          Yes          <NA>             <NA>          
##  5 22       20-24          No           <NA>             Facing Traffic
##  6 15       11-15          No           <NA>             With Traffic  
##  7 41       40-49          No           <NA>             Facing Traffic
##  8 14       11-15          No           <NA>             <NA>          
##  9 16       16-19          No           <NA>             Facing Traffic
## 10 54       50-59          No           <NA>             With Traffic  
## # … with 7,457 more rows, and 4 more variables: bike_injury <chr>,
## #   bike_position <chr>, bike_race <chr>, bike_sex <chr>
```

---

## `select` variables with certain characteristics

```r
ncbikecrash %>%
* select(ends_with("age"))
```

```
## # A tibble: 7,467 x 2
##    bike_age driver_age
##    <chr>    <chr>     
##  1 52       34        
##  2 66       <NA>      
##  3 33       37        
##  4 52       55        
##  5 22       25        
##  6 15       17        
##  7 41       <NA>      
##  8 14       50        
##  9 16       32        
## 10 54       69        
## # … with 7,457 more rows
```

---

## Select helpers

- `starts_with()`: Starts with a prefix
- `ends_with()`: Ends with a suffix
- `contains()`: Contains a literal string
- `num_range()`: Matches a numerical range like x01, x02, x03
- `one_of()`: Matches variable names in a character vector
- `everything()`: Matches all variables
- `last_col()`: Select last variable, possibly with an offset
- `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)
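
A brief sketch of a few of these helpers on a made-up data frame; `crashes` and its columns are hypothetical and only loosely mimic `ncbikecrash`.

```r
library(dplyr)

# Made-up data frame for illustration
crashes <- tibble(
  object_id = 1:2,
  bike_age = c("52", "66"),
  bike_sex = c("Male", "Female"),
  driver_age = c("34", NA),
  x01 = c(0, 1),
  x02 = c(1, 0)
)

# Names containing the literal string "sex"
names(select(crashes, contains("sex")))

# Names matching a regular expression
names(select(crashes, matches("^(bike|driver)_age$")))

# x01, x02: prefix "x" followed by a zero-padded number
names(select(crashes, num_range("x", 1:2, width = 2)))

# Move one column to the front, then keep everything else
names(select(crashes, driver_age, everything()))

# The last column only
names(select(crashes, last_col()))
```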

---

## `arrange` in ascending / descending order

.pull-left[

```r
ncbikecrash %>%
  select(ends_with("age")) %>%
* arrange(bike_age)
```

```
## # A tibble: 7,467 x 2
##    bike_age driver_age
##    <chr>    <chr>     
##  1 0        47        
##  2 1        70+       
##  3 1        61        
##  4 10       30        
##  5 10       19        
##  6 10       22        
##  7 10       18        
##  8 10       27        
##  9 10       53        
## 10 10       <NA>      
## # … with 7,457 more rows
```
]
.pull-right[

```r
ncbikecrash %>%
  select(ends_with("age")) %>%
* arrange(desc(bike_age))
```

```
## # A tibble: 7,467 x 2
##    bike_age driver_age
##    <chr>    <chr>     
##  1 9        23        
##  2 9        35        
##  3 9        70+       
##  4 9        41        
##  5 9        53        
##  6 9        18        
##  7 9        45        
##  8 9        19        
##  9 9        70+       
## 10 9        59        
## # … with 7,457 more rows
```
]
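
Note that `bike_age` is stored as a character (`<chr>`) column, so the rows are sorted as text: `10` comes before `9`. One possible way to sort by numeric value, sketched on made-up data and assuming every value is a plain number stored as text (a value such as `70+` would become `NA`):

```r
library(dplyr)

# Made-up ages stored as text, as in ncbikecrash
ages <- tibble(bike_age = c("9", "10", "52", "1"))

# Character sort: compared character by character, so "10" comes before "9"
as_text <- ages %>%
  arrange(bike_age)

# Numeric sort: convert inside arrange(), the column itself stays <chr>
as_number <- ages %>%
  arrange(as.numeric(bike_age))

as_text$bike_age
as_number$bike_age
```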

---

## `slice` for certain row numbers

First five

```r
ncbikecrash %>%
* slice(1:5)
```

```
## # A tibble: 5 x 53
##   object_id city  county region development locality on_road rural_urban
##       <dbl> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      1686 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural      
## 2      1674 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban      
## 3      1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural      
## 4      1687 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban      
## 5      1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban      
## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>,
## #   geo_point <chr>
```

---

## `slice` for certain row numbers

Last five

```r
last_row <- nrow(ncbikecrash)
ncbikecrash %>%
* slice((last_row - 4):last_row)
```

```
## # A tibble: 5 x 53
##   object_id city  county region development locality on_road rural_urban
##       <dbl> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      6989 High… Guilf… Piedm… Residential Urban (… <NA>    Urban      
## 2      6991 Wilm… New H… Coast… Residential Urban (… <NA>    Urban      
## 3      6995 Kins… Lenoir Coast… Commercial  Urban (… <NA>    Urban      
## 4      6998 Faye… Cumbe… Coast… Residential Urban (… <NA>    Urban      
## 5      7000 None… Onslow Coast… Farms, Woo… Rural (… <NA>    Rural      
## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>,
## #   geo_point <chr>
```
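
Two alternative ways to get the last five rows without storing `last_row` first, sketched on a toy data frame. `n()` gives the number of rows inside **dplyr** verbs, and `slice_tail()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data frame with 10 rows
toy <- tibble(id = 1:10)

# n() is the number of rows, so this selects rows 6 to 10
by_hand <- toy %>%
  slice((n() - 4):n())

# Helper available in dplyr 1.0.0 and later
with_helper <- toy %>%
  slice_tail(n = 5)

by_hand$id
with_helper$id
```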

---

## `sample_n` / `sample_frac` for a random sample

- `sample_n`: randomly sample 5 observations

```r
ncbikecrash_n5 <- ncbikecrash %>%
* sample_n(5, replace = FALSE)
dim(ncbikecrash_n5)
```

```
## [1]  5 53
```

- `sample_frac`: randomly sample 20% of observations

```r
ncbikecrash_perc20 <- ncbikecrash %>%
* sample_frac(0.2, replace = FALSE)
dim(ncbikecrash_perc20)
```

```
## [1] 1493   53
```
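
Random samples change from run to run. A minimal sketch of making them reproducible with `set.seed()`, on a toy data frame; the seed value `1234` and the object names are arbitrary choices for illustration.

```r
library(dplyr)

# Toy data frame with 100 rows
toy <- tibble(id = 1:100)

# Setting the same seed before each draw gives the same sample
set.seed(1234)
first_draw <- toy %>%
  sample_n(5, replace = FALSE)

set.seed(1234)
second_draw <- toy %>%
  sample_n(5, replace = FALSE)

identical(first_draw$id, second_draw$id)

# 20% of 100 rows is 20 rows
n_sampled <- toy %>%
  sample_frac(0.2, replace = FALSE) %>%
  nrow()
```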

---

## `filter` to select a subset of rows

Crashes in Durham County

```r
ncbikecrash %>%
* filter(county == "Durham")
```

```
## # A tibble: 340 x 53
##    object_id city  county region development locality on_road rural_urban
##        <dbl> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
##  1      2452 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
##  2      2441 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  3      2466 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  4       549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban      
##  5       598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban      
##  6       603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban      
##  7      3974 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  8      7134 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  9      1670 Durh… Durham Piedm… Commercial  Urban (… INFINI… Urban      
## 10      1773 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## # … with 330 more rows, and 45 more variables: speed_limit <chr>,
## #   traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>,
## #   bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>,
## #   bike_direction <chr>, bike_injury <chr>, bike_position <chr>,
## #   bike_race <chr>, bike_sex <chr>, driver_age <chr>,
## #   driver_age_group <chr>, driver_alcohol <chr>,
## #   driver_alcohol_drugs <chr>, driver_est_speed <chr>,
## #   driver_injury <chr>, driver_race <chr>, driver_sex <chr>,
## #   driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>,
## #   crash_day <chr>, crash_group <chr>, crash_hour <dbl>,
## #   crash_location <chr>, crash_month <chr>, crash_severity <chr>,
## #   crash_time <time>, crash_type <chr>, crash_year <dbl>,
## #   ambulance_req <chr>, hit_run <chr>, light_condition <chr>,
## #   road_character <chr>, road_class <chr>, road_condition <chr>,
## #   road_configuration <chr>, road_defects <chr>, road_feature <chr>,
## #   road_surface <chr>, num_lanes <chr>, geo_point <chr>
```

---

## `filter` for many conditions at once

Crashes in Durham County where biker is 0-5 years old

```r
ncbikecrash %>%
* filter(
*   county == "Durham",
*   bike_age_group == "0-5"
*   )
```

```
## # A tibble: 4 x 53
##   object_id city  county region development locality on_road rural_urban
##       <dbl> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      4062 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## 2       414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban      
## 3      3016 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## 4      1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban      
## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>,
## #   geo_point <chr>
```

---

## Logical operators in R

operator    | definition                   || operator     | definition
------------|------------------------------||--------------|----------------
`<`         | less than                    ||`x`&nbsp;&#124;&nbsp;`y`     | `x` OR `y` 
`<=`        |	less than or equal to        ||`is.na(x)`    | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`   | test if `x` is not `NA`
`>=`        |	greater than or equal to     ||`x %in% y`    | test if `x` is in `y`
`==`        |	exactly equal to             ||`!(x %in% y)` | test if `x` is not in `y`
`!=`        |	not equal to                 ||`!x`          | not `x`
`x & y`     | `x` AND `y`                  ||              |
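
A few of these operators in action inside `filter()`, sketched on a made-up data frame (`crashes` and its values are hypothetical):

```r
library(dplyr)

# Made-up data for illustration
crashes <- tibble(
  county = c("Durham", "Wake", "Orange", "Durham"),
  crash_year = c(2013, 2014, 2012, 2014),
  bike_position = c("Travel Lane", NA, "Sidewalk", NA)
)

# Conditions separated by commas are combined with AND ...
with_comma <- crashes %>%
  filter(county == "Durham", crash_year >= 2014)

# ... which is the same as joining them with &
with_and <- crashes %>%
  filter(county == "Durham" & crash_year >= 2014)

# OR: rows that satisfy at least one condition
with_or <- crashes %>%
  filter(county == "Wake" | crash_year < 2013)

# %in% and !is.na(): county is one of several values, position is not missing
with_in <- crashes %>%
  filter(county %in% c("Wake", "Orange"), !is.na(bike_position))
```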

---

.question[
Fill in the blanks for filtering for crashes **not** in Durham County where crash year is after 2014 and `bike_position` is not `NA`.
]

```r
ncbikecrash %>%
  filter(
    county ____ "Durham", 
    crash_year ____ 2014,
    ____
    )
```

---

.question[
Fill in the blanks for filtering for crashes **not** in Durham County where crash year is after 2014 and `bike_position` is not `NA`.
]

```r
ncbikecrash %>%
  filter(
*   county != "Durham",
*   crash_year > 2014,
*   !is.na(bike_position)
    )
```

```
## # A tibble: 0 x 53
## # … with 53 variables: object_id <dbl>, city <chr>, county <chr>,
## #   region <chr>, development <chr>, locality <chr>, on_road <chr>,
## #   rural_urban <chr>, speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>,
## #   geo_point <chr>
```

---

## `distinct` to filter for unique rows

... and `arrange` to order alphabetically

.pull-left[

```r
ncbikecrash %>% 
* distinct(county) %>%
  arrange(county)
```

```
## # A tibble: 101 x 1
##    county   
##    <chr>    
##  1 Alamance 
##  2 Alexander
##  3 Alleghany
##  4 Anson    
##  5 Ashe     
##  6 Avery    
##  7 Beaufort 
##  8 Bertie   
##  9 Bladen   
## 10 Brunswick
## # … with 91 more rows
```
]
.pull-right[

```r
ncbikecrash %>% 
  select(county, city) %>% 
* distinct() %>%
  arrange(county, city)
```

```
## # A tibble: 391 x 2
##    county    city              
##    <chr>     <chr>             
##  1 Alamance  Alamance          
##  2 Alamance  Burlington        
##  3 Alamance  Elon              
##  4 Alamance  Elon College      
##  5 Alamance  Gibsonville       
##  6 Alamance  Graham            
##  7 Alamance  Green Level       
##  8 Alamance  Mebane            
##  9 Alamance  None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows
```
]
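
As a possible shortcut, `distinct()` also accepts column names directly, which should give the same result as `select()` followed by `distinct()`. A sketch on made-up data (`crashes`, `via_select`, and `direct` are hypothetical names):

```r
library(dplyr)

# Made-up data with one repeated county-city pair
crashes <- tibble(
  county = c("Alamance", "Alamance", "Wake", "Wake"),
  city = c("Elon", "Elon", "Raleigh", "Cary")
)

# Two steps: keep the columns, then drop duplicate rows
via_select <- crashes %>%
  select(county, city) %>%
  distinct() %>%
  arrange(county, city)

# One step: distinct combinations of the named columns
direct <- crashes %>%
  distinct(county, city) %>%
  arrange(county, city)

direct
```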