Data types and recoding 💽

# Data types and recoding <br> 💽
### Dr. Çetinkaya-Rundel

---

layout: true
  
<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="https://introds.org" target="_blank">introds.org
</a>
</span>
</div>

---

## So far this week...

- Data wrangling, transforming, manipulating
- Teams: Please sit with your teams in lectures and in workshops.
- Workshops: You need to be in class during the workshop to get credit for the labs. If you didn't meet your team members yesterday, we'll help you find them today.
- OQ 2: Posted. Due date pushed back one day to Saturday 17:00 if you need extra time.
- HW 2: Due today at 17:00

---

# Data wrangling with dplyr (continued)

---

## `mutate` to create new variables

```r
ncbikecrash %>%
* mutate(
*   crash_week = if_else(condition = crash_day %in% c("Saturday", "Sunday"),
*                        true = "Weekend",
*                        false = "Weekday")
*   ) %>%
  select(crash_day, crash_week)
```

```
## # A tibble: 7,467 x 2
##    crash_day crash_week
##    <chr>     <chr>     
##  1 Wednesday Weekday   
##  2 Wednesday Weekday   
##  3 Sunday    Weekend   
##  4 Saturday  Weekend   
##  5 Thursday  Weekday   
##  6 Tuesday   Weekday   
##  7 Wednesday Weekday   
##  8 Tuesday   Weekday   
##  9 Monday    Weekday   
## 10 Friday    Weekday   
## # … with 7,457 more rows
```

---

## "Save" when you `mutate`

Most often, when you define a new variable with `mutate`, you'll also want to save the resulting data frame, usually by overwriting the original.

```r
*ncbikecrash <- ncbikecrash %>%
  mutate(
    crash_week = if_else(condition = crash_day %in% c("Saturday", "Sunday"), 
                         true = "Weekend", 
                         false = "Weekday")
    )
```

---

## Check before you move on

```r
ncbikecrash %>% 
  distinct(crash_week, crash_day) %>% 
  arrange(crash_week)
```

```
## # A tibble: 7 x 2
##   crash_week crash_day
##   <chr>      <chr>    
## 1 Weekday    Wednesday
## 2 Weekday    Thursday 
## 3 Weekday    Tuesday  
## 4 Weekday    Monday   
## 5 Weekday    Friday   
## 6 Weekend    Sunday   
## 7 Weekend    Saturday
```

---

```r
ncbikecrash %>%
  mutate(crash_time_of_day = case_when(
    crash_hour > 5    & crash_hour <= 11 ~ "morning",
    crash_hour > 11   & crash_hour <= 17 ~ "afternoon",
    crash_hour > 17   & crash_hour <= 23 ~ "evening",
    crash_hour == 24  | crash_hour <= 5  ~ "night",
  )) %>%
  select(crash_hour, crash_time_of_day)
```

```
## # A tibble: 7,467 x 2
##    crash_hour crash_time_of_day
##         <dbl> <chr>            
##  1          6 morning          
##  2         20 evening          
##  3         18 evening          
##  4         18 evening          
##  5         13 afternoon        
##  6         17 afternoon        
##  7         17 afternoon        
##  8          7 morning          
##  9         15 afternoon        
## 10          2 night            
## # … with 7,457 more rows
```

---

## `summarise` to reduce variables to values

```r
ncbikecrash %>%
* summarise(avg_hr = mean(crash_hour))
```

```
## # A tibble: 1 x 1
##   avg_hr
##    <dbl>
## 1   14.7
```

---

## `group_by` to do calculations on groups

```r
ncbikecrash %>%
* group_by(hit_run) %>%
  summarise(avg_hr = mean(crash_hour))
```

```
## # A tibble: 2 x 2
##   hit_run avg_hr
##   <chr>    <dbl>
## 1 No        14.6
## 2 Yes       15.0
```

---

## `count` observations in groups

```r
ncbikecrash %>%
* count(driver_alcohol_drugs)
```

```
## # A tibble: 6 x 2
##   driver_alcohol_drugs                    n
##   <chr>                               <int>
## 1 Missing                                99
## 2 No                                    695
## 3 Yes-Alcohol,  impairment suspected     12
## 4 Yes-Alcohol, no impairment detected     3
## 5 Yes-Drugs, impairment suspected         4
## 6 <NA>                                 6654
```

---

## `count` and then arrange

```r
ncbikecrash %>%
  count(driver_alcohol_drugs) %>%
* arrange(desc(n))
```

```
## # A tibble: 6 x 2
##   driver_alcohol_drugs                    n
##   <chr>                               <int>
## 1 <NA>                                 6654
## 2 No                                    695
## 3 Missing                                99
## 4 Yes-Alcohol,  impairment suspected     12
## 5 Yes-Drugs, impairment suspected         4
## 6 Yes-Alcohol, no impairment detected     3
```

---

## `count` and then arrange, take two

```r
ncbikecrash %>%
* count(driver_alcohol_drugs, sort = TRUE)
```

---

# Data classes and types

---

## Data types in R

* **logical**
* **double**
* **integer**
* **character**
* **lists**
* and some more, but we won't be focusing on those

---

## Logical & character

**logical** - boolean values `TRUE` and `FALSE`

```r
typeof(TRUE)
```

```
## [1] "logical"
```

**character** - character strings

```r
typeof("hello")
```

```
## [1] "character"
```

```r
typeof('world') # but remember, we use double quotes!
```

```
## [1] "character"
```

---

## Double & integer

**double** - floating point numerical values (default numerical type)

```r
typeof(1.335)
```

```
## [1] "double"
```

```r
typeof(7)
```

```
## [1] "double"
```

**integer** - integer numerical values (indicated with an `L`)

```r
typeof(7L)
```

```
## [1] "integer"
```

```r
typeof(1:3)
```

```
## [1] "integer"
```

---

## Lists

**Lists** are one-dimensional objects that can contain any combination of R objects.

```r
mylist <- list("A", 1:4, c(TRUE, FALSE), (1:4)/2)
mylist
```

```
## [[1]]
## [1] "A"
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 0.5 1.0 1.5 2.0
```

```r
str(mylist)
```

```
## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2
```

---

## Named lists

Because of their more complex structure, we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
```

```
## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"
```

```r
names(myotherlist)
```

```
## [1] "A"           "B"           "knock knock"
```

```r
myotherlist$B
```

```
## [1] 1 2 3 4
```
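
Names that contain spaces, like `knock knock`, cannot be typed after `$` as-is. A minimal sketch of two ways to reach such an element, reusing the list defined above:

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")

# Names with spaces need backticks after $ ...
myotherlist$`knock knock`

# ... or a quoted name inside [[ ]]
myotherlist[["knock knock"]]
```

Both calls return the same character string.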

---

## Concatenation

Vectors can be constructed using the `c()` function.

```r
c(1, 2, 3)
```

```
## [1] 1 2 3
```

```r
c("Hello", "World!")
```

```
## [1] "Hello"  "World!"
```

```r
c(1, c(2, c(3)))
```

```
## [1] 1 2 3
```

---

## Coercion

R is a dynamically typed language -- it will happily convert between the various types 
without complaint.

```r
c(1, "Hello")
```

```
## [1] "1"     "Hello"
```

```r
c(FALSE, 3L)
```

```
## [1] 0 3
```

```r
c(1.2, 3L)
```

```
## [1] 1.2 3.0
```
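
The examples above follow a fixed order: when types are mixed in `c()`, R converts to the more general one, going from logical to integer to double to character. A short sketch to check this with `typeof()` (the values are made up for illustration):

```r
# Each vector mixes two types; c() coerces to the more general one
typeof(c(TRUE, 1L))    # logical + integer   -> "integer"
typeof(c(1L, 1.5))     # integer + double    -> "double"
typeof(c(1.5, "a"))    # double  + character -> "character"

# Explicit coercion uses the as.*() family; values that cannot be
# converted become NA, with a warning
as.integer("7")
as.numeric("three")
```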

---

## Missing Values

R uses `NA` to represent missing values in its data structures.

```r
typeof(NA)
```

```
## [1] "logical"
```
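
A plain `NA` is logical, but it is coerced like any other value, and typed versions exist for when the type matters. A small sketch (the vector `x` is only an illustration):

```r
# NA takes on the type of the vector it sits in
typeof(c(1.5, NA))       # "double"
typeof(c("a", NA))       # "character"

# Typed missing values
typeof(NA_integer_)      # "integer"
typeof(NA_real_)         # "double"
typeof(NA_character_)    # "character"

# Test for missingness with is.na(), not with ==
x <- c(1, NA, 3)
x == NA      # NA NA NA: comparing to NA is itself NA
is.na(x)     # FALSE TRUE FALSE
```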

---

## `NA`s are special ❄️s

```r
x <- c(1, 2, 3, 4, NA)
```

```r
mean(x)
```

```
## [1] NA
```

```r
mean(x, na.rm = TRUE)
```

```
## [1] 2.5
```

```r
summary(x)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    1.75    2.50    2.50    3.25    4.00       1
```

---

## Other Special Values

`NaN` - Not a number

`Inf` - Positive infinity

`-Inf` - Negative infinity

```r
pi / 0
```

```
## [1] Inf
```

```r
0 / 0
```

```
## [1] NaN
```

```r
1/0 + 1/0
```

```
## [1] Inf
```
.pull-right[

```r
1/0 - 1/0
```

```
## [1] NaN
```

```r
NaN / NA
```

```
## [1] NaN
```

```r
NaN * NA
```

```
## [1] NaN
```
]
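
To tell these special values apart in your own data, base R has a set of `is.*()` helpers. A minimal sketch on a made-up vector:

```r
x <- c(1, NA, NaN, Inf, -Inf)

is.na(x)        # TRUE for both NA and NaN
is.nan(x)       # TRUE only for NaN
is.infinite(x)  # TRUE for Inf and -Inf
is.finite(x)    # TRUE only for ordinary numbers
```

Note that `is.na()` also flags `NaN`, so use `is.nan()` when the distinction matters.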

---

.question[
What is the type of each of the following vectors? Explain why it has that type. You're welcome to try them out as well.
]

* `c(1, NA+1L, "C")`
* `c(1L / 0, NA)`
* `c(1:3, 5)`
* `c(3L, NaN+1L)`
* `c(NA, TRUE)`

---

## Example: Cat lovers

A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.

```r
cat_lovers <- read_csv("data/cat-lovers.csv")
```

```
## # A tibble: 60 x 3
##    name           number_of_cats handedness
##    <chr>          <chr>          <chr>     
##  1 Bernice Warren 0              left      
##  2 Woodrow Stone  0              left      
##  3 Willie Bass    1              left      
##  4 Tyrone Estrada 3              left      
##  5 Alex Daniels   3              left      
##  6 Jane Bates     2              left      
##  7 Latoya Simpson 1              left      
##  8 Darin Woods    1              left      
##  9 Agnes Cobb     0              left      
## 10 Tabitha Grant  0              left      
## # … with 50 more rows
```

---

## Oh why won't you work?!

```r
cat_lovers %>%
  summarise(mean = mean(number_of_cats))
```

```
## Warning in mean.default(number_of_cats): argument is not numeric or
## logical: returning NA
```

```
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1    NA
```

---

## Oh why won't you still work??!!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
```

```
## Warning in mean.default(number_of_cats, na.rm = TRUE): argument is not
## numeric or logical: returning NA
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

## Take a breath and look at your data

```r
glimpse(cat_lovers)
```

```
## Observations: 60
## Variables: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0",…
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "…
```

---

## Let's take another look

.small[
Of the 60 responses, two `number_of_cats` entries are not plain numbers:

- Row 48, Ginger Clark: `1.5 - honestly I think one of my cats is half human`
- Row 54, Doug Bass: `three`

The other 58 entries are whole numbers from 0 to 4.
]
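
Scrolling through a table works for 60 rows, but not for 60,000. One possible way to locate entries that will not convert to numbers is sketched below. The `responses` tibble is a hypothetical stand-in for `cat_lovers`, and `as_number` is a helper column made up for this example.

```r
library(dplyr)

# Hypothetical stand-in for cat_lovers with the same kind of problem
responses <- tibble(
  name           = c("A", "B", "C", "D"),
  number_of_cats = c("0", "three", "2", "1.5 - half human")
)

# Values that cannot be read as a number become NA under as.numeric();
# suppressWarnings() hides the expected coercion warning
bad_rows <- responses %>%
  mutate(as_number = suppressWarnings(as.numeric(number_of_cats))) %>%
  filter(is.na(as_number))

bad_rows
```

Entries that were already missing in the raw data would show up here too, so check the original text column before recoding.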

---

## Sometimes you need to babysit your respondents

```r
cat_lovers %>%
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass"    ~ 3,
    TRUE                   ~ as.numeric(number_of_cats)
    )) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by
## coercion
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817
```

---

## Always you need to respect data types

```r
cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    ) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817
```

---

## Now that we know what we're doing...

```r
*cat_lovers <- cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    )
```

---

## Moral of the story

- If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
- Go in and investigate your data, apply the fix, *save your data*, live happily ever after.

---

## Vectors vs. lists

```r
x <- c(8,4,7)
```
.small[

```r
x[1]
```

```
## [1] 8
```
]
.small[

```r
x[[1]]
```

```
## [1] 8
```
]
--
.pull-right[
.small[

```r
y <- list(8,4,7)
```
]
.small[

```r
y[2]
```

```
## [[1]]
## [1] 4
```
]
.small[

```r
y[[2]]
```

```
## [1] 4
```
]
]

<br>

**Note:** When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online.
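
The difference between `[` and `[[` matters most for lists: single brackets return a smaller list, double brackets return the element itself. A short sketch using the same `y` as above:

```r
y <- list(8, 4, 7)

class(y[2])     # "list":    single brackets keep the container
class(y[[2]])   # "numeric": double brackets extract the element

# Arithmetic only works on the extracted element
y[[2]] + 1      # 5
# y[2] + 1      # error: non-numeric argument to binary operator
```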

---

# Data "set"

---

## Data "sets" in R

- "set" is in quotation marks because it is not a formal data class
- A tidy data "set" can be one of the following types:
    + `tibble`
    + `data.frame`
- We'll often work with `tibble`s:
    + `readr` package (e.g. `read_csv` function) loads data as a `tibble` by default
    + `tibble`s are part of the tidyverse, so they work well with other packages we are using
    + they make minimal assumptions about your data, so they are less likely to cause hard-to-track bugs in your code

---

## Data frames

- The data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic vectors, but list columns are allowed too). Each vector is treated as a column, and the elements of the vectors as rows.
- A tibble is a type of data frame that ... makes your life (i.e. data analysis) easier.
- Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

```r
df <- tibble(x = 1:3, y = c("a", "b", "c"))
class(df)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

```r
glimpse(df)
```

```
## Observations: 3
## Variables: 2
## $ x <int> 1, 2, 3
## $ y <chr> "a", "b", "c"
```

---

## Data frames (cont.)

```r
attributes(df)
```

```
## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
```

```r
class(df$x)
```

```
## [1] "integer"
```

```r
class(df$y)
```

```
## [1] "character"
```
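
Because a data frame is a list of columns, the list tools from earlier apply to it as well. A minimal sketch, rebuilding the same `df` so the block runs on its own:

```r
library(tibble)

df <- tibble(x = 1:3, y = c("a", "b", "c"))

typeof(df)     # "list": a data frame is a list of columns
length(df)     # 2: one element per column

# Columns can be extracted like list elements
df$y
df[["y"]]

# Single brackets keep the tibble structure
class(df["y"])
```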

---

## Working with tibbles in pipelines

```r
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))

cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
```

```
## [1] 60
```

---

## The result of `summarise()` is always a tibble

```r
mean_cats
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817
```

```r
class(mean_cats)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

---

## `pull()` can be your new best friend

But use it sparingly!

```r
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats)) %>%
  pull()

cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
```

```
## [1] 33
```

```r
mean_cats
```

```
## [1] 0.8166667
```
.pull-right[

```r
class(mean_cats)
```

```
## [1] "numeric"
```
]
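
Called without arguments, `pull()` extracts the last column, so naming the column makes the intent explicit. A small sketch on hypothetical data (`pets`, `owner`, and `cats` are made up for illustration):

```r
library(dplyr)

# Hypothetical data, for illustration only
pets <- tibble(owner = c("A", "B", "C"), cats = c(0, 1, 2))

# Name the column to pull instead of relying on the default
mean_cats <- pets %>%
  summarise(mean_cats = mean(cats)) %>%
  pull(mean_cats)

mean_cats          # a plain number, here 1
class(mean_cats)   # "numeric"

# The number can now be used in a comparison
pets %>%
  filter(cats < mean_cats) %>%
  nrow()           # 1: only owner A is below the mean
```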

---

# Factors

---

## Factors

Factor objects are how R stores data for categorical variables (fixed numbers of discrete values).

```r
(x <- factor(c("BS", "MS", "PhD", "MS")))
```

```
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
```

```r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

```r
typeof(x)
```

```
## [1] "integer"
```
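
`typeof(x)` is `"integer"` because a factor stores integer codes plus a set of level labels. A short sketch of what that means in practice (the factor `f` is made up to show a common pitfall):

```r
x <- factor(c("BS", "MS", "PhD", "MS"))

levels(x)       # "BS" "MS" "PhD": the labels, sorted alphabetically by default
as.integer(x)   # 1 2 3 2: the codes that index into the levels

# Pitfall: converting a factor of number-like labels gives the codes
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1, not 10 20 10
as.numeric(as.character(f))   # 10 20 10
```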

---

## Read data in as character strings

```r
glimpse(cat_lovers)
```

```
## Observations: 60
## Variables: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <dbl> 0, 0, 1, 3, 3, 2, 1, 1, 0, 0, 0, 0, 1, 3, 3, 2, 1…
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "…
```

---

## But coerce when plotting

```r
ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
```

---

## Use forcats to manipulate factors

```r
cat_lovers <- cat_lovers %>%
  mutate(handedness = fct_relevel(handedness, 
                                  "right", "left", "ambidextrous"))
```
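
To see what `fct_relevel()` changes, compare the level order before and after. A minimal sketch on a hypothetical vector standing in for the `handedness` column:

```r
library(forcats)

# Hypothetical vector standing in for the handedness column
handedness <- c("left", "right", "ambidextrous", "right")

# Without forcats, levels default to alphabetical order
levels(factor(handedness))

# fct_relevel() moves the named levels to the front, in the order given
levels(fct_relevel(handedness, "right", "left", "ambidextrous"))
```

The level order is what `ggplot2` uses to order the bars in a bar plot, which is why releveling changes the display.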

---

## Come for the functionality

.pull-left[
... stay for the logo
]
.pull-right[
<img src="img/forcats-part-of-tidyverse.png" width="60%" />
]

- R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors.
- Factors are useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display. The **forcats** package provides a suite of useful tools that solve common problems with factors.

---

## Recap

- Always best to think of data as part of a tibble
    + This plays nicely with the `tidyverse` as well
    + Rows are observations, columns are variables
--
- Be careful about data types / classes
    + Sometimes `R` makes silly assumptions about your data class 
        + Using `tibble`s helps, but it might not solve all issues
        + Think about your data in context, e.g. 0/1 variable is most likely a `factor`
    + If a plot/output is not behaving the way you expect, first
    investigate the data class
    + If you are absolutely sure of a data class, overwrite it in your
    tibble so that you don't have to keep track of it
        + `mutate` the variable with the correct class
