class: center, middle, inverse, title-slide

# Functions and iteration
🤖
### Dr. Çetinkaya-Rundel

---
layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="https://introds.org" target="_blank">introds.org </a>
</span>
</div>

---

## Midterm feedback

- What is your opinion of this class so far?
  - Positive overall, challenging, workload (particularly long readings)
- What do you like best about this class so far?
  - Data examples, teamwork/workshops/labs, data visualization
- Of the topics we have covered so far, which (if any) are you confused about?
  - Data joins, pivoting, different environments for console and R Markdown causing errors like variable not found
- What are the most useful aspects of your workshops for this course?
  - Group work, datasets

---

## Midterm feedback (cont.)

- Are there any aspects of your workshops for this course that could be improved?
  - Physical layout/group seating, technical issues, tutor help
- What are the most useful aspects of your workshops for this course?
  - Actual code and examples, talking through code with classmates
- Are there any aspects of your workshops for this course that could be improved?
  - More activities, more statistics

---

## Midterm feedback (cont.)

- The pace of the class is:
  - About right - 37
  - Too fast - 16
  - Too slow - 1
  - No opinion - 1
  - <Unanswered> - 2
- "Data science is worth learning."
  - TRUE - 53
  - FALSE - 2
  - <Unanswered> - 2

---

## Midterm feedback (cont.)

- Do you have any other comments or suggestions about specific components of this course?
  - Expected more stat heavy -- coming!
  - Prepare for workshops on own first -- we can do that, but all team members need to commit
  - Shorter readings preferred
  - Piazza is helpful
  - Don't see the point in online quizzes since you get to try as many times as you like
  - Would like a second lab

.question[
.large[
Any questions?
]
]

---
class: center, middle

# Edinburgh College of Art Collection

---
---

<img src="img/untitled-1963.png" width="547" />

---

<img src="img/untitled-1963-headers-values.png" width="547" />

---

<img src="img/untitled-1963-to-df.png" width="892" />

---

## <i class="fas fa-laptop"></i> Hands on - UoE Art Collection

- Go to [rstudio.cloud](https://rstudio.cloud/spaces/34062/projects)
- Start the assignment called *Hands on - Functions and iteration*
- Open `01-individual-pieces.R`
- Follow along, and fill in the blanks if you like

---

.small[
```r
# load packages ----------------------------------------------------------------

library(tidyverse)
library(rvest)

# first url --------------------------------------------------------------------

## set url ----
first_info_url <- "https://collections.ed.ac.uk/art/record/22024?highlight=*:*"

## read page at url ----
page <- read_html(first_info_url)

## scrape headers ----
headers <- page %>%
  html_nodes("th") %>%
  html_text()

## scrape values ----
values <- page %>%
  html_nodes("td") %>%
  html_text() %>%
  str_squish()

## put together in a tibble and add link to help keep track ----
tibble(headers, values) %>%
  pivot_wider(names_from = headers, values_from = values) %>%
  add_column(link = first_info_url)
```
]

---
class: center, middle

# Functions

---

## When should you write a function?

--

<img src="img/funct-all-things.png" width="70%" style="display: block; margin: auto;" />

---

## When should you write a function?

Whenever you’ve copied and pasted a block of code more than twice.

<br><br>

--

.question[
How many times will we need to copy and paste the code we developed to scrape additional data on each art piece in the Edinburgh College of Art Collection?
]

---

## Why functions?

- Automate common tasks in a more powerful and general way than copy-and-pasting:
  - You can give a function an evocative name that makes your code easier to understand.
  - As requirements change, you only need to update code in one place, instead of many.
  - You eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another).

--

- Down the line: Improve your reach as a data scientist by writing functions (and packages!) that others use

---

## Inputs

.question[
How many inputs does the following code have?
]

.small[
```r
## set url ----
first_info_url <- "https://collections.ed.ac.uk/art/record/22024?highlight=*:*"

## read page at url ----
page <- read_html(first_info_url)

## scrape headers ----
headers <- page %>%
  html_nodes("th") %>%
  html_text()

## scrape values ----
values <- page %>%
  html_nodes("td") %>%
  html_text() %>%
  str_squish()

## put together in a tibble and add link to help keep track ----
tibble(headers, values) %>%
  pivot_wider(names_from = headers, values_from = values) %>%
  add_column(link = first_info_url)
```
]

---

## Inputs

.question[
How many inputs does the following code have?
]

.small[
```r
## set url ----
*first_info_url <- "https://collections.ed.ac.uk/art/record/22024?highlight=*:*"

## read page at url ----
*page <- read_html(first_info_url)

## scrape headers ----
headers <- page %>%
  html_nodes("th") %>%
  html_text()

## scrape values ----
values <- page %>%
  html_nodes("td") %>%
  html_text() %>%
  str_squish()

## put together in a tibble and add link to help keep track ----
tibble(headers, values) %>%
  pivot_wider(names_from = headers, values_from = values) %>%
* add_column(link = first_info_url)
```
]

---

## Turn your code into a function

- Pick a short but informative **name**, preferably a verb.

<br>
<br>
<br>
<br>

```r
scrape_art_info <-
```

---

## Turn your code into a function

- Pick a short but informative **name**, preferably a verb.
- List inputs, or **arguments**, to the function inside `function`. If we had more inputs, the call would look like `function(x, y, z)`.

<br>

```r
scrape_art_info <- function(x){

}
```

---

## Turn your code into a function

- Pick a short but informative **name**, preferably a verb.
- List inputs, or **arguments**, to the function inside `function`. If we had more inputs, the call would look like `function(x, y, z)`.
- Place the **code** you have developed in the body of the function, a `{` block that immediately follows `function(...)`.

```r
scrape_art_info <- function(x){

  # code we developed earlier to scrape info
  # on single art piece goes here

}
```

---

## <i class="fas fa-laptop"></i> Hands on - UoE Art Collection

- Go to [rstudio.cloud](https://rstudio.cloud/spaces/34062/projects)
- Continue the assignment called *Hands on - Functions and iteration*
- Open `02-functionalize.R`
- Follow along, and fill in the blanks if you like

---

.midi[
```r
scrape_art_info <- function(x){

  # read page at url ----
  page <- read_html(x)

  # scrape headers ----
  headers <- page %>%
    html_nodes("th") %>%
    html_text()

  # scrape values ----
  values <- page %>%
    html_nodes("td") %>%
    html_text() %>%
    str_squish()

  # put together in a tibble and add link to help keep track ----
  tibble(headers, values) %>%
    pivot_wider(names_from = headers, values_from = values) %>%
    add_column(link = x)

}
```
]

---

## Function in action

.midi[
```r
scrape_art_info(uoe_art$link[1]) %>%
  glimpse()
```

```
## Observations: 1
## Variables: 12
## $ Artist             <chr> "[Unknown] Black"
## $ Title              <chr> "Untitled"
## $ Date               <chr> "1963"
## $ Period             <chr> "20th century; 1960s"
## $ Description        <chr> "Portrait of a woman"
## $ Material           <chr> "graphite (mineral)/inorganic material/materi…
## $ Dimensions         <chr> "45.1cm H x 33cm W"
## $ Collection         <chr> "Art Collection; Edinburgh College of Art"
## $ Classification     <chr> "drawings (visual works); pencil drawings"
## $ Signature          <chr> "Signature on drawing reads \"Black\", this i…
## $ `Accession Number` <chr> "EU2907"
## $ link               <chr> "https://collections.ed.ac.uk/art/record/2202…
```
]

<img class="imageoverlay" src="img/untitled-1963.png" style="width:378px;height:300px;border:1">

---

## Function in action

```r
scrape_art_info(uoe_art$link[2]) %>%
  glimpse()
```

```
## Observations: 1
## Variables: 7
## $ Artist             <chr> "A McIntosh"
## $ Title              <chr> "Rooftops"
## $ Material           <chr> "paint (coating)"
## $ Collection         <chr> "Art Collection; Edinburgh College of Art"
## $ Classification     <chr> "paintings (visual works)"
## $ `Accession Number` <chr> "EU2339"
## $ link               <chr> "https://collections.ed.ac.uk/art/record/2146…
```

<img class="imageoverlay" src="img/rooftops.png" style="width:388px;height:250px;border:1">

---

## Function in action

.midi[
```r
scrape_art_info(uoe_art$link[3]) %>%
  glimpse()
```

```
## Observations: 1
## Variables: 9
## $ Artist             <chr> "A Robson"
## $ Title              <chr> "Seated Male Nude"
## $ Date               <chr> "19 Nov 1958"
## $ Period             <chr> "20th century; 1950s"
## $ Description        <chr> "Gained his Diploma in 1959 from ECA."
## $ Material           <chr> "graphite (mineral)/inorganic material/materi…
## $ Collection         <chr> "Art Collection; Edinburgh College of Art"
## $ `Accession Number` <chr> "EU1810"
## $ link               <chr> "https://collections.ed.ac.uk/art/record/2093…
```
]

<img class="imageoverlay" src="img/seated-male-nude.png" style="width:380px;height:250px;border:1">

---

## What goes in / what comes out?

- Functions take input(s) defined in the function definition

.midi[
```r
function([inputs separated by commas]){
  # what to do with those inputs
}
```
]

- By default they return the last value computed in the function

.midi[
```r
scrape_page <- function(x){
  # do a bunch of stuff with the input...

  # return a tibble
  tibble(...)
}
```
]

- You can define more outputs to be returned in a list as well as nice print methods (but we won't go there for now...)

---

.question[
What is going on here?
]

```r
add_2 <- function(x){
  x + 2
  1000
}
```

```r
add_2(3)
```

```
## [1] 1000
```

```r
add_2(10)
```

```
## [1] 1000
```

---

## Naming functions

> "There are only two hard things in Computer Science: cache invalidation and naming things."
- Phil Karlton

--

- Names should be short but clearly evoke what the function does

--

- Names should be verbs, not nouns

--

- Multi-word names should be separated by underscores (`snake_case` as opposed to `camelCase`)

--

- A family of functions should be named similarly (`scrape_page`, `scrape_art_info` OR `str_squish`, `str_trim`, `str_remove` etc.)

--

- Avoid overwriting existing (especially widely used) functions

.small[
```r
# JUST DON'T
mean <- function(x){
  x * 3
}
```
]

---
class: center, middle

# Automation

---

## Define the task

- Goal: Scrape info on all 2909 art pieces in the collection
- So far:

.small[
```r
scrape_art_info(uoe_art$link[1])
scrape_art_info(uoe_art$link[2])
scrape_art_info(uoe_art$link[3])
```
]

- What else do we need to do?
  - Run the `scrape_art_info()` function on all 2909 links
  - Combine the resulting data frames from each run into one giant data frame with 2909 rows

---

## Inputs

.question[
You now have a function that will scrape the relevant info on an art piece given the URL of its individual info page. Where can we get a list of URLs of each of the art pieces in the collection?
]

--

From the data frame you constructed in lab yesterday: `uoe_art$link`

---

## Iteration

.question[
How can we tell R to apply the `scrape_art_info()` function to each link in `uoe_art$link`?
]

--

- Option 1: Write a **for loop**, i.e. explicitly tell R to visit a link, apply the function, store the result, then visit the next link, apply the function, append the result to the stored result from the previous link, and so on and so forth.

--

- Option 2: **Map** the function to each element in the list of links, and let R take care of the storing and appending of results.

--

- We'll go with Option 2!

---

## How does mapping work?

Suppose we have exam 1 and exam 2 scores of 4 students stored in a list...
```r
exam_scores <- list(
  exam1 <- c(80, 90, 70, 50),
  exam2 <- c(85, 83, 45, 60)
)
```

--

...and we find the mean score in each exam

```r
map(exam_scores, mean)
```

```
## [[1]]
## [1] 72.5
##
## [[2]]
## [1] 68.25
```

---

...and suppose we want the results as a numeric (double) vector

```r
map_dbl(exam_scores, mean)
```

```
## [1] 72.50 68.25
```

...or as a character string

```r
map_chr(exam_scores, mean)
```

```
## [1] "72.500000" "68.250000"
```

---

## `map_something`

Functions for looping over an object and returning a value (of a specific type):

* `map()` - returns a list
* `map_lgl()` - returns a logical vector
* `map_int()` - returns an integer vector
* `map_dbl()` - returns a double vector
* `map_chr()` - returns a character vector
* `map_df()` / `map_dfr()` - returns a data frame by row binding
* `map_dfc()` - returns a data frame by column binding
* ...

---

## <i class="fas fa-laptop"></i> Hands on - UoE Art Collection

- Go to [rstudio.cloud](https://rstudio.cloud/spaces/34062/projects)
- Continue the assignment called *Hands on - Functions and iteration*
- Open `03-iterate.R`
- Follow along, and fill in the blanks if you like

---

## Go to each page, scrape art info

- Map the `scrape_art_info()` function
- to each element of `uoe_art$link`
- and return a data frame by row binding

```r
uoe_art_info <- map_df(uoe_art$link, scrape_art_info)
```

---
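## For loop vs. map: a sketch

To make the two iteration options concrete, here is a minimal sketch that runs both on a small offline example. `make_record()` and `accessions` are hypothetical stand-ins for `scrape_art_info()` and `uoe_art$link`, so nothing is scraped.

.small[
```r
library(tidyverse)

# Hypothetical stand-in for scrape_art_info(): builds a one-row tibble
# from an accession number without visiting any web page
make_record <- function(x){
  tibble(accession = x, n_characters = str_length(x))
}

accessions <- c("EU2907", "EU2339", "EU1810")

# Option 1: for loop, storing each result and binding the rows at the end
results <- vector("list", length(accessions))
for (i in seq_along(accessions)) {
  results[[i]] <- make_record(accessions[i])
}
loop_result <- bind_rows(results)

# Option 2: map, letting purrr do the storing and row binding
map_result <- map_dfr(accessions, make_record)

# Check whether both options give the same data frame
identical(loop_result, map_result)
```
]

The map version needs no counter and no pre-allocated storage, which is why it tends to be shorter and easier to read.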
---

## What could go wrong?

```r
uoe_art_info <- map_df(uoe_art$link, scrape_art_info)
```

- This will take a while to run
- If you get `HTTP Error 429 (Too many requests)`, slow down your requests by adding a random wait (sleep) time to your function, so that it pauses before hitting each link

```r
scrape_art_info <- function(x){

  # Sleep for randomly generated number of seconds
  # Generated from a uniform distribution between 0 and 1
* Sys.sleep(runif(1))

  # Rest of your function code goes here...
}
```
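
---

## How much does the wait cost? A sketch

A wait drawn from a uniform distribution between 0 and 1 averages 0.5 seconds, so over 2909 links the sleeping alone adds roughly 2909 x 0.5 = 1455 seconds, or about 24 minutes. The sketch below times the same pattern on a small scale. `scrape_slowly()` and `links` are hypothetical stand-ins, the wait is shortened to at most 0.1 seconds, and no web request is made.

.small[
```r
library(tidyverse)

# Hypothetical stand-in for scrape_art_info(): waits, then returns a
# one-row tibble without making any web request
scrape_slowly <- function(x){
  # Wait time shortened to at most 0.1 seconds to keep the demo quick
  Sys.sleep(runif(1, min = 0, max = 0.1))
  tibble(link = x)
}

# Made-up links for illustration only
links <- str_c("https://example.org/record/", 1:5)

# Time the whole run, waits included
start <- Sys.time()
slow_result <- map_dfr(links, scrape_slowly)
difftime(Sys.time(), start, units = "secs")
```
]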