Webscraping 🕸

# Webscraping <br> 🕸
### Dr. Çetinkaya-Rundel

---

layout: true
  
<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="https://introds.org" target="_blank">introds.org
</a>
</span>
</div>

---

## Week 6

- Web scraping, and automating it via the use of functions and iteration 
(which we'll revisit many times over the rest of the semester)
- Team peer evaluations - due Thursday 17:00 (look for an email from TEAMMATES)
- HW and OQ scores are in gradebook
  - For OQ: 1 means you did it, 0 means you didn't
  - For HW: See repo issues for actual feedback
- Workshop tomorrow - Take a seat where everyone can work together and have access to a computer. If you know you'll be a few minutes late, let your teammates know so they can save a spot for you.

---

# Scraping the web

---

## Scraping the web: what? why?

- Increasing amount of data is available on the web
--

- These data are provided in an unstructured format: you can always copy&paste, 
but it's time-consuming and prone to errors

--
- Web scraping is the process of extracting this information automatically and transform it into a structured dataset

--
- Two different scenarios:
    - Screen scraping: extract data from source code of website, with html 
    parser (easy) or regular expression matching (less easy).
    - Web APIs (application programming interface): website offers a set of 
    structured http requests that return JSON or XML files.

---

# Web Scraping with rvest

---

## Hypertext Markup Language

- Most of the data on the web is still largely available as HTML 
- It is structured (hierarchical / tree based), but it''s often not available in 
a form useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---

## rvest

.pull-left[
- The **rvest** package makes basic processing and manipulation of HTML data straight forward
- It's designed to work with pipelines built with `%>%`
]
.pull-right[
<img src="img/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" />
]

---

## Core rvest functions

- `read_html`   - Read HTML data from a url or character string
- `html_node `  - Select a specified node from HTML document
- `html_nodes`  - Select specified nodes from HTML document
- `html_table`  - Parse an HTML table into a data frame
- `html_text`   - Extract tag pairs' content
- `html_name`   - Extract tags' names
- `html_attrs`  - Extract all of each tag's attributes
- `html_attr`   - Extract tags' attribute value by name

---

## SelectorGadget

.pull-left[
- Open source tool that eases CSS selector generation and discovery
- Easiest to use with the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) 
- Find out more on the [SelectorGadget vignette](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html)
]
.pull-right[
<img src="img/selector-gadget.png" width="456" />
]

---

## Using the SelectorGadget

.pull-left[
- Click on the app logo next to the search bar
- A box will open in the bottom right of the website
]
.pull-right[
<img src="img/selector-gadget.gif" height="250" style="display: block; margin: auto;" />
]

- Click on a page element (it will turn green), SelectorGadget will generate a 
minimal CSS selector for that element, and will highlight (yellow) everything 
that is matched by the selector

--
- Click on a highlighted element to remove it from the selector (red), or 
click on an unhighlighted element to add it to the selector

--
- Through this process of selection and rejection, SelectorGadget helps you come 
up with the appropriate CSS selector for your needs

---

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code, look for the tag `table` tag:
<br>
http://www.imdb.com/chart/top

![imdb_top](img/imdb_top_250.png)

---

## First check if you're allowed!

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## 
 www.imdb.com                      No encoding supplied: defaulting to UTF-8.
```

```
## [1] TRUE
```

vs. e.g.

```r
paths_allowed("http://www.facebook.com")
```

```
## 
 www.facebook.com
```

```
## [1] FALSE
```

---

## <i class="fas fa-laptop"></i> Hands on - IMDB 250

- Go to [rstudio.cloud](https://rstudio.cloud/spaces/34062/projects)  
- Start the assignment called *Hands on - Webscraping*
- Open `01-imdb-250movies.R`
- Follow along, and fill in the blanks if you like

---

## Select and format pieces

```r
page <- read_html("http://www.imdb.com/chart/top")

titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\$", "") %>% # remove (
  str_replace("\$", "") %>% # remove )
  as.numeric()

scores <- page %>%
  html_nodes("#main strong") %>%
  html_text() %>%
  as.numeric()
  
imdb_top_250 <- tibble(
  title = titles, 
  year = years, 
  score = scores
  )
```
]

---

<div id="htmlwidget-a9c05f64a45797303f5a" style="width:100%;height:400px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-a9c05f64a45797303f5a">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60","61","62","63","64","65","66","67","68","69","70","71","72","73","74","75","76","77","78","79","80","81","82","83","84","85","86","87","88","89","90","91","92","93","94","95","96","97","98","99","100","101","102","103","104","105","106","107","108","109","110","111","112","113","114","115","116","117","118","119","120","121","122","123","124","125","126","127","128","129","130","131","132","133","134","135","136","137","138","139","140","141","142","143","144","145","146","147","148","149","150","151","152","153","154","155","156","157","158","159","160","161","162","163","164","165","166","167","168","169","170","171","172","173","174","175","176","177","178","179","180","181","182","183","184","185","186","187","188","189","190","191","192","193","194","195","196","197","198","199","200","201","202","203","204","205","206","207","208","209","210","211","212","213","214","215","216","217","218","219","220","221","222","223","224","225","226","227","228","229","230","231","232","233","234","235","236","237","238","239","240","241","242","243","244","245","246","247","248","249","250"],["The Shawshank Redemption","The Godfather","The Godfather: Part II","The Dark Knight","12 Angry Men","Schindler's List","The Lord of the Rings: The Return of the King","Pulp Fiction","The Good, the Bad and the Ugly","Fight Club","The Lord of the Rings: The Fellowship of the Ring","Forrest Gump","Joker","Inception","Star Wars: Episode V - The Empire Strikes Back","The Lord of the Rings: The Two Towers","The Matrix","One Flew Over the Cuckoo's Nest","Goodfellas","Seven Samurai","Seven","City of God","La vita è bella","The Silence of the Lambs","It's a Wonderful Life","Star Wars: Episode IV - A New Hope","Saving Private Ryan","Spirited Away","The Green Mile","Leon","Interstellar","Harakiri","The Usual Suspects","The Lion King","American History X","Back to the Future","The Pianist","Modern Times (1936)","Terminator 2: Judgment Day","Untouchable","Psycho","Gladiator","City Lights","The Departed","Whiplash","Once Upon a Time in the West","The Prestige","Casablanca","Avengers: Endgame","Grave of the Fireflies","Rear Window","Cinema Paradiso","Alien","Raiders of the Lost Ark","Memento","Apocalypse Now","The Great Dictator","The Lives of Others","Django Unchained","Avengers: Infinity War","Spider-Man: Into the Spider-Verse","The Shining","Paths of Glory","WALL·E","Parasite","Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb","Princess Mononoke","Sunset Blvd.","Oldboy","Witness for the Prosecution","The Dark Knight Rises","Once Upon a Time in America","Aliens","American Beauty","Coco","Kimi no na wa.","Braveheart","Das Boot","3 Idiots","Taare Zameen Par","Star Wars: Episode VI - Return of the Jedi","Toy Story","Reservoir Dogs","Amadeus","Dangal","Inglourious Basterds","Good Will Hunting","M - Eine Stadt sucht einen Mörder","Requiem for a Dream","2001: A Space Odyssey","Vertigo","Eternal Sunshine of the Spotless Mind","Citizen Kane","Full Metal Jacket","The Hunt","Alfred Hitchcock's North by Northwest","A Clockwork Orange","Snatch","The Kid","Amélie","Bicycle Thieves","Scarface","Singin' in the Rain","Lawrence of Arabia","Taxi Driver","Toy Story 3","The Sting","Metropolis","For a Few Dollars More","Ikiru","Double Indemnity","Jodaeiye Nader az Simin","To Kill a Mockingbird","Indiana Jones and the Last Crusade","The Apartment","Capernaum","Up","L.A. Confidential","Monty Python and the Holy Grail","Incendies","Heat","Rashomon","Batman Begins","Yojimbo","Die Hard","Unforgiven","Green Book","Downfall","Bacheha-Ye aseman","Some Like It Hot","Howl's Moving Castle","Drishyam","The Great Escape","My Neighbour Totoro","Ran","A Beautiful Mind","All About Eve","Pan's Labyrinth","Casino","Come and See","El secreto de sus ojos","Lock, Stock and Two Smoking Barrels","Raging Bull","Babam ve Oglum","The Treasure of the Sierra Madre","The Wolf of Wall Street","Judgment at Nuremberg","Three Billboards Outside Ebbing, Missouri","Chinatown","The Gold Rush","Dial M for Murder","Inside Out","There Will Be Blood","V for Vendetta","The Seventh Seal","Warrior","Room","Trainspotting","No Country for Old Men","The Elephant Man","Andhadhun","The Sixth Sense","Shutter Island","The Thing","The Bridge on the River Kwai","Gone with the Wind","Blade Runner","The Third Man","On the Waterfront","Wild Strawberries","Eskiya","Jurassic Park","Finding Nemo","Gran Torino","Fargo","Kill Bill: Vol. 1","The Deer Hunter","Tôkyô monogatari","Stalker","The Big Lebowski","The Truman Show","Relatos salvajes","Mary and Max","Hacksaw Ridge","Gone Girl","In the Name of the Father","Sherlock Jr.","The General","How to Train Your Dragon","Mr. Smith Goes to Washington","The Grand Budapest Hotel","Memories of Murder","Persona","Before Sunrise","Catch Me If You Can","The Red Shoes","Into the Wild","The Wages of Fear","Cool Hand Luke","12 Years a Slave","Network","Prisoners","Monty Python's Life of Brian","Stand by Me","La passion de Jeanne d'Arc","Mad Max: Fury Road","Rang De Basanti","Platoon","Andrei Rublev","Rush","Million Dollar Baby","Hachi: A Dog's Tale","Logan","Barry Lyndon","Ben-Hur","Nausicaä of the Valley of the Wind","Hotel Rwanda","White Heat","Amores Perros","Harry Potter and the Deathly Hallows: Part 2","The 400 Blows","Spotlight","Ace in the Hole","Dead Poets Society","Rebecca","Rocky","Monsters, Inc.","Gangs of Wasseypur","It Happened One Night","La Haine","Ah-ga-ssi","Lagaan: Once Upon a Time in India","The Princess Bride","Munna Bhai M.B.B.S.","PK","Butch Cassidy and the Sundance Kid","Before Sunset","The Help","Swades: We, the People","Akira","The Terminator","In the Mood for Love","Paris, Texas","El ángel exterminador","Guardians of the Galaxy","Aladdin","Infernal Affairs","The Battle of Algiers","Laputa: Castle in the Sky","Throne of Blood"],[1994,1972,1974,2008,1957,1993,2003,1994,1966,1999,2001,1994,2019,2010,1980,2002,1999,1975,1990,1954,1995,2002,1997,1991,1946,1977,1998,2001,1999,1994,2014,1962,1995,1994,1998,1985,2002,1936,1991,2011,1960,2000,1931,2006,2014,1968,2006,1942,2019,1988,1954,1988,1979,1981,2000,1979,1940,2006,2012,2018,2018,1980,1957,2008,2019,1964,1997,1950,2003,1957,2012,1984,1986,1999,2017,2016,1995,1981,2009,2007,1983,1995,1992,1984,2016,2009,1997,1931,2000,1968,1958,2004,1941,1987,2012,1959,1971,2000,1921,2001,1948,1983,1952,1962,1976,2010,1973,1927,1965,1952,1944,2011,1962,1989,1960,2018,2009,1997,1975,2010,1995,1950,2005,1961,1988,1992,2018,2004,1997,1959,2004,2013,1963,1988,1985,2001,1950,2006,1995,1985,2009,1998,1980,2005,1948,2013,1961,2017,1974,1925,1954,2015,2007,2005,1957,2011,2015,1996,2007,1980,2018,1999,2010,1982,1957,1939,1982,1949,1954,1957,1996,1993,2003,2008,1996,2003,1978,1953,1979,1998,1998,2014,2009,2016,2014,1993,1924,1926,2010,1939,2014,2003,1966,1995,2002,1948,2007,1953,1967,2013,1976,2013,1979,1986,1928,2015,2006,1986,1966,2013,2004,2009,2017,1975,1959,1984,2004,1949,2000,2011,1959,2015,1951,1989,1940,1976,2001,2012,1934,1995,2016,2001,1987,2003,2014,1969,2004,2011,2004,1988,1984,2000,1984,1962,2014,1992,2002,1966,1986,1957],[9.2,9.1,9,9,8.9,8.9,8.9,8.9,8.8,8.8,8.8,8.8,8.7,8.7,8.7,8.7,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>title<\/th>\n      <th>year<\/th>\n      <th>score<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"p","pageLength":8,"columnDefs":[{"className":"dt-right","targets":[2,3]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[8,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>

---

## Clean up / enhance

May or may not be a lot of work depending on how messy the data are

- See if you like what you got:

```r
glimpse(imdb_top_250)
```

```
## Observations: 250
## Variables: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfathe…
## $ year  <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 1999…
## $ score <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8…
```

- Add a variable for rank

```r
imdb_top_250 <- imdb_top_250 %>%
  mutate(rank = 1:nrow(imdb_top_250))
```

---

<div id="htmlwidget-ae1b5230d77db8ff9573" style="width:100%;height:400px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-ae1b5230d77db8ff9573">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60","61","62","63","64","65","66","67","68","69","70","71","72","73","74","75","76","77","78","79","80","81","82","83","84","85","86","87","88","89","90","91","92","93","94","95","96","97","98","99","100","101","102","103","104","105","106","107","108","109","110","111","112","113","114","115","116","117","118","119","120","121","122","123","124","125","126","127","128","129","130","131","132","133","134","135","136","137","138","139","140","141","142","143","144","145","146","147","148","149","150","151","152","153","154","155","156","157","158","159","160","161","162","163","164","165","166","167","168","169","170","171","172","173","174","175","176","177","178","179","180","181","182","183","184","185","186","187","188","189","190","191","192","193","194","195","196","197","198","199","200","201","202","203","204","205","206","207","208","209","210","211","212","213","214","215","216","217","218","219","220","221","222","223","224","225","226","227","228","229","230","231","232","233","234","235","236","237","238","239","240","241","242","243","244","245","246","247","248","249","250"],["The Shawshank Redemption","The Godfather","The Godfather: Part II","The Dark Knight","12 Angry Men","Schindler's List","The Lord of the Rings: The Return of the King","Pulp Fiction","The Good, the Bad and the Ugly","Fight Club","The Lord of the Rings: The Fellowship of the Ring","Forrest Gump","Joker","Inception","Star Wars: Episode V - The Empire Strikes Back","The Lord of the Rings: The Two Towers","The Matrix","One Flew Over the Cuckoo's Nest","Goodfellas","Seven Samurai","Seven","City of God","La vita è bella","The Silence of the Lambs","It's a Wonderful Life","Star Wars: Episode IV - A New Hope","Saving Private Ryan","Spirited Away","The Green Mile","Leon","Interstellar","Harakiri","The Usual Suspects","The Lion King","American History X","Back to the Future","The Pianist","Modern Times (1936)","Terminator 2: Judgment Day","Untouchable","Psycho","Gladiator","City Lights","The Departed","Whiplash","Once Upon a Time in the West","The Prestige","Casablanca","Avengers: Endgame","Grave of the Fireflies","Rear Window","Cinema Paradiso","Alien","Raiders of the Lost Ark","Memento","Apocalypse Now","The Great Dictator","The Lives of Others","Django Unchained","Avengers: Infinity War","Spider-Man: Into the Spider-Verse","The Shining","Paths of Glory","WALL·E","Parasite","Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb","Princess Mononoke","Sunset Blvd.","Oldboy","Witness for the Prosecution","The Dark Knight Rises","Once Upon a Time in America","Aliens","American Beauty","Coco","Kimi no na wa.","Braveheart","Das Boot","3 Idiots","Taare Zameen Par","Star Wars: Episode VI - Return of the Jedi","Toy Story","Reservoir Dogs","Amadeus","Dangal","Inglourious Basterds","Good Will Hunting","M - Eine Stadt sucht einen Mörder","Requiem for a Dream","2001: A Space Odyssey","Vertigo","Eternal Sunshine of the Spotless Mind","Citizen Kane","Full Metal Jacket","The Hunt","Alfred Hitchcock's North by Northwest","A Clockwork Orange","Snatch","The Kid","Amélie","Bicycle Thieves","Scarface","Singin' in the Rain","Lawrence of Arabia","Taxi Driver","Toy Story 3","The Sting","Metropolis","For a Few Dollars More","Ikiru","Double Indemnity","Jodaeiye Nader az Simin","To Kill a Mockingbird","Indiana Jones and the Last Crusade","The Apartment","Capernaum","Up","L.A. Confidential","Monty Python and the Holy Grail","Incendies","Heat","Rashomon","Batman Begins","Yojimbo","Die Hard","Unforgiven","Green Book","Downfall","Bacheha-Ye aseman","Some Like It Hot","Howl's Moving Castle","Drishyam","The Great Escape","My Neighbour Totoro","Ran","A Beautiful Mind","All About Eve","Pan's Labyrinth","Casino","Come and See","El secreto de sus ojos","Lock, Stock and Two Smoking Barrels","Raging Bull","Babam ve Oglum","The Treasure of the Sierra Madre","The Wolf of Wall Street","Judgment at Nuremberg","Three Billboards Outside Ebbing, Missouri","Chinatown","The Gold Rush","Dial M for Murder","Inside Out","There Will Be Blood","V for Vendetta","The Seventh Seal","Warrior","Room","Trainspotting","No Country for Old Men","The Elephant Man","Andhadhun","The Sixth Sense","Shutter Island","The Thing","The Bridge on the River Kwai","Gone with the Wind","Blade Runner","The Third Man","On the Waterfront","Wild Strawberries","Eskiya","Jurassic Park","Finding Nemo","Gran Torino","Fargo","Kill Bill: Vol. 1","The Deer Hunter","Tôkyô monogatari","Stalker","The Big Lebowski","The Truman Show","Relatos salvajes","Mary and Max","Hacksaw Ridge","Gone Girl","In the Name of the Father","Sherlock Jr.","The General","How to Train Your Dragon","Mr. Smith Goes to Washington","The Grand Budapest Hotel","Memories of Murder","Persona","Before Sunrise","Catch Me If You Can","The Red Shoes","Into the Wild","The Wages of Fear","Cool Hand Luke","12 Years a Slave","Network","Prisoners","Monty Python's Life of Brian","Stand by Me","La passion de Jeanne d'Arc","Mad Max: Fury Road","Rang De Basanti","Platoon","Andrei Rublev","Rush","Million Dollar Baby","Hachi: A Dog's Tale","Logan","Barry Lyndon","Ben-Hur","Nausicaä of the Valley of the Wind","Hotel Rwanda","White Heat","Amores Perros","Harry Potter and the Deathly Hallows: Part 2","The 400 Blows","Spotlight","Ace in the Hole","Dead Poets Society","Rebecca","Rocky","Monsters, Inc.","Gangs of Wasseypur","It Happened One Night","La Haine","Ah-ga-ssi","Lagaan: Once Upon a Time in India","The Princess Bride","Munna Bhai M.B.B.S.","PK","Butch Cassidy and the Sundance Kid","Before Sunset","The Help","Swades: We, the People","Akira","The Terminator","In the Mood for Love","Paris, Texas","El ángel exterminador","Guardians of the Galaxy","Aladdin","Infernal Affairs","The Battle of Algiers","Laputa: Castle in the Sky","Throne of Blood"],[1994,1972,1974,2008,1957,1993,2003,1994,1966,1999,2001,1994,2019,2010,1980,2002,1999,1975,1990,1954,1995,2002,1997,1991,1946,1977,1998,2001,1999,1994,2014,1962,1995,1994,1998,1985,2002,1936,1991,2011,1960,2000,1931,2006,2014,1968,2006,1942,2019,1988,1954,1988,1979,1981,2000,1979,1940,2006,2012,2018,2018,1980,1957,2008,2019,1964,1997,1950,2003,1957,2012,1984,1986,1999,2017,2016,1995,1981,2009,2007,1983,1995,1992,1984,2016,2009,1997,1931,2000,1968,1958,2004,1941,1987,2012,1959,1971,2000,1921,2001,1948,1983,1952,1962,1976,2010,1973,1927,1965,1952,1944,2011,1962,1989,1960,2018,2009,1997,1975,2010,1995,1950,2005,1961,1988,1992,2018,2004,1997,1959,2004,2013,1963,1988,1985,2001,1950,2006,1995,1985,2009,1998,1980,2005,1948,2013,1961,2017,1974,1925,1954,2015,2007,2005,1957,2011,2015,1996,2007,1980,2018,1999,2010,1982,1957,1939,1982,1949,1954,1957,1996,1993,2003,2008,1996,2003,1978,1953,1979,1998,1998,2014,2009,2016,2014,1993,1924,1926,2010,1939,2014,2003,1966,1995,2002,1948,2007,1953,1967,2013,1976,2013,1979,1986,1928,2015,2006,1986,1966,2013,2004,2009,2017,1975,1959,1984,2004,1949,2000,2011,1959,2015,1951,1989,1940,1976,2001,2012,1934,1995,2016,2001,1987,2003,2014,1969,2004,2011,2004,1988,1984,2000,1984,1962,2014,1992,2002,1966,1986,1957],[9.2,9.1,9,9,8.9,8.9,8.9,8.9,8.8,8.8,8.8,8.8,8.7,8.7,8.7,8.7,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8],[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>title<\/th>\n      <th>year<\/th>\n      <th>score<\/th>\n      <th>rank<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"p","pageLength":8,"columnDefs":[{"className":"dt-right","targets":[2,3,4]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[8,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>

---

## Analyze

```r
imdb_top_250 %>% 
  filter(year == 1995)
```

```
## # A tibble: 8 x 4
##   title               year score  rank
##   <chr>              <dbl> <dbl> <int>
## 1 Seven               1995   8.6    21
## 2 The Usual Suspects  1995   8.5    33
## 3 Braveheart          1995   8.3    77
## 4 Toy Story           1995   8.3    82
## 5 Heat                1995   8.2   121
## 6 Casino              1995   8.2   139
## 7 Before Sunrise      1995   8.1   194
## 8 La Haine            1995   8     230
```

---

## Analyze

.question[
How would you go about answering this question: Which years have the most movies on the list?
]

```r
imdb_top_250 %>% 
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)
```

```
## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1  1995     8
## 2  1957     7
## 3  2004     7
## 4  2014     7
## 5  2000     6
```

---

## Visualize

.question[
How would you go about creating this visualization: Visualize the average yearly score for movies that made it on the top 250 list over time.
]

## Potential challenges

- Unreliable formatting at the source
- Data broken into many pages
- ...

.question[
Compare the display of information at [gumtree.com/cars-vans-motorbikes/edinburgh](https://www.gumtree.com/cars-vans-motorbikes/edinburgh) to the list on the IMDB top 250 list. What challenges can you foresee in scraping a list of the available apartments?
]

---

## <i class="fas fa-laptop"></i> Hands on - IMDB TV Shows

- Continue the assignment in [rstudio.cloud](https://rstudio.cloud/spaces/34062/projects): *Hands on - Webscraping*
- Open `02-imdb-tvshows.R`
- Scrape the names, scores, and years of most popular TV shows on IMDB:
[www.imdb.com/chart/tvmeter](http://www.imdb.com/chart/tvmeter)
- Create a data frame called `tvshows` with four variables: `rank`, `name`, `score`, `year` 
- Examine each of the **first three** TV shows to also obtain 
.midi[
  - Genre
- Runtime
- How many episodes so far
- First five plot keywords
]
- Add this information to the `tvshows` data frame you created earlier