6 Extract Data

Now that we have our webpage downloaded, it is time to extract the data we want.

To do this, we need to

  1. Inspect elements
  2. Write CSS selectors
  3. Pass our selectors to html_elements()
  4. Use a function to extract specific data from the elements returned by step 3

For step 4, we have already seen html_text2(), which pulls the text content out of an element. Two others we will find useful are html_attr(), which gets attribute values, and html_table(), which turns table elements into dataframes.
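To see how these functions fit together, here is a schematic sketch of the whole pipeline. The URL and CSS selectors below are placeholders, not a working example; substitute your own after inspecting the page.

library(polite)
library(rvest)

page <- 
  bow("https://example.com") |>   # placeholder URL
  scrape()

page |> html_elements("h2.title") |> html_text2()   # text content of the matched elements
page |> html_elements("a") |> html_attr("href")     # attribute values, here link URLs
page |> html_table()                                # every table element as a dataframe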

6.1 Attributes

Attribute values? Why would we want those? Recall that a (anchor) elements have an href attribute that contains a URL. If a page has many links, we can select those elements and then use html_attr() to pull out their URLs. Once we have a vector of URLs, we can scrape those pages!

For example, look at http://www.scrapethissite.com/pages/simple/. This page has links to other pages on the site, and one external link to GitHub. We can extract the links by first getting elements with an href attribute and then pulling the value of those attributes. html_attr() just takes the name of the attribute in quotes, so there is no need to use the square brackets we use for CSS selectors ([href]).

library(polite)
library(rvest)

dat <- 
  bow("http://www.scrapethissite.com/pages/simple/") |> 
  scrape()

dat |> 
  html_elements("[href]") |> 
  html_attr("href")
##  [1] "/static/images/scraper-icon.png"                                          
##  [2] "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css"    
##  [3] "https://fonts.googleapis.com/css?family=Lato:400,700"                     
##  [4] "/static/css/styles.css"                                                   
##  [5] "https://lipis.github.io/flag-icon-css/css/flag-icon.css"                  
##  [6] "/"                                                                        
##  [7] "/pages/"                                                                  
##  [8] "/lessons/"                                                                
##  [9] "/faq/"                                                                    
## [10] "/login/"                                                                  
## [11] "/lessons/"                                                                
## [12] "http://peric.github.io/GetCountries/"                                     
## [13] "https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotify.core.min.css"

Some of the href attributes on this page point to external font and CSS files, which we do not see when viewing the page in a browser. These stylesheet references sit inside link elements, which also have an href attribute. To leave them out and keep only our anchor elements, we can combine two CSS selectors to select anchor elements (a) that have an href attribute, a[href]:

dat |> 
  html_elements("a[href]") |> 
  html_attr("href")
## [1] "/"                                   
## [2] "/pages/"                             
## [3] "/lessons/"                           
## [4] "/faq/"                               
## [5] "/login/"                             
## [6] "/lessons/"                           
## [7] "http://peric.github.io/GetCountries/"

Some links start with “http://” or “https://”, and others with “/”. Links that start with “http://” or “https://” are already complete URLs, so navigating to them is easy. Links that start with a slash are root-relative paths: their addresses start from the root directory of the website. The root directory is typically the URL with nothing after the .com (.org, .edu, etc.), and it is also where we find the robots.txt file. In this example, the root directory is http://www.scrapethissite.com/. To get to “/pages/”, we just prefix it with the root directory: http://www.scrapethissite.com/pages/.

We can write a few lines of code to return full URLs by prefixing the link with the root directory if the first character is a slash.

urls <- 
  dat |> 
  html_elements("a[href]") |> 
  html_attr("href") 

urls_full <-
  ifelse(substr(urls, 1, 1) == "/",
         paste0("http://www.scrapethissite.com", urls),
         urls)

urls_full
## [1] "http://www.scrapethissite.com/"        
## [2] "http://www.scrapethissite.com/pages/"  
## [3] "http://www.scrapethissite.com/lessons/"
## [4] "http://www.scrapethissite.com/faq/"    
## [5] "http://www.scrapethissite.com/login/"  
## [6] "http://www.scrapethissite.com/lessons/"
## [7] "http://peric.github.io/GetCountries/"
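As an aside, the xml2 package that rvest is built on has a url_absolute() function that does this kind of prefixing for us and also handles other flavors of relative paths. A minimal sketch, which should give the same vector as above:

library(xml2)

url_absolute(urls, "http://www.scrapethissite.com/")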

If we had a reason to do so, we could scrape these URLs. In this example, however, we cannot scrape all of these links since some are disallowed by robots.txt:

library(purrr)
map(urls_full, bow)
## [[1]]
## <polite session> http://www.scrapethissite.com/
##     User-agent: polite R package
##     robots.txt: 2 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent
## 
## [[2]]
## <polite session> http://www.scrapethissite.com/pages/
##     User-agent: polite R package
##     robots.txt: 2 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent
## 
## [[3]]
## <polite session> http://www.scrapethissite.com/lessons/
##     User-agent: polite R package
##     robots.txt: 2 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is not scrapable for this user-agent
## 
## [[4]]
## <polite session> http://www.scrapethissite.com/faq/
##     User-agent: polite R package
##     robots.txt: 2 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is not scrapable for this user-agent
## 
## [[5]]
## <polite session> http://www.scrapethissite.com/login/
##     User-agent: polite R package
##     robots.txt: 2 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent
## 
## [[6]]
## <polite session> http://www.scrapethissite.com/lessons/
##     User-agent: polite R package
##     robots.txt: 2 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is not scrapable for this user-agent
## 
## [[7]]
## <polite session> http://peric.github.io/GetCountries/
##     User-agent: polite R package
##     robots.txt: 1 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent
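Rather than reading through each printed session, we could check permissions programmatically. The robotstxt package, which polite uses behind the scenes, has a paths_allowed() function. Here is a sketch of keeping only the scrapable URLs; note that its default bot ("*") is not the same user-agent that polite announces, so in edge cases its answers could differ from the sessions printed above.

library(robotstxt)

# TRUE/FALSE for each URL, according to each site's robots.txt
allowed <- paths_allowed(paths = urls_full)

urls_scrapable <- urls_full[allowed]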

6.2 Tables

Tables are another element we may want to scrape. html_table() returns a list of dataframes made from the table elements in a webpage.

https://scrapethissite.com/pages/forms/ has a table element, and we can scrape it with the code below. We do not need to use html_elements() with a CSS selector before html_table(), though we might want to when a page has several tables and we only need some of them (there is a short sketch of that at the end of this section).

library(polite)
library(rvest)

page_with_table <- 
  bow("https://scrapethissite.com/pages/forms/") |> 
  scrape()

myTable <-
  page_with_table |> 
  html_table()

myTable
## [[1]]
## # A tibble: 25 × 9
##    `Team Name`            Year  Wins Losses `OT Losses` `Win %` `Goals For (GF)`
##    <chr>                 <int> <int>  <int> <lgl>         <dbl>            <int>
##  1 Boston Bruins          1990    44     24 NA            0.55               299
##  2 Buffalo Sabres         1990    31     30 NA            0.388              292
##  3 Calgary Flames         1990    46     26 NA            0.575              344
##  4 Chicago Blackhawks     1990    49     23 NA            0.613              284
##  5 Detroit Red Wings      1990    34     38 NA            0.425              273
##  6 Edmonton Oilers        1990    37     37 NA            0.463              272
##  7 Hartford Whalers       1990    31     38 NA            0.388              238
##  8 Los Angeles Kings      1990    46     24 NA            0.575              340
##  9 Minnesota North Stars  1990    27     39 NA            0.338              256
## 10 Montreal Canadiens     1990    39     30 NA            0.487              273
## # ℹ 15 more rows
## # ℹ 2 more variables: `Goals Against (GA)` <int>, `+ / -` <int>

Even though we only scraped one table, it is in a list. If our list has only one table, we can extract it with myTable[[1]]. If we have several tables that share column names, we can combine them with bind_rows():

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df_list <-
  list(mtcars[1:2, ],
       mtcars[3:4, ],
       mtcars[5:6, ])

df_list
## [[1]]
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
## 
## [[2]]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 
## [[3]]
##                    mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.46 20.22  1  0    3    1
bind_rows(df_list)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
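Putting these pieces together: if the hockey page had held several tables and we only wanted some of them, we could select just those table elements with a CSS selector, convert them with html_table(), and stack the results with bind_rows(). Here is a sketch along those lines; the class in the selector is a made-up placeholder, so inspect the page to find the real one.

page_with_table |> 
  html_elements("table.season-stats") |>   # hypothetical selector for the tables we want
  html_table() |> 
  bind_rows()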