4 Scraping the Web

Now that we have some understanding of HTML and CSS, meaning that we can specify the information we want to extract from a website, it is finally time to actually start scraping the web!

We can scrape in five steps:

  1. Get permission
  2. Download a page
  3. Extract data: Inspect elements and use CSS selectors
  4. Wrangle scraped data
  5. Scale up to multiple pages

4.1 Get Permission

The first step in any web scraping project is to make sure we are following the rules. We can use bow() to check the robots.txt file, as we saw previously. You will have to go through the Terms of Service and consider copyright and privacy on your own, though.

bow() will only let us scrape a site if the robots.txt file allows. Recall that https://www.google.com/robots.txt contains these lines, meaning that we are not allowed to scrape the URL google.com/search:

```
User-agent: *
Disallow: /search
```

Here is the message after we bow() to this URL:

bow("http://www.google.com/search")
## <polite session> http://www.google.com/search
##     User-agent: polite R package
##     robots.txt: 316 rules are defined for 4 bots
##    Crawl delay: 5 sec
##   The path is not scrapable for this user-agent

The last line tells us we cannot scrape it. If we do try anyways, we will not be able to do so:

bow("http://www.google.com/search") |> 
  scrape()
## Warning: No scraping allowed here!
## NULL

On the other hand, we should be able to scrape https://scrapethissite.com/pages/simple since this is its robots.txt file:

```
User-agent: *
Disallow: /lessons/
Disallow: /faq/
```

If we try bow() to it this is the result:

bow("https://scrapethissite.com/pages/simple")
## <polite session> https://scrapethissite.com/pages/simple
##     User-agent: polite R package
##     robots.txt: 2 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent

And we can scrape it without error:

dat <-
  bow("https://scrapethissite.com/pages/simple") |> 
  scrape()

Nice, now we do not have to read the robots.txt file!

4.2 Download

As we just saw, after using bow(), the next step is to scrape(). Scraping a single page only takes two functions:

dat <-
  bow("https://scrapethissite.com/pages/simple") |> 
  scrape()

Now we have this data downloaded as a data object, dat. This intermediate step of storing the webpage as an object, rather than proceeding to data extraction in the same pipe, allows us to scrape our offline copy again and again without sending repeated requests to the website while we experiment with CSS selectors.

4.3 Extract Data

Now that we have our webpage downloaded, it is time to pull out the data we want by writing CSS selectors that we pass to html_elements(). After we have our elements selected, we can then extract specific data from these elements.

We have already seen html_text2(), which pulls the content from the element. Two others we will find useful are html_attr(), which gets attribute values, and html_table(), which turns table elements into dataframes.

4.3.1 Attributes

Attribute values? Why would we want to get those? Recall that a elements have an href attribute that contains a URL. We can use html_attr() to pull these URLs, and then scrape these pages! If we have a page with links to many other pages, we can easily scrape all of them.

We can see this in action with http://www.scrapethissite.com/pages/simple/. First, take a look at this page and all the links it contains: mostly other pages on the site, and one external link to GitHub. We can pull out the link by first getting elements with an href attribute and then pull the value of those attributes. html_attr() just takes the name of the attribute in quotes, so there is no need to use the square brackets we use for CSS selectors ([href]).

library(polite)
library(rvest)

dat <- 
  bow("http://www.scrapethissite.com/pages/simple/") |> 
  scrape()

dat |> 
  html_elements("[href]") |> 
  html_attr("href")
##  [1] "/static/images/scraper-icon.png"                                          
##  [2] "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css"    
##  [3] "https://fonts.googleapis.com/css?family=Lato:400,700"                     
##  [4] "/static/css/styles.css"                                                   
##  [5] "https://lipis.github.io/flag-icon-css/css/flag-icon.css"                  
##  [6] "/"                                                                        
##  [7] "/pages/"                                                                  
##  [8] "/lessons/"                                                                
##  [9] "/faq/"                                                                    
## [10] "/login/"                                                                  
## [11] "/lessons/"                                                                
## [12] "http://peric.github.io/GetCountries/"                                     
## [13] "https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotify.core.min.css"

Some of the href attributes in this page were links to external CSS documents, which we did not see when viewing the page in the browser, though we did see the styles they produced! These style links were within link elements, which also have an href attribute. To leave these links out and just keep our anchor elements, we can combine two CSS selectors to get anchor elements (a) with href attributes: a[href].

Note that some links start with “https://” (or “http://”), and others with “/”. Navigating to links starting with “https://” is easy enough. In links starting with a slash, these addresses start from the root directory of the website and they are called root-relative paths. The root directory is typically a URL with nothing after the .com (.org, .edu, etc.), and it is also where we find the robots.txt file. In this example, the root directory is http://www.scrapethissite.com/. To get to “/pages”, we just prefix it with the root directory: http://www.scrapethissite.com/pages.

We can write a few lines of code to return full URLs by checking if the first character is a slash (substr(urls, 1, 1) == "/"), and if so, prefix it with the root directory (paste0("http://www.scrapethissite.com", urls)). We will also modify the CSS selector to only pick anchor elements, too (html_elements("a[href]")).

urls <- 
  dat |> 
  html_elements("a[href]") |> 
  html_attr("href") 

urls_full <-
  ifelse(substr(urls, 1, 1) == "/",
         paste0("http://www.scrapethissite.com", urls),
         urls)

urls_full
## [1] "http://www.scrapethissite.com/"        
## [2] "http://www.scrapethissite.com/pages/"  
## [3] "http://www.scrapethissite.com/lessons/"
## [4] "http://www.scrapethissite.com/faq/"    
## [5] "http://www.scrapethissite.com/login/"  
## [6] "http://www.scrapethissite.com/lessons/"
## [7] "http://peric.github.io/GetCountries/"

If we had a reason to do so, we could scrape these URLs. In this example, however, we cannot scrape all of these links since some are disallowed by robots.txt.

4.3.2 Tables

Tables are another element we may want to scrape. html_table() returns a list of dataframes made from the table elements in a webpage.

https://scrapethissite.com/pages/forms/ has a table element, and we can scrape it with the code below. We do not need to use html_elements() with a CSS selector before html_table(), but we might want to in some situations where we only want one of several tables.

library(polite)

page_with_table <- 
  bow("https://scrapethissite.com/pages/forms/") |> 
  scrape()
library(rvest)

myTable <-
  page_with_table |> 
  html_table()

myTable
## [[1]]
## # A tibble: 25 × 9
##    `Team Name`            Year  Wins Losses `OT Losses` `Win %` `Goals For (GF)`
##    <chr>                 <int> <int>  <int> <lgl>         <dbl>            <int>
##  1 Boston Bruins          1990    44     24 NA            0.55               299
##  2 Buffalo Sabres         1990    31     30 NA            0.388              292
##  3 Calgary Flames         1990    46     26 NA            0.575              344
##  4 Chicago Blackhawks     1990    49     23 NA            0.613              284
##  5 Detroit Red Wings      1990    34     38 NA            0.425              273
##  6 Edmonton Oilers        1990    37     37 NA            0.463              272
##  7 Hartford Whalers       1990    31     38 NA            0.388              238
##  8 Los Angeles Kings      1990    46     24 NA            0.575              340
##  9 Minnesota North Stars  1990    27     39 NA            0.338              256
## 10 Montreal Canadiens     1990    39     30 NA            0.487              273
## # ℹ 15 more rows
## # ℹ 2 more variables: `Goals Against (GA)` <int>, `+ / -` <int>

Even though we only scraped one table, it is in a list. To pull one table out of a list is simple enough, dat[[1]]. If we have several tables with the same columns and we want to combine them, bind_rows() from the dplyr package can be used since it accepts a list of dataframes:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df_list <-
  list(mtcars[1:2, ],
       mtcars[3:4, ],
       mtcars[5:6, ])

df_list
## [[1]]
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
## 
## [[2]]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 
## [[3]]
##                    mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.46 20.22  1  0    3    1
bind_rows(df_list)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

4.4 Wrangle

The data we scrape is bound to be a mess, and we need to work it into a usable format, usually a dataframe. This process is called data wrangling, which is not covered here, but, again, see our materials online and enroll in our training.

As a simple example, though, we can take the data from https://scrapethissite.com/pages/simple/ and turn it into a dataframe, with one row per country that contains the country name, capital, population, and area. Note that the approach below will only work if each country has all four values, since having more or fewer will cause the table values to be misaligned.

dat <-
  bow("https://scrapethissite.com/pages/simple/") |> 
  scrape()

Country <- dat |> html_elements(".country-name") |> html_text2()
Capital <- dat |> html_elements(".country-capital") |> html_text2()
Population <- dat |> html_elements(".country-population") |> html_text2()
Area <- dat |> html_elements(".country-area") |> html_text2()

dat_clean <-
  data.frame(Country = Country,
             Capital = Capital,
             Population = as.numeric(Population),
             Area = as.numeric(Area))

head(dat_clean)
##                Country          Capital Population   Area
## 1              Andorra Andorra la Vella      84000    468
## 2 United Arab Emirates        Abu Dhabi    4975593  82880
## 3          Afghanistan            Kabul   29121286 647500
## 4  Antigua and Barbuda       St. John's      86754    443
## 5             Anguilla       The Valley      13254    102
## 6              Albania           Tirana    2986952  28748

4.5 Scale Up

In Tables, we were able to scrape a table from https://scrapethissite.com/pages/forms/. If you look at this page, you can see by the pagination menu at the bottom that there are many pages, each with its own table.

After we have successfully scraped a first page as a test, our next step is to scale up and modify our code to scrape multiple URLs.

To scrape multiple pages, we need to gather all of our URLs, download multiple pages, and save HTML files to checkpoint our work.

4.5.1 Gathering URLs

The strategy we use for gathering all of our URLs depends on how they are structured and whether we know all values the parameters can take.

We will use https://scrapethissite.com/pages/forms/ for all three of our strategies.

4.5.1.1 Predictable URLs, Known Parameter Values

Navigate to https://scrapethissite.com/pages/forms/ in a browser. Scroll to the bottom, and click on “2” to be taken to the second page. Note what the URL is now:

http://www.scrapethissite.com/pages/forms/?page_num=2

Going to the third, fourth, or any other page will yield a URL following the same pattern. Note that we can also access the first page at https://www.scrapethissite.com/pages/forms/?page_num=1 as well as the shorter https://www.scrapethissite.com/pages/forms/.

With the knowledge that we have 24 pages in total, we can assemble all possible URLs by pasting our common prefix with the numbers 1-24:

paste0("http://scrapethissite.com/pages/forms/?page_num=", 1:24)
##  [1] "http://scrapethissite.com/pages/forms/?page_num=1" 
##  [2] "http://scrapethissite.com/pages/forms/?page_num=2" 
##  [3] "http://scrapethissite.com/pages/forms/?page_num=3" 
##  [4] "http://scrapethissite.com/pages/forms/?page_num=4" 
##  [5] "http://scrapethissite.com/pages/forms/?page_num=5" 
##  [6] "http://scrapethissite.com/pages/forms/?page_num=6" 
##  [7] "http://scrapethissite.com/pages/forms/?page_num=7" 
##  [8] "http://scrapethissite.com/pages/forms/?page_num=8" 
##  [9] "http://scrapethissite.com/pages/forms/?page_num=9" 
## [10] "http://scrapethissite.com/pages/forms/?page_num=10"
## [11] "http://scrapethissite.com/pages/forms/?page_num=11"
## [12] "http://scrapethissite.com/pages/forms/?page_num=12"
## [13] "http://scrapethissite.com/pages/forms/?page_num=13"
## [14] "http://scrapethissite.com/pages/forms/?page_num=14"
## [15] "http://scrapethissite.com/pages/forms/?page_num=15"
## [16] "http://scrapethissite.com/pages/forms/?page_num=16"
## [17] "http://scrapethissite.com/pages/forms/?page_num=17"
## [18] "http://scrapethissite.com/pages/forms/?page_num=18"
## [19] "http://scrapethissite.com/pages/forms/?page_num=19"
## [20] "http://scrapethissite.com/pages/forms/?page_num=20"
## [21] "http://scrapethissite.com/pages/forms/?page_num=21"
## [22] "http://scrapethissite.com/pages/forms/?page_num=22"
## [23] "http://scrapethissite.com/pages/forms/?page_num=23"
## [24] "http://scrapethissite.com/pages/forms/?page_num=24"

We only have one parameter in this example, “page_num”. We may sometimes have multiple parameters, each of which can take on several values. An easy way to find all possible combinations between a series of vectors is the crossing() function from the tidyr package.

As an example, let’s imagine we are scraping a site that has election data from 2012 and 2016 for every US state, all stored on separate pages. The parameter for year is “year” and state is “state”. We can just use five states for space reasons, and we will pull these from the state.abb vector.

library(tidyr)

urls <- 
  crossing("https://www.example.com/",
           "?state=",
           state.abb[1:5],
           "?year=",
           c(2012, 2016)) |> 
  unite("urls", sep = "") |> 
  pull(urls)

urls
##  [1] "https://www.example.com/?state=AK?year=2012"
##  [2] "https://www.example.com/?state=AK?year=2016"
##  [3] "https://www.example.com/?state=AL?year=2012"
##  [4] "https://www.example.com/?state=AL?year=2016"
##  [5] "https://www.example.com/?state=AR?year=2012"
##  [6] "https://www.example.com/?state=AR?year=2016"
##  [7] "https://www.example.com/?state=AZ?year=2012"
##  [8] "https://www.example.com/?state=AZ?year=2016"
##  [9] "https://www.example.com/?state=CA?year=2012"
## [10] "https://www.example.com/?state=CA?year=2016"

4.5.1.2 Predictable URLs, Unknown Parameter Values

Look again at this code we used earlier to generate a vector of URLs.

paste0("https://scrapethissite.com/pages/forms/?page_num=", 1:24)

We already knew there were 24 pages. What if we did not know the number of pages? Why, we could scrape it!

Recall that at the bottom of the page, there are links to every other page. The highest number corresponds to the number of pages. If we could extract the maximum number from the pagination buttons, we would have what we need for the URLs.

(As an alternative, we could get the href attribute of each button. However, many websites will not include a link to every single page, but there is usually a link to the last page, either as a number or as something like “Last”.)

First, download the first page.

dat <- 
  bow("http://www.scrapethissite.com/pages/forms/") |> 
  scrape()

After using the Inspector to explore the HTML code of this page, we find the code for the pagination buttons, here abbreviated with ... to just show 1, 2, and 24:

<div class="row pagination-area">
  <div class="col-md-10 text-center">
    <ul class="pagination">
      <li>
        <a href="/pages/forms/?page_num=1">
          <strong>
            1
          </strong>
        </a>
      </li>
      <li>
        <a href="/pages/forms/?page_num=2">
          2
        </a>
      </li>
...
      <li>
        <a href="/pages/forms/?page_num=24">
          24
        </a>
      </li>

After experimenting with CSS selectors, we find that .pagination>li will select all of the buttons. Passing the result to html_text2() will extract the text, which are numbers. These are character values, but R is able to parse these if we wrap them in as.numeric(). The maximum of these values is the highest page number, 24! The reason we need to specify na.rm = T in max() is that the “next page” button is a symbol, “»”, that becomes NA when coerced to a numeric value.

dat <- 
  bow("http://www.scrapethissite.com/pages/forms/") |> 
  scrape()

n_pages <- 
  dat |> 
  html_elements(".pagination>li") |> 
  html_text2() |> 
  as.numeric() |> 
  max(na.rm = T)

n_pages
## [1] 24

Then, all we have to do is create a vector from 1 to n_pages and paste it together with our URL prefix, and then we have our URLs to scrape!

urls <- paste0("https://scrapethissite.com/pages/forms/?page_num=", 1:n_pages)

If we had more than one parameter affix to our URLs, we could hunt through a page to see if that data is anywhere on the page, just as we found the number of pages in a pagination menu.

4.5.1.3 Unpredictable URLs

What if the URLs do not follow a predictable pattern, if the first page is “https://www.example.com/b87d3” and the second page is “https://www.example.com/934h8”? So long as each page has a link to the next, we can download a page, identify the “next page” button, get its URL, download the linked page, and repeat until there is no “next page” button.

This strategy is a bit tricky, and the implementation will surely vary from one site to the next. We will continue using http://www.scrapethissite.com/pages/forms/ even though we know the URLs follow a clear pattern.

Step 1. Download a page.

page1 <- 
  bow("http://www.scrapethissite.com/pages/forms/") |> 
  scrape()

Step 2. Find the next button, write a CSS selector for it, and pull the value of href.

next_page <-
  page1 |> 
  html_elements("[aria-label='Next']") |>  
  html_attr("href")

next_page
## [1] "/pages/forms/?page_num=1"

Step 3. Verify that this is is a full URL (starting with “http”) that we can scrape, and change it if not. This is a root-relative path, so prefix it with the website’s root URL.

next_page <-
  paste0("http://www.scrapethissite.com",
         next_page)

next_page
## [1] "http://www.scrapethissite.com/pages/forms/?page_num=1"

Step 4. Check the value of next_page if there is no next page, to build a condition for our scraper. In this case, there is no “next page” button on the last page. We can use purrr’s is_empty() function. It will return TRUE on the last page, but FALSE for our non-empty next_page.

library(purrr)

next_page_final <-
  bow("http://www.scrapethissite.com/pages/forms/?page_num=24") |> 
  scrape() |> 
  html_elements("[aria-label='Next']") |> 
  html_attr("href")

is_empty(next_page_final)
## [1] TRUE
is_empty(next_page)
## [1] FALSE

Step 5. Write a loop that downloads a page, gets the URL for the next page, and scrapes that, repeating until there is no next page. Initialize the loop with an empty list, all_dat, where we will store our downloaded pages, and the root-relative path for the first page as next_page. The logical test for our loop is while (!is_empty(next_page)), which will return TRUE until the last page. If we use a while loop, there will be no delay between scrapes since each bow() is independent, so we need to add a pause of five seconds (or more) with Sys.sleep(5).

# initialize with empty list and first page
all_dat <- list()
next_page <- "/pages/forms/?page_num=1"
i <- 0

# proceed if there is a next page from the last scrape
while (!is_empty(next_page)) {
  
  # increase iteration count
  i <- i + 1
  
  # assemble URL
  link <- 
    paste0("http://www.scrapethissite.com", 
           next_page)
  
  # download current page
  dat <- 
    bow(link) |> 
    scrape()
  
  # store downloaded page in list
  all_dat[[i]] <- dat
  
  # get link for next page
  next_page <- 
    dat |> 
    html_elements("[aria-label='Next']") |> 
    html_attr("href")
  
  # delay for five seconds before next download
  Sys.sleep(5)
}

This way of downloading pages is much more complex than our simple pasting of a URL and a known parameter range, but it handles unpredictable URLs with unknown parameter values. With a little more work, we can make an adaptive scraper!

Note that this last strategy is not simulating a session in a browser, and that this is not covered in this book. To learn more, see ?rvest::session.

4.5.2 Downloading Multiple Pages

The last example with unpredictable URLs returned a list of downloaded pages. The previous two that used predictable URLs ended with a character vector. To download all pages from a given character vector of URLs, use map() from purrr. Within map(), the first argument is our vector or list, and the second argument is the function we would like to apply to each element in the first argument. The function needs to be prefixed with ~, and we can reference the first argument with .x. Note that we do not need to manually specify a delay of five seconds this time since it is automatically applied.

library(purrr)

urls <- paste0("http://scrapethissite.com/pages/forms/?page_num=", 1:24)

dat_all <- 
  map(urls, 
      ~ bow(.x) |> scrape())

Then, map_dfr() can be used to go through each element of our page list and pull out a dataframe, and then row-bind all of them together.

dat_tables <- 
  map_dfr(dat_all, 
          ~ html_elements(.x, "table") |> html_table())

4.5.3 Saving HTML Files

Web scraping is not usually a one-day task. It is important to know how to save our downloaded webpages in order to checkpoint our work so that we can come back to it the next day. The result of scrape() cannot be saved as .RData or .RDS files, so we need to save them as HTML files.

If we only have one page, we can use write_html() to save our files, and read_html() to read them back in.

dat_single <- 
  bow("https://scrapethissite.com/pages/forms/") |> 
  scrape()

write_html(dat_single, "dat_single.html")

dat_single <- read_html("dat_single.html")

If we have multiple pages, as we did when we scraped 24 pages of a site, we could repeat the above code for each object in our list. Or, you may copy the code below for the functions write_html_list() and read_html_list(), which were written to save and load lists of HTML files. Tell each one where to save or look for your files with the folder argument.

read_html_list() will return an object called html_list_FOLDER, where FOLDER is the cleaned and truncated folder name.

The functions require xml2 and purrr, so be sure to have these installed on your computer.

# function for writing html files

write_html_list <- 
  function(x, folder = NULL) {
    
    if (is.null(folder)) {
      stop("Specify a folder where you want to save the HTML files.")
    }
    
    # create the specified directory if it does not already exist
    if (!folder %in% list.files()) {
      dir.create(folder)
    }
    
    # write pages as individual html files
    for (i in 1:length(x)) {
      xml2::write_html(x[[i]],
                       paste0(folder, "/page", i, ".html"))
    }
    
  }

# function for reading html files

read_html_list <-
  function(folder = NULL) {
    
    if (is.null(folder)) {
      stop("Specify the folder where your HTML files are saved.")
    }
    
    # find all html files in the specified directory
    html_files <- 
      paste0(folder, "/",
             list.files(path = folder, pattern = "((\\.html)|(\\.HTML))$"))
    
    # read files in as a list called "html_list_FOLDER"
    # clean folder name first to make it syntactically valid
        assign(paste0("html_list_", 
                  substr(gsub("[^[:alnum:]]", "", folder), 
                         1, 8)),
           purrr::map(html_files, xml2::read_html),
           envir = .GlobalEnv)
  }

Here is an example use of the two functions. Note that the object returned by read_html_list() will be named html_list_myfiles.

urls <- paste0("https://scrapethissite.com/pages/forms/?page_num=", 1:24)

dat_multiple <- 
  map(urls, 
      ~ bow(.x) |> scrape())

write_html_list(dat_multiple, folder = "myfiles")

read_html_list(folder = "myfiles")

4.6 Exercises

Scrape the names of turtles from http://www.scrapethissite.com/pages/frames/.

Hint: The main part of this page is an iframe, which you have to scrape by first going to its source.