8 Scale Up

In Tables, we were able to scrape a table from https://scrapethissite.com/pages/forms/. If you look at this page, you can see from the pagination menu at the bottom that there are many pages, each with its own table.

After we have successfully scraped a first page as a test, our next step is to scale up and modify our code to scrape multiple URLs.

To scrape multiple pages, we need to:

  1. Gather all the URLs we want to scrape
  2. Download multiple pages
  3. Save HTML files to checkpoint our work

8.1 Gathering URLs

The strategy we use for gathering all of our URLs depends on how they are structured and whether we know all values the parameters can take.

We will use https://scrapethissite.com/pages/forms/ for all three of our strategies.

8.1.1 Predictable URLs, Known Parameter Values

Navigate to https://scrapethissite.com/pages/forms/ in a browser. Scroll to the bottom and click on “2” to be taken to the second page. Note what the URL is now:

http://www.scrapethissite.com/pages/forms/?page_num=2

Going to the third, fourth, or any other page will yield a URL following the same pattern. Note that we can also access the first page at https://www.scrapethissite.com/pages/forms/?page_num=1 in addition to the shorter https://www.scrapethissite.com/pages/forms/.

With the knowledge that we have 24 pages in total, we can assemble all possible URLs by pasting our common prefix with the numbers 1-24:

paste0("http://scrapethissite.com/pages/forms/?page_num=", 1:24)
##  [1] "http://scrapethissite.com/pages/forms/?page_num=1" 
##  [2] "http://scrapethissite.com/pages/forms/?page_num=2" 
##  [3] "http://scrapethissite.com/pages/forms/?page_num=3" 
##  [4] "http://scrapethissite.com/pages/forms/?page_num=4" 
##  [5] "http://scrapethissite.com/pages/forms/?page_num=5" 
##  [6] "http://scrapethissite.com/pages/forms/?page_num=6" 
##  [7] "http://scrapethissite.com/pages/forms/?page_num=7" 
##  [8] "http://scrapethissite.com/pages/forms/?page_num=8" 
##  [9] "http://scrapethissite.com/pages/forms/?page_num=9" 
## [10] "http://scrapethissite.com/pages/forms/?page_num=10"
## [11] "http://scrapethissite.com/pages/forms/?page_num=11"
## [12] "http://scrapethissite.com/pages/forms/?page_num=12"
## [13] "http://scrapethissite.com/pages/forms/?page_num=13"
## [14] "http://scrapethissite.com/pages/forms/?page_num=14"
## [15] "http://scrapethissite.com/pages/forms/?page_num=15"
## [16] "http://scrapethissite.com/pages/forms/?page_num=16"
## [17] "http://scrapethissite.com/pages/forms/?page_num=17"
## [18] "http://scrapethissite.com/pages/forms/?page_num=18"
## [19] "http://scrapethissite.com/pages/forms/?page_num=19"
## [20] "http://scrapethissite.com/pages/forms/?page_num=20"
## [21] "http://scrapethissite.com/pages/forms/?page_num=21"
## [22] "http://scrapethissite.com/pages/forms/?page_num=22"
## [23] "http://scrapethissite.com/pages/forms/?page_num=23"
## [24] "http://scrapethissite.com/pages/forms/?page_num=24"

We only have one parameter in this example, “page_num”. We may sometimes have multiple parameters, each of which can take on several values. An easy way to generate all possible combinations of a series of vectors is the crossing() function from the tidyr package.

As an example, let’s imagine we are scraping a site that has election data from 2012 and 2016 for every US state, all stored on separate pages. The parameter for year is “year”, and the parameter for state is “state”. (In a URL, the first parameter follows a “?”, and any additional parameters are joined with “&”.) To save space, we will use just five states, pulled from the built-in state.abb vector.

library(tidyr)

urls <- 
  crossing("https://www.example.com/",
           "?state=",
           state.abb[1:5],
           "?year=",
           c(2012, 2016)) |> 
  unite("urls", sep = "") |> 
  pull(urls)

urls
##  [1] "https://www.example.com/?state=AK?year=2012"
##  [2] "https://www.example.com/?state=AK?year=2016"
##  [3] "https://www.example.com/?state=AL?year=2012"
##  [4] "https://www.example.com/?state=AL?year=2016"
##  [5] "https://www.example.com/?state=AR?year=2012"
##  [6] "https://www.example.com/?state=AR?year=2016"
##  [7] "https://www.example.com/?state=AZ?year=2012"
##  [8] "https://www.example.com/?state=AZ?year=2016"
##  [9] "https://www.example.com/?state=CA?year=2012"
## [10] "https://www.example.com/?state=CA?year=2016"

8.1.2 Predictable URLs, Unknown Parameter Values

Look again at this code we used earlier to generate a vector of URLs.

paste0("https://scrapethissite.com/pages/forms/?page_num=", 1:24)

We already knew there were 24 pages. What if we did not know the number of pages? We could scrape it!

Recall that at the bottom of the page, there are links to every other page. The highest number corresponds to the number of pages. If we could extract the maximum number from the pagination buttons, we would have what we need for the URLs.

(As an alternative, we could get the href attribute of each button. Many websites will not include a link to every single page, but there is usually a link to the last page, either as a number or as a word like “Last”.)
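
For instance, a minimal sketch of that alternative, using a variant of the .pagination selector we settle on below, would collect the href of every pagination link:

library(polite)
library(rvest)

# download the first page and pull the href of every pagination link
pagination_hrefs <- 
  bow("http://www.scrapethissite.com/pages/forms/") |> 
  scrape() |> 
  html_elements(".pagination a") |> 
  html_attr("href")

We could then dig the largest page_num value out of those hrefs, or keep the href of the “Last” link on sites that have one. Below, we take the simpler route of reading the button text itself.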

First, download the first page.

dat <- 
  bow("http://www.scrapethissite.com/pages/forms/") |> 
  scrape()

After using the Inspector to explore the HTML code of this page, we find the code for the pagination buttons, here abbreviated with ... to just show 1, 2, and 24:

<div class="row pagination-area">
  <div class="col-md-10 text-center">
    <ul class="pagination">
      <li>
        <a href="/pages/forms/?page_num=1">
          <strong>
            1
          </strong>
        </a>
      </li>
      <li>
        <a href="/pages/forms/?page_num=2">
          2
        </a>
      </li>
...
      <li>
        <a href="/pages/forms/?page_num=24">
          24
        </a>
      </li>
...
    </ul>
  </div>
</div>

After experimenting with CSS selectors, we find that .pagination>li will select all of the buttons. Passing the result to html_text2() extracts their text, which is the page numbers stored as character values; wrapping these in as.numeric() converts them to numbers. The maximum of these values is the highest page number, 24! We do need to specify na.rm = T in max() since the “next page” button is a symbol, “»”, which becomes NA when coerced to a numeric value.

dat <- 
  bow("http://www.scrapethissite.com/pages/forms/") |> 
  scrape()

n_pages <- 
  dat |> 
  html_elements(".pagination>li") |> 
  html_text2() |> 
  as.numeric() |> 
  max(na.rm = T)

n_pages
## [1] 24

Then, all we have to do is create a vector from 1 to n_pages and paste it together with our URL prefix, and then we have our URLs to scrape!

urls <- paste0("https://scrapethissite.com/pages/forms/?page_num=", 1:n_pages)

If we had more than one parameter to affix to our URLs, we could hunt through the page to see whether those values appear anywhere on it, just as we found the number of pages in the pagination menu.
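
As a hypothetical sketch (example.com, the year values, and the parameter names are made up for illustration), suppose we had also scraped a set of years from the page; we could then cross them with the page numbers from n_pages:

library(tidyr)

# hypothetical parameter values we would have scraped from the page
years <- c(2012, 2016, 2020)

# cross every year with every page number and paste into URLs
urls <- 
  crossing("https://www.example.com/?",
           paste0("year=", years),
           paste0("&page_num=", 1:n_pages)) |> 
  unite("urls", sep = "") |> 
  pull(urls)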

8.1.3 Unpredictable URLs

What if the URLs do not follow a predictable pattern, say the first page is “https://www.example.com/b87d3” and the second page is “https://www.example.com/934h8”? So long as each page has a link to the next, we can download a page, identify the “next page” button, get its URL, download the linked page, and repeat until there is no “next page” button.

This strategy is a bit involved, and its implementation will surely vary from one site to the next. (Automated scrapers must be custom built for each site we want to scrape.) We will continue using http://www.scrapethissite.com/pages/forms/ even though we know the URLs follow a clear pattern.

Step 1. Download a page.

page1 <- 
  bow("http://www.scrapethissite.com/pages/forms/") |> 
  scrape()

Step 2. Find the “next page” button, write a CSS selector for it, and pull the value of its href attribute.

next_page <-
  page1 |> 
  html_elements("[aria-label='Next']") |>  
  html_attr("href")

next_page
## [1] "/pages/forms/?page_num=1"

Step 3. Verify that this is a full URL (starting with “http”) that we can scrape, and change it if not. Here we have a root-relative path, so we prefix it with the website’s root URL.

next_page <-
  paste0("http://www.scrapethissite.com",
         next_page)

next_page
## [1] "http://www.scrapethissite.com/pages/forms/?page_num=1"
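
If we want R to make this check for us rather than eyeballing it, a minimal sketch is:

# prepend the site root only if the link is not already a full URL
if (!grepl("^http", next_page)) {
  next_page <- paste0("http://www.scrapethissite.com", next_page)
}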

Step 4. Check what value next_page takes when there is no next page. This will give us a “stop” condition for our scraper. In this case, the last page has no “next page” button, so our selector matches nothing and we get back an empty vector. We can test for this with purrr’s is_empty() function, which returns TRUE for the last page but FALSE for our non-empty next_page.

library(purrr)

next_page_final <-
  bow("http://www.scrapethissite.com/pages/forms/?page_num=24") |> 
  scrape() |> 
  html_elements("[aria-label='Next']") |> 
  html_attr("href")

is_empty(next_page_final)
## [1] TRUE
is_empty(next_page)
## [1] FALSE

Step 5. Write a loop that downloads a page, gets the URL for the next page, and scrapes that, repeating until there is no next page. Initialize the loop with an empty list, all_dat, where we will store our downloaded pages; the root-relative path for the first page as next_page; and a counter i set to 0. The logical test for our loop is while (!is_empty(next_page)), which returns TRUE as long as the previous page had a “next page” link. Because each bow() in a while loop is independent, there is no delay between scrapes, so we need to add a pause of five seconds (or more) with Sys.sleep(5).

# initialize with empty list and first page
all_dat <- list()
next_page <- "/pages/forms/?page_num=1"
i <- 0

# proceed if there is a next page from the last scrape
while (!is_empty(next_page)) {
  
  # increase iteration count
  i <- i + 1
  
  # assemble URL
  link <- 
    paste0("http://www.scrapethissite.com", 
           next_page)
  
  # download current page
  dat <- 
    bow(link) |> 
    scrape()
  
  # store downloaded page in list
  all_dat[[i]] <- dat
  
  # get link for next page
  next_page <- 
    dat |> 
    html_elements("[aria-label='Next']") |> 
    html_attr("href")
  
  # delay for five seconds before next download
  Sys.sleep(5)
}

This way of downloading pages is much more complex than our simple pasting of a URL and a known parameter range, but it handles unpredictable URLs with unknown parameter values. With a little more work, we can make an adaptive scraper!

An alternative approach would be to simulate a browser session. To learn more, see ?rvest::session.
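
A rough sketch of what a session-based approach could look like is below. Note that session() comes from rvest rather than polite, so it will not read robots.txt or rate-limit for us; we would need to add our own Sys.sleep() delays.

library(rvest)

# start a browser-like session on the first page
s <- session("http://www.scrapethissite.com/pages/forms/")

# follow the "next page" link using the same CSS selector as before
s <- session_follow_link(s, css = "[aria-label='Next']")

# the session now holds the next page; extract from it as usual
s |> 
  html_elements("table") |> 
  html_table()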

8.2 Downloading Multiple Pages

The last example with unpredictable URLs returned a list of downloaded pages. The previous two that used predictable URLs ended with a character vector of URLs. To download all pages from a given character vector of URLs, use the map() function from purrr. In map(), the first argument is our vector or list, and the second argument is the function we would like to apply to each element of the first argument. To learn more about map(), read Functions and Iterations in R. Note that we do not need to manually specify a delay of five seconds this time since it is applied automatically.

library(purrr)

urls <- paste0("http://scrapethissite.com/pages/forms/?page_num=", 1:24)

dat_all <- 
  map(urls, 
      \(x) bow(x) |> scrape())

We can map() html_table() over each of the pages we downloaded to get a list of data frames, and then use bind_rows() from dplyr to combine them into a single data frame.

library(dplyr)

dat_tables <- 
  map(dat_all, 
      \(x) html_elements(x, "table") |> html_table()) |> 
  bind_rows()

8.3 Saving HTML Files

Web scraping is rarely a one-day task, so we need to save our downloaded webpages in a way that lets us continue our work the next day. The result of scrape() is backed by an external pointer, so it cannot usefully be saved to .RData or .RDS files; instead, we save the pages as HTML files.

If we only have one page, we can use write_html() to save our files, and read_html() to read them back in.

dat_single <- 
  bow("https://scrapethissite.com/pages/forms/") |> 
  scrape()

write_html(dat_single, "dat_single.html")

dat_single <- read_html("dat_single.html")

If we have multiple pages, as we did when we scraped 24 pages of a site, we could repeat the above code for each object in our list, but copying and pasting code is bad practice.

Instead, copy and run the code below to create two functions, write_html_list() and read_html_list(), which save and load lists of HTML files. Tell each one where to save or look for your files with the folder argument.

read_html_list() will create an object in your environment called html_list_FOLDER, where FOLDER is the folder name with any non-alphanumeric characters removed.

The functions require xml2 and purrr, so be sure to have these installed on your computer.
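
If either is missing, install it first:

install.packages(c("xml2", "purrr"))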

# function for writing html files

write_html_list <- 
  function(x, folder = NULL) {
    
    if (is.null(folder)) {
      stop("Specify a folder where you want to save the HTML files.")
    }
    
    # create the specified directory if it does not already exist
    if (!dir.exists(folder)) {
      dir.create(folder)
    }
    
    # write pages as individual html files, zero-padding the page number
    # so that the files sort (and read back in) in page order
    purrr::iwalk(x,
                 \(x, i) xml2::write_html(x, sprintf("%s/page%03d.html", folder, i)))
  }

# function for reading html files

read_html_list <-
  function(folder = NULL) {
    
    if (is.null(folder)) {
      stop("Specify the folder where your HTML files are saved.")
    }
    
    # find all html files in the specified directory
    html_files <- 
      list.files(path = folder, pattern = "\\.html$",
                 ignore.case = TRUE, full.names = TRUE)
    
    # read files in as a list called "html_list_FOLDER"
    # clean folder name first to make it syntactically valid
    assign(paste0("html_list_", 
                  gsub("[^[:alnum:]]", "", folder)),
           purrr::map(html_files, xml2::read_html),
           envir = .GlobalEnv)
  }

Here is an example use of the two functions. Note that the object created by read_html_list() will be named html_list_myfiles.

urls <- paste0("https://scrapethissite.com/pages/forms/?page_num=", 1:24)

dat_multiple <- 
  map(urls, 
      \(x) bow(x) |> scrape())

write_html_list(dat_multiple, folder = "myfiles")

read_html_list(folder = "myfiles")

Your workflow for a larger web scraping project will start with downloading all the pages and saving them to disk. Subsequent days of work will involve reading those files back into R, writing CSS selectors, and extracting the data you need.
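
For example, a later session might look something like this sketch, assuming the pages were saved with write_html_list() to a folder named "myfiles" as above (and that the helper functions have been re-run in the new session):

library(purrr)
library(rvest)
library(dplyr)

# read the saved pages back in; this creates html_list_myfiles
read_html_list(folder = "myfiles")

# extract the table from each saved page and combine into one data frame
dat_tables <- 
  map(html_list_myfiles, 
      \(x) html_elements(x, "table") |> html_table()) |> 
  bind_rows()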

8.4 Exercises

Scrape the names of turtles from http://www.scrapethissite.com/pages/frames/.

Hint: The main part of this page is an iframe, which you have to scrape by first going to its source.
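
If you get stuck on the hint, the sketch below shows one way to locate the iframe’s source; the rest of the exercise is up to you. If the src value is not a full URL, remember to prefix it with the site root, as we did with root-relative paths above.

# find the iframe on the page and pull its src attribute
frame_src <- 
  bow("http://www.scrapethissite.com/pages/frames/") |> 
  scrape() |> 
  html_element("iframe") |> 
  html_attr("src")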