1 Introduction

1.1 What is Web Scraping?

Web scraping is the process of collecting data from webpages.

Scraped data is especially useful for research in the social sciences because such data does not usually exist in an easily downloadable format suited to the research question. The simplest form of web scraping is copying and pasting text from a webpage into a document. However, this process is prone to human error, and it does not scale well given the amount of time it takes the researcher. Automating the process overcomes these shortcomings, but it in turn forces researchers to deal with the fact that online data is already formatted, though probably not in the format we want,¹ and with validity threats introduced by the arbitrary criteria we implement when scraping and analyzing data.²

Through recognizing the limitations of online data and implementing a principled and transparent approach to web scraping, “fields of gold” are made available to the researcher.³

This article serves as an introduction to web scraping with R. After reading the materials and completing the exercises, you should be able to:

  • Apply best practices for friendly web scraping
  • Write CSS selectors that grab relevant HTML elements
  • Extract text, links, and tables from webpages
  • Scrape multiple pages
  • Save and load HTML files

1.2 Being a Good Scraper of the Web

The act of scraping the web and the scraped data introduce issues of legality, copyright, and privacy, as well as the practical matter of how the website handles traffic from your scraper.

When scraping, it is important to keep these things in mind. This section does not constitute legal advice. Rather, the intention is to make you aware of these issues and provide guidelines whereby you can avoid trouble and not overwhelm or crash websites.

1.2.1 robots.txt

The robots.txt file can be found at the root level of a website (typically right after the .com, .org, etc.) and gives instructions to web crawlers and scrapers as to which areas of the website they may scrape. If we do not adhere to the guidelines in this file, we risk being blocked by the website.
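Because robots.txt is a plain text file at the site root, we can even pull one up from R. As a quick sketch using base R (here for the site we scrape later in this chapter):

```r
# robots.txt sits at the root of the site, so we can read it
# like any other text file on the web
readLines("https://www.scrapethissite.com/robots.txt")
```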

For example, browse these robots.txt files:

  • google.com/robots.txt
  • linkedin.com/robots.txt
  • scrapethissite.com/robots.txt

In each one, you will find instructions for various “user-agents”, the names of specific pieces of scraping software. We are interested in the User-agent: * section, where * is a wildcard that gives instructions to any bot not listed elsewhere in the file, including us. The lines that follow, beginning with Allow: or Disallow:, tell us which parts of the site we can scrape. In Google’s robots.txt file, we see

User-agent: *
Disallow: /search
Allow: /search/about

The leading slash / in the (dis)allowed paths indicates that they start from the root directory: /search refers to google.com/search. These lines mean that we can scrape google.com/search/about but not anything else starting with google.com/search. In other words, we cannot program a scraper to run Google searches for us.
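We can watch these rules in action with the polite R package, which we introduce properly below; this is only a sketch of what bowing to an allowed and a disallowed path looks like.

```r
library(polite)

# bow() reads google.com/robots.txt and checks the path for us.
# /search is disallowed for User-agent: *, so the session reports
# that the path is not scrapable; /search/about is allowed.
bow("https://www.google.com/search")
bow("https://www.google.com/search/about")
```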

At the bottom of the LinkedIn robots.txt file, we see

User-agent: *
Disallow: /

# Notice: If you would like to crawl LinkedIn,
# please email whitelist-crawl@linkedin.com to apply
# for white listing.

This is as straightforward as it comes. We cannot scrape any part of the website, though we could contact LinkedIn to request access.

The entire robots.txt file for Scrape This Site is as follows:

User-agent: *
Disallow: /lessons/
Disallow: /faq/

This tells us that, apart from URLs starting with scrapethissite.com/lessons or scrapethissite.com/faq, we can scrape the site as we please. We will be using this site for several of our exercises.

For more reading on robots.txt files, see the Wikipedia page on robots.txt or Google’s introduction to robots.txt.

The good news is that we do not have to read through robots.txt files ourselves when scraping. We will use the polite R package, which checks the file against the URL we want to scrape and only lets us proceed if we are allowed to.

1.2.2 Speed

Most webpages are created with people in mind, not only in their format but also in the amount of traffic they can handle. Compared to the manual method of copy-paste, automated web scrapers are very, very fast, and it is possible to crash a website by sending too many requests too quickly; doing so deliberately is called a Denial-of-Service (DoS) attack.

The polite R package also imposes a minimum of five seconds’ delay between scrapes, and we can make this longer if we want. We may also see longer limits in the robots.txt file, such as these lines from https://pythonscraping.com/robots.txt:

User-agent: *
Crawl-delay: 10

More good news: polite will see this line and adjust our scraping speed accordingly. We will begin each scrape with bow(), which checks whether we can scrape a site. Notice the difference in the fourth line of the output (“Crawl delay”) for these two websites: polite is checking the robots.txt file for us!
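The output below comes from bowing to the two sites, a sketch assuming polite is loaded:

```r
library(polite)

# bow() reads each site's robots.txt and reports its rules
bow("http://www.scrapethissite.com/")
bow("http://pythonscraping.com/")
```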

## <polite session> http://www.scrapethissite.com/
##     User-agent: polite R package
##     robots.txt: 2 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent
## <polite session> http://pythonscraping.com/
##     User-agent: polite R package
##     robots.txt: 36 rules are defined for 1 bots
##    Crawl delay: 10 sec
##   The path is scrapable for this user-agent

1.2.3 Legality

Just because the robots.txt file allows us to scrape a site does not mean it is legal to download and distribute the data we scrape.

There is no function we can use to quickly determine if we are scraping legally. We need to do our due diligence in reading a site’s Terms of Service, considering whether we are scraping copyrighted data, and figuring out if we are collecting personally identifiable information for which we would need IRB approval for human subjects research.

1.2.4 When in Doubt, Ask

If you are not sure whether your scraping is allowed by a site, email the website owner and ask if you can scrape it.

If you are not sure about the legality of scraping a site, email the site’s owner, and consult with your advisor, the IRB, and/or a lawyer.

1.2.5 Other Means of Obtaining Data

Some websites, such as the US Census Bureau’s, offer their data in a downloadable format. Others allow for data access through Application Programming Interfaces (APIs). APIs are outside the scope of this book, but see RStudio’s talks on Using Web APIs from R and on The Basics, and Some of the Pitfalls, of Calling Web APIs from Within R.

1.3 First Example

With web scraping, we have a lot to learn before we can actually scrape, so let’s first look at where we are headed. Below is a short script that scrapes the three tables on https://elections.countyofdane.com/Election-Result/135 and puts them together in a list called results.

What do you notice about the script?

Download the webpage:


library(polite)
library(rvest)

dat <- 
  bow("https://elections.countyofdane.com/Election-Result/135") |> 
  scrape()  # bow() checks robots.txt; scrape() downloads the page

Scrape the tables and give them names:


results <- 
  dat |> 
  html_table()  # extract every table on the page as a list of tibbles

names(results) <-
  dat |> 
  html_elements("h4") |> 
  html_text2()  # the table titles appear in <h4> headings

results

## $`Representative to the Assembly District 37 - Official Canvass`
## # A tibble: 4 × 3
##   Candidate                          `Vote Percentage` `Number of Votes`
##   <chr>                              <chr>             <chr>            
## 1 Dem Pete Adams (Dem)               59.2%             1,161            
## 2 Rep William Penterman (Rep)        38.7%             758              
## 3 IND Stephen W. Ratzlaff, Jr. (IND) 2.1%              41               
## 4 Write-in (Non)                     0.0%              0                
## $`County Supervisor District 19 - Official Canvass`
## # A tibble: 3 × 3
##   Candidate               `Vote Percentage` `Number of Votes`
##   <chr>                   <chr>                         <int>
## 1 Timothy Rockwell (Non)  52.4%                           610
## 2 Kristen M. Morris (Non) 47.5%                           553
## 3 Write-in (Non)          0.1%                              1
## # A tibble: 3 × 2
##   X1                      X2    
##   <chr>                   <chr> 
## 1 Registered Voters Total 21,787
## 2 Ballots Cast Total      3,127 
## 3 Turnout Percentage      14.4%

We can observe a few things in this script:

  1. We see “html” several times in the code.
  2. The script is very short.
  3. The results are messy. The “percent” columns are all character data, and the third dataframe does not have any column names.

On the first point, this means we have to learn some of the basics of HTML (and CSS).

On the second point, although the code is simple, a bit of work had to be done to figure out how to scrape the elements we wanted. How did we determine that the names of results should come from html_elements("h4")? What is h4? We will learn that in the next chapter.

On the third point, this is where data wrangling has a role to play. Wrangling, or tidying, data is outside the scope of this material, but see our materials online and enroll in our training!
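As a small taste of that wrangling (a sketch using the readr package, which is not part of the scraping itself), the character percentages and comma-formatted counts can be converted to numbers with parse_number():

```r
library(readr)

# parse_number() drops non-numeric characters such as "%" and ","
parse_number("59.2%")  # 59.2
parse_number("1,161")  # 1161
```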

  1. Marres, N., & Weltevrede, E. (2013). Scraping the social? Issues in live social research. Journal of Cultural Economy, 6(3), 313–335. https://doi.org/10.1080/17530350.2013.772070↩︎

  2. Xu, H., Zhang, N., & Zhou, L. (2020). Validity concerns in research using organic data. Journal of Management, 46(7), 1257–1274. https://doi.org/10.1177/0149206319862027↩︎

  3. Boegershausen, J., Borah, A., & Stephen, A. (2021). Fields of gold: Web scraping for consumer research. Marketing Science Institute Working Paper Series 2021. Available at https://www.msi.org/working-papers/fields-of-gold-web-scraping-for-consumer-research/↩︎