5 Advanced Techniques

5.1 Changing User-Agents

Part of what shows up in web traffic logs is the user-agent, which typically says something about your operating system and browser. If you Google “my user-agent”, you’ll see yours. Here is mine:

“Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36”

The default user-agent with bow() is “polite R package - https://github.com/dmi3kno/polite”.

It is recommended that you change this to give your contact information, so that the website admin can contact you if they see your scrapes in their logs:

bow(url, user_agent = "I am scraping your site for a research project. You can contact me at user@example.com")

But some sites may check your user-agent to see whether the request is coming from a browser, and refuse to serve content if it is not. In this case, neither bow()’s default nor our friendly message will register as a real request from a browser. (For an example, compare what you see when you visit http://www.scrapethissite.com/pages/advanced/?gotcha=headers in a browser versus when you scrape it.)
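
You can check this from R. The sketch below uses rvest’s html_element() and html_text2() to pull out the page text that the scraper actually receives when using polite’s default user-agent:

library(polite)
library(rvest)

# Fetch the gotcha page with polite's default user-agent
session <- bow("http://www.scrapethissite.com/pages/advanced/?gotcha=headers")

# Pull out the page text the scraper actually receives
scrape(session) |>
  html_element("body") |>
  html_text2()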

To get around this issue, you can simply use a normal browser user-agent when scraping, such as:

bow(url, user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36")

Another strategy is user-agent rotation, where we use a different user-agent for each page we scrape. To implement this, copy (or scrape!) agents from this list of agents, place them in a character vector called agents, and then sample from this vector for each scrape:

library(polite)
library(purrr)

# url is a character vector of the pages to scrape;
# each page gets a user-agent drawn at random from agents
dat <- 
  map(url, 
      ~ bow(.x, user_agent = sample(agents, 1)) |> scrape())
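
Here, agents is just a character vector of user-agent strings. A minimal hand-assembled sketch (the particular strings are placeholders; in practice you would collect many more from the list above):

agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"
)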

Even if you do this, your IP address is still in the web traffic logs, so you are not anonymous. The way around this would be to use proxy servers, which is out of scope for this introduction.

At this point, though, if you are spoofing and rotating your user-agents and hiding your IP address, you should stop and check if you are adhering to the site’s Terms of Service and robots.txt. See the section Being a Good Scraper of the Web, and also the court case eBay v. Bidder’s Edge before you consider taking such steps to scrape a site.

5.2 Long Scrapes

If the job will take more than a few minutes to run, submit it to Slurm. Of course, first test your script on a few pages to make sure it works!
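
In practice, that means putting the scrape into a standalone R script that saves its results to disk, so it can run non-interactively under Slurm. Here is a sketch; the urls.txt input file and the output file name are purely illustrative:

# scrape_job.R -- run non-interactively, e.g. with `Rscript scrape_job.R`
# inside your sbatch submission script
library(polite)
library(purrr)

# one URL per line (illustrative input file)
urls <- readLines("urls.txt")

pages <- map(urls,
             ~ bow(.x, user_agent = "user@example.com") |> scrape())

# save the raw pages so parsing can happen later
saveRDS(pages, "pages.rds")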