4 Get Permission

Now that we have some understanding of HTML and CSS, we can specify the information we want to extract from a website. It is finally time to start scraping the web!

Scraping requires five steps:

  1. Get permission
  2. Download a page
  3. Extract data
  4. Wrangle scraped data
  5. Scale up to multiple pages

The first step in any web scraping project is to make sure we are following the rules. We can use bow() to check the robots.txt file, but reading the Terms of Service and weighing copyright and privacy are things we must do on our own.
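
As a rough sketch of the call (the URL, user agent string, and delay value below are placeholders, not values taken from this section), bow() also lets us identify ourselves to the host and set how long to wait between requests:

```
library(polite)

# Open a polite session; the host and user agent string are placeholders.
session <- bow(
  "https://example.com",
  user_agent = "my-project (me@example.com)",  # say who is doing the scraping
  delay = 5                                    # seconds to wait between requests
)
session  # printing the session shows what robots.txt allows
```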

bow() will only let us scrape a site if its robots.txt file allows it. Recall that https://www.google.com/robots.txt contains these lines, which means we are not allowed to scrape google.com/search:

```
User-agent: *
Disallow: /search
```

Here is the session summary after we bow() to this URL:

```
library(polite)
bow("http://www.google.com/search")
## <polite session> http://www.google.com/search
##     User-agent: polite R package
##     robots.txt: 328 rules are defined for 4 bots
##    Crawl delay: 5 sec
##   The path is not scrapable for this user-agent
```

The last line tells us we cannot scrape this path. If we try anyway, scrape() refuses and returns NULL:

bow("http://www.google.com/search") |> 
  scrape()
## Warning: No scraping allowed here!
## NULL
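
Because a disallowed path comes back as NULL, it is worth guarding against that case before moving on to extraction. Here is a minimal sketch using base R's is.null(); the message text is our own:

```
result <-
  bow("http://www.google.com/search") |>
  scrape()

# scrape() returns NULL when robots.txt forbids the path,
# so check before trying to extract anything from it.
if (is.null(result)) {
  message("This page is off limits; pick a different URL.")
}
```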

On the other hand, we are allowed to scrape https://scrapethissite.com/pages/simple, because its robots.txt file contains only these rules:

```
User-agent: *
Disallow: /lessons/
Disallow: /faq/
```

Neither disallowed path covers /pages/simple, so when we bow() to this URL the session reports that scraping is allowed:

bow("https://scrapethissite.com/pages/simple")
## <polite session> https://scrapethissite.com/pages/simple
##     User-agent: polite R package
##     robots.txt: 2 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent

And we can scrape it without error:

```
dat <-
  bow("https://scrapethissite.com/pages/simple") |>
  scrape()
```
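
The object returned by scrape() is the parsed page, ready for the extraction step that comes next. As a quick sanity check (assuming the page downloaded successfully and the rvest package is installed), we can peek at its class and its <title> element:

```
library(rvest)

# The scraped page behaves like any other parsed HTML document,
# so rvest selectors work on it directly (a preview of the next step).
class(dat)
dat |> html_element("title")
```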