6 Extract Data
Now that we have our webpage downloaded, it is time to extract the data we want.
To do this, we need to:
1. Inspect elements
2. Write CSS selectors
3. Pass our selectors to html_elements()
4. Use a function to extract specific data from the elements returned by step 3
For step 4, we have already seen html_text2(), which pulls the text content from an element. Two others we will find useful are html_attr(), which gets attribute values, and html_table(), which turns table elements into dataframes.
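As a minimal sketch of how the three extraction functions differ, the example below builds a tiny page in memory with rvest's minimal_html(), so no website is contacted. The page content, the id title, and the class note are made up for illustration.

```r
library(rvest)

# A small hypothetical page built from a string (no request is made)
page <- minimal_html('
  <h1 id="title">Countries</h1>
  <p class="note">See the <a href="/pages/">other pages</a>.</p>
  <table>
    <tr><th>Country</th><th>Capital</th></tr>
    <tr><td>Andorra</td><td>Andorra la Vella</td></tr>
    <tr><td>Austria</td><td>Vienna</td></tr>
  </table>
')

# html_text2(): the text content of the selected element
title_text <- page |> html_elements("#title") |> html_text2()

# html_attr(): the value of a named attribute
link_href <- page |> html_elements("p.note a") |> html_attr("href")

# html_table(): one dataframe per selected table element, returned in a list
tables <- page |> html_elements("table") |> html_table()

title_text
link_href
tables
```

The first two steps of the workflow (inspecting elements and writing selectors) happen in the browser; the code only covers steps 3 and 4.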
6.1 Attributes
Attribute values? Why would we want to get those? Recall that anchor (a) elements have an href attribute that contains a URL. If we have a page with many links, we can select those elements and then use html_attr() to pull their URLs. After we have a vector of URLs, we can scrape those pages!
For example, look at http://www.scrapethissite.com/pages/simple/. This page has links to other pages on the site, and one external link to GitHub. We can extract the links by first getting elements with an href attribute and then pulling the value of those attributes. html_attr() just takes the name of the attribute in quotes, so there is no need to use the square brackets we use for CSS selectors ([href]).
dat <-
  bow("http://www.scrapethissite.com/pages/simple/") |>
  scrape()
dat |>
  html_elements("[href]") |>
  html_attr("href")
## [1] "/static/images/scraper-icon.png"
## [2] "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css"
## [3] "https://fonts.googleapis.com/css?family=Lato:400,700"
## [4] "/static/css/styles.css"
## [5] "https://lipis.github.io/flag-icon-css/css/flag-icon.css"
## [6] "/"
## [7] "/pages/"
## [8] "/lessons/"
## [9] "/faq/"
## [10] "/login/"
## [11] "/lessons/"
## [12] "http://peric.github.io/GetCountries/"
## [13] "https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotify.core.min.css"
Some of the href attributes in this page point to an icon and to external font and CSS documents, which we did not see as links when viewing the page in the browser. These style links were within link elements, which also have an href attribute. To leave these out and just keep our anchor elements, we can combine two CSS selectors to get anchor elements (a) with href attributes: a[href].
dat |>
  html_elements("a[href]") |>
  html_attr("href")
## [1] "/"
## [2] "/pages/"
## [3] "/lessons/"
## [4] "/faq/"
## [5] "/login/"
## [6] "/lessons/"
## [7] "http://peric.github.io/GetCountries/"
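The difference between the two selectors can be checked on a small made-up document, parsed from a string so that nothing is downloaded. The sketch below also shows why the attribute selector matters: html_attr() returns NA for a selected element that lacks the attribute. The document content here is hypothetical.

```r
library(rvest)

# Hypothetical document with a stylesheet link, two links, and a bare anchor
page <- read_html('
  <html>
    <head><link rel="stylesheet" href="/static/css/styles.css"></head>
    <body>
      <a href="/pages/">Pages</a>
      <a name="top">An anchor without an href</a>
      <a href="http://example.com/">External</a>
    </body>
  </html>
')

# Any element with an href attribute, including the link element in the head
all_hrefs <- page |> html_elements("[href]") |> html_attr("href")

# Only anchor elements that have an href attribute
a_hrefs <- page |> html_elements("a[href]") |> html_attr("href")

# All anchor elements: the one without an href yields NA
a_all <- page |> html_elements("a") |> html_attr("href")

all_hrefs
a_hrefs
a_all
```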
Some links start with “http://” or “https://”, and others with “/”. Links starting with “http://” or “https://” are absolute URLs, so we can navigate to them as they are. Links starting with a slash are called root-relative paths because they are resolved from the root directory of the website. The root directory is typically a URL with nothing after the .com (.org, .edu, etc.), and it is also where we find the robots.txt file. In this example, the root directory is http://www.scrapethissite.com/. To get to “/pages/”, we just prefix it with the root directory, leaving off the root's trailing slash so we do not end up with a double slash: http://www.scrapethissite.com/pages/.
We can write a few lines of code to return full URLs by prefixing the link with the root directory if the first character is a slash.
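urls <-
  dat |>
  html_elements("a[href]") |>
  html_attr("href")
# Prefix root-relative paths with the root directory; keep absolute URLs as they are
urls_full <-
  ifelse(substr(urls, 1, 1) == "/",
         paste0("http://www.scrapethissite.com", urls),
         urls)
urls_full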
## [1] "http://www.scrapethissite.com/"
## [2] "http://www.scrapethissite.com/pages/"
## [3] "http://www.scrapethissite.com/lessons/"
## [4] "http://www.scrapethissite.com/faq/"
## [5] "http://www.scrapethissite.com/login/"
## [6] "http://www.scrapethissite.com/lessons/"
## [7] "http://peric.github.io/GetCountries/"
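An alternative to building the URLs by hand is url_absolute() from the xml2 package, which resolves links against a base URL. This is a sketch of a different technique from the ifelse() approach used in this section, not a replacement for it; the links vector below is a hand-typed subset of the hrefs shown earlier, and the base is the address of the page they came from.

```r
library(xml2)

# A few of the href values from the page, typed in by hand
links <- c("/", "/pages/", "http://peric.github.io/GetCountries/")

# The page the links were found on serves as the base URL
base <- "http://www.scrapethissite.com/pages/simple/"

# Root-relative paths are resolved against the base; absolute URLs pass through
links_full <- url_absolute(links, base)
links_full
```

One reason to consider this approach is that it does not require hard-coding the root directory separately from the page address.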
If we had a reason to do so, we could scrape these URLs. In this example, however, we cannot scrape all of these links since some are disallowed by robots.txt:
map(urls_full, bow)
## [[1]]
## <polite session> http://www.scrapethissite.com/
## User-agent: polite R package
## robots.txt: 2 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
## [[2]]
## <polite session> http://www.scrapethissite.com/pages/
## User-agent: polite R package
## robots.txt: 2 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
## [[3]]
## <polite session> http://www.scrapethissite.com/lessons/
## User-agent: polite R package
## robots.txt: 2 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is not scrapable for this user-agent
## [[4]]
## <polite session> http://www.scrapethissite.com/faq/
## User-agent: polite R package
## robots.txt: 2 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is not scrapable for this user-agent
## [[5]]
## <polite session> http://www.scrapethissite.com/login/
## User-agent: polite R package
## robots.txt: 2 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
## [[6]]
## <polite session> http://www.scrapethissite.com/lessons/
## User-agent: polite R package
## robots.txt: 2 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is not scrapable for this user-agent
## [[7]]
## <polite session> http://peric.github.io/GetCountries/
## User-agent: polite R package
## robots.txt: 1 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
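To see how a path gets marked as not scrapable, the sketch below checks a few paths against a set of robots.txt rules using the robotstxt package. The rules text is a hypothetical stand-in written to be consistent with the output above (two rules for one bot); it is not copied from the site, and supplying it as text means nothing is downloaded.

```r
library(robotstxt)

# Hypothetical rules, chosen to match the pattern in the printed sessions
rules <- "User-agent: *\nDisallow: /lessons/\nDisallow: /faq/"

# Parse the rules from text instead of fetching them from a domain
rt <- robotstxt(text = rules)

# TRUE means the path is allowed for the given bot
paths <- c("/pages/", "/lessons/", "/faq/", "/login/")
allowed <- rt$check(paths = paths, bot = "*")

# Keep only the paths we are permitted to scrape
paths[allowed]
```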
6.2 Tables
Tables are another element we may want to scrape. html_table() returns a list of dataframes made from the table elements in a webpage.
https://scrapethissite.com/pages/forms/ has a table element, and we can scrape it with the code below. We do not need to use html_elements() with a CSS selector before html_table(), but we might want to in some situations where we only want one of several tables.
page_with_table <-
  bow("https://scrapethissite.com/pages/forms/") |>
  scrape()
myTable <-
  page_with_table |>
  html_table()
myTable
## [[1]]
## # A tibble: 25 × 9
## `Team Name` Year Wins Losses `OT Losses` `Win %` `Goals For (GF)`
## <chr> <int> <int> <int> <lgl> <dbl> <int>
## 1 Boston Bruins 1990 44 24 NA 0.55 299
## 2 Buffalo Sabres 1990 31 30 NA 0.388 292
## 3 Calgary Flames 1990 46 26 NA 0.575 344
## 4 Chicago Blackhawks 1990 49 23 NA 0.613 284
## 5 Detroit Red Wings 1990 34 38 NA 0.425 273
## 6 Edmonton Oilers 1990 37 37 NA 0.463 272
## 7 Hartford Whalers 1990 31 38 NA 0.388 238
## 8 Los Angeles Kings 1990 46 24 NA 0.575 340
## 9 Minnesota North Stars 1990 27 39 NA 0.338 256
## 10 Montreal Canadiens 1990 39 30 NA 0.487 273
## # ℹ 15 more rows
## # ℹ 2 more variables: `Goals Against (GA)` <int>, `+ / -` <int>
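For the case where a page holds several tables and we want only one, a CSS selector can narrow the selection before html_table(). The sketch below uses a made-up page with two small tables; the ids wins and losses are hypothetical, and the values are taken from the first row of the output above.

```r
library(rvest)

# Hypothetical page with two tables, built in memory
page <- minimal_html('
  <table id="wins">
    <tr><th>Team Name</th><th>Wins</th></tr>
    <tr><td>Boston Bruins</td><td>44</td></tr>
  </table>
  <table id="losses">
    <tr><th>Team Name</th><th>Losses</th></tr>
    <tr><td>Boston Bruins</td><td>24</td></tr>
  </table>
')

# Without a selector, every table on the page is returned in a list
all_tables <- page |> html_table()

# Selecting a single element first returns just that table as a dataframe
wins <- page |> html_element("#wins") |> html_table()

length(all_tables)
wins
```

Note the use of html_element() (singular) here: it returns one element, so html_table() returns one dataframe rather than a list.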
Even though we only scraped one table, it is in a list. If our list has only one table, we can extract it with myTable[[1]]. If we have several tables that share column names, we can combine them with bind_rows().
library(dplyr)
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## group_rows
## The following objects are masked from 'package:stats':
## filter, lag
## The following objects are masked from 'package:base':
## intersect, setdiff, setequal, union
df_list <-
  list(mtcars[1:2, ],
       mtcars[3:4, ],
       mtcars[5:6, ])
df_list
## [[1]]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
## [[2]]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## [[3]]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
bind_rows(df_list)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
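The same idea applies to tables that come out of html_table(). As a rough sketch, the made-up page below splits three rows of the hockey data shown earlier across two tables with identical column names; html_table() returns them as a list, and bind_rows() stacks them into one dataframe.

```r
library(rvest)
library(dplyr)

# Hypothetical page: two tables that share column names
page <- minimal_html('
  <table>
    <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
    <tr><td>Boston Bruins</td><td>1990</td><td>44</td></tr>
    <tr><td>Buffalo Sabres</td><td>1990</td><td>31</td></tr>
  </table>
  <table>
    <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
    <tr><td>Calgary Flames</td><td>1990</td><td>46</td></tr>
  </table>
')

# A list with one dataframe per table
tables <- page |> html_table()

# A single table can still be pulled out by position
first_table <- tables[[1]]

# Stack all tables; columns are matched by name
combined <- bind_rows(tables)
combined
```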