7 Wrangle
The data we scrape is bound to be a mess, and we need to wrangle it into a usable format, which usually means a dataframe. To learn more about data wrangling, see our materials online and enroll in our training.
As a simple example, we can take the data from https://scrapethissite.com/pages/simple/ and turn it into a dataframe with one row per country, containing the country name, capital, population, and area. Note that the approach below only works if every country has exactly one of each of the four values. Each column is scraped separately, so a missing or extra value would shift everything after it out of line with its country, and columns of unequal length will usually make data.frame() fail.
library(polite) # bow() and scrape()
library(rvest)  # html_elements() and html_text2()

dat <-
  bow("https://scrapethissite.com/pages/simple/") |>
  scrape()
Country <- dat |> html_elements(".country-name") |> html_text2()
Capital <- dat |> html_elements(".country-capital") |> html_text2()
Population <- dat |> html_elements(".country-population") |> html_text2()
Area <- dat |> html_elements(".country-area") |> html_text2()
dat_clean <-
  data.frame(Country = Country,
             Capital = Capital,
             Population = as.numeric(Population),
             Area = as.numeric(Area))
head(dat_clean)
##                Country          Capital Population   Area
## 1              Andorra Andorra la Vella      84000    468
## 2 United Arab Emirates        Abu Dhabi    4975593  82880
## 3          Afghanistan            Kabul   29121286 647500
## 4  Antigua and Barbuda       St. John's      86754    443
## 5             Anguilla       The Valley      13254    102
## 6              Albania           Tirana    2986952  28748
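If some countries might be missing a value, one way around the alignment problem is to first select one node per country and then pull each field from inside that node with html_element() (singular), which returns a missing result rather than skipping the gap. The sketch below is a minimal illustration, not the only approach. It runs on a small hand-written HTML fragment rather than the live page, "Exampleland" is a made-up country with no capital, the helper name field() is hypothetical, and it assumes each country's values sit inside a parent element with class country.

```r
library(rvest)

# Hand-written fragment for illustration; "Exampleland" is made up
# and deliberately has no .country-capital element.
page <- minimal_html('
  <div class="country">
    <h3 class="country-name">Andorra</h3>
    <span class="country-capital">Andorra la Vella</span>
    <span class="country-population">84000</span>
    <span class="country-area">468.0</span>
  </div>
  <div class="country">
    <h3 class="country-name">Exampleland</h3>
    <span class="country-population">10</span>
    <span class="country-area">1.5</span>
  </div>
  <div class="country">
    <h3 class="country-name">Albania</h3>
    <span class="country-capital">Tirana</span>
    <span class="country-population">2986952</span>
    <span class="country-area">28748.0</span>
  </div>
')

# Column-wise extraction loses the gap: 3 names but only 2 capitals.
length(html_elements(page, ".country-name"))
length(html_elements(page, ".country-capital"))

# Row-wise extraction: one node per country (assumed wrapper class).
countries <- html_elements(page, ".country")

# Hypothetical helper: text of the first match inside each country node,
# or NA where the node has no match.
field <- function(nodes, css) html_text2(html_element(nodes, css))

dat_safe <- data.frame(
  Country    = field(countries, ".country-name"),
  Capital    = field(countries, ".country-capital"),
  Population = as.numeric(field(countries, ".country-population")),
  Area       = as.numeric(field(countries, ".country-area"))
)
dat_safe
```

Because html_element() returns exactly one result per input node, every column has one entry per country, and the missing capital shows up as NA in its own row instead of shifting the rows below it.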