7 Wrangle

Scraped data is rarely tidy, so we need to wrangle it into a usable format, which usually means a dataframe. To learn more about data wrangling, see our online materials or enroll in our training.

As a simple example, we can take the data from https://scrapethissite.com/pages/simple/ and turn it into a dataframe with one row per country, containing the country name, capital, population, and area. Note that the approach below only works if every country has exactly one of each of the four values. Each column is extracted independently, so a missing or extra value would shift the rest of that column out of alignment with the other columns, or leave the vectors with different lengths, which typically makes data.frame() fail.

library(polite)  # bow(), scrape()
library(rvest)   # html_elements(), html_text2()

dat <-
  bow("https://scrapethissite.com/pages/simple/") |>
  scrape()

Country <- dat |> html_elements(".country-name") |> html_text2()
Capital <- dat |> html_elements(".country-capital") |> html_text2()
Population <- dat |> html_elements(".country-population") |> html_text2()
Area <- dat |> html_elements(".country-area") |> html_text2()

dat_clean <-
  data.frame(Country = Country,
             Capital = Capital,
             Population = as.numeric(Population),
             Area = as.numeric(Area))

head(dat_clean)
##                Country          Capital Population   Area
## 1              Andorra Andorra la Vella      84000    468
## 2 United Arab Emirates        Abu Dhabi    4975593  82880
## 3          Afghanistan            Kabul   29121286 647500
## 4  Antigua and Barbuda       St. John's      86754    443
## 5             Anguilla       The Valley      13254    102
## 6              Albania           Tirana    2986952  28748
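
If some countries might be missing a value, one way to keep the columns aligned is to select each country's container first and then pull one value per container with html_element() (singular), which returns a missing node, and therefore NA, when nothing matches. The following is a minimal sketch rather than a definitive implementation. It assumes each country sits inside an element with class "country", and it uses a small hypothetical HTML snippet (including the made-up country "Exampleland") instead of the live page so that it runs without a network connection.

```r
library(rvest)

# Hypothetical HTML that mimics the assumed page structure.
# The second country has no capital element.
page <- minimal_html('
  <div class="country">
    <h3 class="country-name">Andorra</h3>
    <span class="country-capital">Andorra la Vella</span>
    <span class="country-population">84000</span>
    <span class="country-area">468.0</span>
  </div>
  <div class="country">
    <h3 class="country-name">Exampleland</h3>
    <span class="country-population">1000</span>
    <span class="country-area">10.0</span>
  </div>
')

# One node per country (the ".country" container class is an assumption)
countries <- page |> html_elements(".country")

# html_element() returns exactly one result per country,
# so every column has the same length and missing values become NA
dat_safe <-
  data.frame(Country = countries |> html_element(".country-name") |> html_text2(),
             Capital = countries |> html_element(".country-capital") |> html_text2(),
             Population = countries |> html_element(".country-population") |> html_text2() |> as.numeric(),
             Area = countries |> html_element(".country-area") |> html_text2() |> as.numeric())

dat_safe
```

With the real page, the same pattern would start from dat |> html_elements(".country"), provided that container class matches the site's markup.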