3 CSS

We’ll use what are called “CSS selectors” to extract elements from a webpage. CSS stands for Cascading Style Sheets, and it takes the form of documents with rules that determine how a webpage is displayed. For example, the code below dictates that all h1 elements (level-one headers) should be blue and bold.

h1 {
  color: blue;
  font-weight: bold;
}

CSS allows us to quickly and easily alter the appearance of a webpage just by changing one or two lines in our style sheet. If we change color: blue in the code above to color: red, all h1 elements will now appear red. Other common uses of CSS include specifying the width of the margins, as well as the colors of links before and after being clicked. Have you ever noticed that a page appears differently on your phone versus your computer, or that many links appear blue before you click them but purple afterward? That’s CSS!

When web scraping, we are not interested in styling our webpages. Instead, we are most interested in the first line of the CSS code above: h1. This is an element selector, which finds all h1 elements and applies the following styles to them.

We can use CSS selectors to scrape information from webpages. If we want to pull out all of the h1 elements from a webpage, perhaps because they contain the information we want for our research, we can do that. We can also extract elements by other element components, such as class, id, or attribute, or by the context in which they appear: which elements they are nested within or follow.
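As a minimal sketch of the idea above, the code below builds a tiny page with rvest's minimal_html() and pulls out elements by name and by class. The page content and the class name "note" are made up for illustration and do not come from any real website.

```r
library(rvest)

# A small made-up page; the headings and the class name are illustrative only
page <- minimal_html('
  <h1>Results</h1>
  <p class="note">First paragraph.</p>
  <h1>Methods</h1>
  <p>Second paragraph.</p>
')

# Element selector: every h1 element
headers <- page |> html_elements("h1") |> html_text2()
headers
## [1] "Results" "Methods"

# Class selector: every element with class="note"
notes <- page |> html_elements(".note") |> html_text2()
notes
## [1] "First paragraph."
```

The same two-step pattern, html_elements() followed by html_text2(), is used throughout the rest of this chapter.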

3.1 Inspecting Elements

To get the information necessary for writing CSS selectors, we need to first view the page we want to scrape in our browser with the “Inspector”, which allows us to explore the HTML source code for the webpage.

How we open the Inspector varies by browser, and instructions for four major browsers are below.

| Browser | Enable Developer Mode | Open Inspector | Element Selection Button |
|---|---|---|---|
| Chrome | Not applicable. | Right-click on the page and select Inspect. | Click on the Select an Element button. |
| Edge | Not applicable. | Right-click on the page and select Inspect. | Click on the Select an Element button. |
| Firefox | Not applicable. | Right-click on the page and select Inspect. | Click on the Pick an Element button. |
| Safari | Click on Safari in the menu bar. Click on Preferences. Click on the Advanced tab. Check “Show Develop menu in menu bar.” | Right-click on the page and select Inspect Element. | Click on the Start Element Selection button. |

Opening the Inspector and inspecting elements takes four steps. The “Open Inspector” and “Element Selection Button” columns in the table above correspond to Steps 1 and 2 below. Steps 3 and 4 show how we can take a closer look at elements in a webpage. These images were produced with Chrome on a Mac, and although each OS-browser combination will look slightly different, the steps remain the same.

Step 1. Right-click on the page and select Inspect. This opens up a split-window mode with our webpage and the Inspector.

[Figure: right-click and select Inspect]

Step 2. Click on the Select an Element button.

[Figure: click on the Select an Element button]

Step 3. Hover over an element on the webpage. A CSS selector for that element can be seen in the popup, but it does not necessarily uniquely identify the element.

[Figure: hover over an element on the webpage]

Step 4. Click on an element to find the HTML code producing that element. In addition to seeing information about the element, we also see important contextual information that can help us when writing CSS selectors.

[Figure: browse the HTML in the Inspector]
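To illustrate the caveat in Step 3, here is a small sketch of a selector like the one the Inspector popup might show matching more than one element. The HTML below is a made-up fragment that borrows class names used later in this chapter; it is an assumption for illustration, not the actual source of any page.

```r
library(rvest)

# Hypothetical fragment with two similarly structured blocks
page <- minimal_html('
  <div class="country">
    <h3 class="country-name">Bulgaria</h3>
    <span class="country-capital">Sofia</span>
  </div>
  <div class="country">
    <h3 class="country-name">Bahrain</h3>
    <span class="country-capital">Manama</span>
  </div>
')

# The kind of selector shown when hovering over the first capital
matches <- page |> html_elements("span.country-capital")

# The selector matches both capitals, not only the one we hovered over
length(matches)
## [1] 2
```

Checking how many elements a selector returns is a quick way to see whether it is specific enough for the task at hand.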

3.2 CSS Selectors

Now that we have an understanding of the basics of HTML elements and their relationships, and how to look up that information for a particular element, we are ready to select them with CSS.

CSS selectors provide a way to extract (filter) HTML elements by their element name, class, id, and other attributes, as well as by the relationships they share with other elements. Do you want all h1 elements? We can do that. Do you want all elements where class="special"? We can do that. Do you want all h3 elements that are the sibling immediately following p elements that are the children of h1 elements with class="special"? Yes, we can do that too.
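A compound selector of the kind described above can be sketched as follows. This example uses a div rather than an h1 as the outer element so that the nesting is valid HTML, and the page itself is made up for illustration.

```r
library(rvest)

# Made-up page: only the first h3 follows a p inside a div with class="special"
page <- minimal_html('
  <div class="special">
    <p>Intro</p>
    <h3>Selected</h3>
    <h3>Skipped: follows an h3, not a p</h3>
  </div>
  <div>
    <p>Intro</p>
    <h3>Skipped: parent lacks the class</h3>
  </div>
')

# h3 elements immediately following p elements
# that are children of div elements with class="special"
picked <- page |> html_elements("div.special > p + h3") |> html_text2()
picked
## [1] "Selected"
```

Reading the selector from right to left (an h3, right after a p, that is a child of div.special) often makes long selectors easier to follow.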

For now, we will use CSS selectors to get elements and then extract their content (text). Later in Extract Data, we will use other functions to extract attribute values and tables.

The table below contains a few of the CSS selectors you may find useful when web scraping.

The “Selection Criteria” column contains the different ways we can select elements: by their components or relationships, and by combining these with logical operators (and, or, not). The “Operator” column gives the symbol (if any) we need to use, and the “Example” column has example usage. The final column, “Result of html_text2()”, gives the output of taking the values in “Example” and substituting them for “SELECTOR” in this code, where countries is the webpage we used in Relationships Between Elements:

countries |> html_elements("SELECTOR") |> html_text2()

For example, here is the example from the first line:

countries |> html_elements("h3") |> html_text2()
## [1] "Country Data" "Bulgaria"     "Bahrain"

The output matches what we see in the final column, except without quotes.

| Selection Criteria | Operator | Example | Result of html_text2() |
|---|---|---|---|
| Element Type | (none) | h3 | Country Data, Bulgaria, Bahrain |
| Class | . | .country-capital | Sofia, Manama |
| ID | # | #footer | Adapted from Scrape This Site |
| Attribute | [] | [href] | Scrape This Site |
| Attribute Starting with Value | [ ^=' '] | [href^='https'] | Scrape This Site |
| Attribute Ending with Value | [ $=' '] | [target$='ank'] | Scrape This Site |
| Attribute Containing Value | [ *=' '] | [class*='attrib'] | Scrape This Site |
| Anything/Everything | * | * | (Everything) |
| Descendants | (space) | div h3 | Bulgaria, Bahrain |
| Children | > | div>.country-area | 110910.0, 665.0 |
| Subsequent Sibling (After) | ~ | span~span | 7148785, 110910.0, 738004, 665.0 |
| Next Sibling (Immediately After) | + | br+strong | Population:, Area (km2):, Population:, Area (km2): |
| Logical AND | (no space) | h3.country-name | Bulgaria, Bahrain |
| Logical OR | , | h3 , .country-area | Country Data, Bulgaria, 110910.0, Bahrain, 665.0 |
| Logical NOT | :not() | span:not(.country-area) | Sofia, 7148785, Manama, 738004 |
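The three attribute-value operators are easy to mix up, so here is a small sketch that applies each one to a made-up pair of links. The URLs, class names, and link text are assumptions for illustration only.

```r
library(rvest)

# Two hypothetical links; only the first has a target attribute
page <- minimal_html('
  <a href="https://example.com/a" target="_blank" class="attribution">Secure link</a>
  <a href="http://example.com/b" class="plain">Plain link</a>
')

# ^= : attribute value starts with the given text
starts_with <- page |> html_elements("[href^='https']") |> html_text2()

# $= : attribute value ends with the given text
ends_with <- page |> html_elements("[target$='ank']") |> html_text2()

# *= : attribute value contains the given text anywhere
contains <- page |> html_elements("[class*='attrib']") |> html_text2()

starts_with
## [1] "Secure link"
```

In this sketch all three selectors return only the first link. Note that the value inside the brackets is wrapped in straight single quotes so that it can sit inside the double-quoted R string.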

Note that we can select descendants, children, and later siblings, but not ancestors, parents, or previous siblings. This is because HTML documents are processed from top to bottom. With the selectors in this table, it is not possible to go the other way, either up or backward in our HTML tree. We can find an element and pick its children, but we cannot find those children and then reverse course to pick their parent.
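The one-directional behavior of the sibling combinator can be seen in a short sketch. The fragment below is made up for illustration; it places an h3 before two span elements inside the same div.

```r
library(rvest)

# Hypothetical fragment: the h3 comes first, the spans come after it
page <- minimal_html('
  <div>
    <h3>Bulgaria</h3>
    <span>Sofia</span>
    <span>7148785</span>
  </div>
')

# Forward: span elements that come after an h3 sibling
after_h3 <- page |> html_elements("h3 ~ span") |> html_text2()
after_h3
## [1] "Sofia"   "7148785"

# Backward: h3 elements that come after a span sibling (there are none,
# because ~ only looks at later siblings, never earlier ones)
before_span <- page |> html_elements("span ~ h3") |> html_text2()
length(before_span)
## [1] 0
```

Swapping the two sides of the combinator does not reverse the direction of the search; it only changes which element must come first.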

This table is far from exhaustive. For more on CSS selectors, see this extensive table of CSS selectors and this technical report on CSS selectors.

3.3 Exercises

  1. Play this game to master the basics of CSS selectors.

Use the countries example page for the following questions, which you can download with this code:

library(polite)
library(rvest)

countries <-
  bow("https://sscc.wisc.edu/sscc/pubs/data/webscraping-r/countries_excerpt.html") |> 
  scrape()

  1. Select all div elements.

  2. Select all elements where class="country-population".

  3. Select all elements with a class attribute.

  4. Select all elements whose class attribute starts with “c”.

  5. Select all elements whose class attribute ends with “area”.

  6. Select all elements whose class attribute contains “p”.

  7. Select all elements that are descendants of a section element.

  8. Select all elements that are children of a body element.

  9. Select all elements that are subsequent siblings of an h3 element.

  10. Select all elements that are next siblings of an h3 element.

    • How does this compare to the output of the previous item?

3.3.1 Challenge Exercises

Continue using the countries example page.

  1. How would you get output of "2" "2"?

  2. We already saw two ways to get just the country names “Bulgaria” and “Bahrain”. Find at least two more ways to get them.

  3. Select anything (*) immediately after an h3 element.

  4. What will each of the following return? Look at the HTML code above and try to figure it out by yourself before running the code.

    • countries |> html_elements(".country-capital+*") |> html_text2()
    • countries |> html_elements(".country-capital~*:not(br)") |> html_text2()
    • countries |> html_elements("br+*") |> html_text2()
    • countries |> html_elements("[target]") |> html_text2()
    • countries |> html_elements("h3>*") |> html_text2()
    • countries |> html_elements(".row") |> html_text2()
    • countries |> html_elements("[class*='y']") |> html_text2()
    • countries |> html_elements("span[class*='y']") |> html_text2()