3 CSS
We’ll use what are called “CSS selectors” to extract elements from a webpage. CSS stands for Cascading Style Sheets, and it takes the form of documents with rules that determine how a webpage is displayed. For example, the code below dictates that all h1
elements (level-one headers) should be blue and bold.
h1 {
color: blue;
font-weight: bold;
}
CSS allows us to quickly and easily alter the appearance of a webpage just by changing one or two lines in our style sheet. If we change color: blue
in the code above to color: red
, all h1
elements will now appear red. Other common uses of CSS include specifying the width of the margins, as well as the colors of links before and after being clicked. Have you ever noticed that a page appears differently on your phone versus your computer, or that many links appear blue before you click them but purple afterward? That’s CSS!
When web scraping, we are not interested in styling our webpages. Instead, we are most interested in the first line of the CSS code above: h1
. This is an element selector, which finds all h1
elements and applies the following styles to them.
We can use CSS selectors to scrape information from webpages. If we want to pull out all of the h1
elements from a webpage, perhaps because they contain the information we want for our research, we can do that. We can also extract elements by other element components, such as class, id, or attribute, or by the context in which they appear: which elements they are nested within or follow.
3.1 Inspecting Elements
To get the information necessary for writing CSS selectors, we need to first view the page we want to scrape in our browser with the “Inspector”, which allows us to explore the HTML source code for the webpage.
How we open the Inspector varies by browser, and instructions for four major browsers are below.
Browser | Enable Developer Mode | Open Inspector | Element Selection Button |
---|---|---|---|
Chrome | Not applicable. |
Right-click on the page and select Inspect. Click on the Select an Element button. |
|
Edge | Not applicable. |
Right-click on the page and select Inspect. Click on the Select an Element button. |
|
Firefox | Not applicable. |
Right-click on the page and select Inspect. Click on the Pick an Element button. |
|
Safari |
Click on Safari in the menu bar. Click on Preferences. Click on the Advanced tab. Check “Show Develop menu in menu bar.” |
Right-click on the page and select Inspect Element. Click on the Start Element Selection button. |
Opening the Inspector and inspecting elements is accomplished in four steps. The “Open Inspector” column in the table above corresponds to Steps 1 and 2 below. Steps 3 and 4 show how we can take a closer look at elements in a webpage. These images were produced with Chrome on a Mac, and although each OS-browser combination will appear slightly differently, the steps remain the same.
Step 1. Right-click on the page and select Inspect. This opens up a split-window mode with our webpage and the Inspector.
Step 2. Click on the Select an Element button.
Step 3. Hover over an element on the webpage. A CSS selector for that element can be seen in the popup, but it does not necessarily uniquely identify the element.
Step 4. Click on an element to find the HTML code producing that element. In addition to seeing information about the element, we also see important contextual information that can help us when writing CSS selectors.
3.2 CSS Selectors
Now that we have an understanding of the basics of HTML elements and their relationships, and how to look up that information for a particular element, we are ready to select them with CSS.
CSS selectors provide a way to extract (filter) HTML elements by their element name, class, id, and other attributes, as well as by the relationships they share with other elements. Do you want all h1
elements? We can do that. Do you want all elements where class="special"
? We can do that. Do you want all h3
elements that are the sibling immediately following p
elements that are the children of h1
elements with class="special"
? Yes, we can do that too.
For now, we will use CSS selectors to get elements and then extract their content (text). Later in Extract Data, we will use other functions to attribute values and tables.
The table below contains a few of the CSS selectors you may find useful when web scraping.
The “Selection Criteria” column contains the different ways we can select elements: by their components or relationships, and by combining these with logical operators (and, or, not). The “Operator” column gives the symbol (if any) we need to use, and the “Example” column has example usage. The final column, “Result of html_text2()”, gives the output of taking the values in “Example” and substituting them for “SELECTOR” in this code, where countries
is the webpage we used in Relationships Between Elements:
countries |> html_elements("SELECTOR") |> html_text2()
For example, here is the example from the first line:
countries |> html_elements("h3") |> html_text2()
## [1] "Country Data" "Bulgaria" "Bahrain"
The output matches what we see in the final column, except without quotes.
Selection Criteria | Operator | Example | Result of html_text2() |
---|---|---|---|
Element Type | (none) | h3 | Country Data, Bulgaria , Bahrain |
Class | . | .country-capital | Sofia , Manama |
ID | # | #footer | Adapted from Scrape This Site |
Attribute | [] | [href] | Scrape This Site |
Attribute Starting with Value | [ ^=’ ’] | [href^=‘https’] | Scrape This Site |
Attribute Ending with Value | [ \(=' '] </td> <td style="text-align:left;"> [target\)=‘ank’] | Scrape This Site | |
Attribute Containing Value | [ *=’ ’] | [class*=‘attrib’] | Scrape This Site |
Anything/Everything | * | * | (Everything) |
Descendants | (space) | div h3 | Bulgaria, Bahrain |
Children | > | div>.country-area | 110910.0, 665.0 |
Subsequent Sibling (After) | ~ | span~span | 7148785 , 110910.0, 738004 , 665.0 |
Next Sibling (Immediately After) | + | br+strong | Population:, Area (km2):, Population:, Area (km2): |
Logical AND | (no space) | h3.country-name | Bulgaria, Bahrain |
Logical OR | , | h3 , .country-area | Country Data, Bulgaria , 110910.0 , Bahrain , 665.0 |
Logical NOT | :not() | span:not(.country-area) | Sofia , 7148785, Manama , 738004 |
Note that we can select descendants, children, later siblings, but not ancestors, parents, or previous siblings. This is because HTML documents are processed from top to bottom. It is not possible to go the other way, either up or backward in our HTML tree. We can find an element and pick its children, but we cannot find those children and then reverse course to pick their parent.
This table is far from exhaustive. For more on CSS selectors, see this extensive table of CSS selectors and this technical report on CSS selectors.
3.3 Exercises
- Play this game to master the basics of CSS selectors.
Use the countries
example page for the following questions, which you can download with this code:
library(polite)
library(rvest)
countries <-
bow("https://sscc.wisc.edu/sscc/pubs/data/webscraping-r/countries_excerpt.html") |>
scrape()
Select all
div
elements.Select all elements where
class="country-population"
.Select all elements with a
class
attribute.Select all elements whose
class
attribute starts with “c”.Select all elements whose
class
attribute ends with “area”.Select all elements whose
class
attribute contains “p”.Select all elements that are descendants of a
section
element.Select all elements that are children of a
body
element.Select all elements that are subsequent siblings of an
h3
element.Select all elements that are next siblings of an
h3
element.- How does this compare to the output of the previous item?
3.3.1 Challenge Exercises
Continue using the countries
example page.
How would you get output of
"2" "2"
?We already saw two ways to get just the country names “Bulgaria” and “Bahrain”. Find at least two more ways to get them.
Select anything (
*
) immediately after anh3
element.What will each of the following return? Look at the HTML code above and try to figure it out by yourself before running the code.
countries |> html_elements(".country-capital+*") |> html_text2()
countries |> html_elements(".country-capital~*:not(br)") |> html_text2()
countries |> html_elements("br+*") |> html_text2()
countries |> html_elements("[target]") |> html_text2()
countries |> html_elements("h3>*") |> html_text2()
countries |> html_elements(".row") |> html_text2()
countries |> html_elements("[class*='y']") |> html_text2()
countries |> html_elements("span[class*='y']") |> html_text2()