2 HTML

HTML (HyperText Markup Language) is the markup language used to create webpages. When we copy and paste text from a webpage, we are interacting with the result of the HTML code, which tells our web browsers how to display the content. When we scrape, we download the HTML code itself, so we need to understand its syntax in order to extract the information we want. Webpages are designed for the user experience, not for data collection, so the underlying code can get messy.

Take a look at this page: http://www.scrapethissite.com/pages/simple/

Imagine we want to scrape all of the country names from the page. Which pieces of text are the country names? How can we tell? We (hopefully) recognize most of the country names from background knowledge, and from there we can figure out that the country names are the first entries in each block, next to the flag icons.

If we want a computer to scrape this, though, how could it tell which pieces of text are country names, since it does not have the background knowledge?

A computer sees the HTML code, not the rendered output that we are looking at, so we need to look at the code behind the page.

2.1 Elements

As a very simple example of an HTML element, take this code:

<a href="https://sscc.wisc.edu/" class="my-link-class"> Click here </a>

which produces this link:

Click here

This line of code consists of one element. Elements are the fundamental unit in HTML, and elements are composed of several pieces:

  • Element names, which determine the output of the code, such as a link (a), a paragraph (p), or a level-four header (h4)

  • Opening tags and closing tags, which mark the start (<a ... >) and the end of an element (</a>)

  • Content, or text, which sits between the opening and closing tags, outside the angle brackets (<>), and is typically what we see displayed on the webpage

  • Attributes, which determine additional properties of the elements, such as the link destination (href) or an element’s class, which is used for styling pages

[Figure: components of an HTML element]
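To see these pieces the way a program sees them, here is a minimal sketch that feeds the link element above to Python's standard-library html.parser module. The class name ElementInspector and the labels it records are our own choices for illustration; they are not part of HTML or of any scraping library.

```python
from html.parser import HTMLParser


class ElementInspector(HTMLParser):
    """Record the pieces of each element the parser encounters."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # tag is the element name; attrs is a list of (name, value) pairs
        self.pieces.append(("opening tag", tag, dict(attrs)))

    def handle_data(self, data):
        # data is the content between tags; skip whitespace-only runs
        if data.strip():
            self.pieces.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.pieces.append(("closing tag", tag))


inspector = ElementInspector()
inspector.feed('<a href="https://sscc.wisc.edu/" class="my-link-class"> Click here </a>')
inspector.close()

for piece in inspector.pieces:
    print(piece)
```

The three printed tuples correspond to the opening tag (with the element name a and its two attributes), the content, and the closing tag.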

Common HTML elements include:

  • a, anchor, for links
  • body, the body of the document
  • br, line break
  • div and section, sections of the document
  • em, emphasized text, usually displayed in italics
  • h1, h2, h3, h4, h5, and h6, level-one through level-six headers
  • html, which defines the document as an HTML document
  • p, paragraph with line breaks before and after
  • span, inline text
  • strong, important text, usually displayed in bold
  • table, table

Any element can take class and id attributes, and different elements can take different additional attributes. (An id is a unique identifier for a single element, while a class can be shared by many elements.)
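Attributes such as class are what let a program tell one piece of text from another. The sketch below assumes a page that marks up its country names with a class called country-name, as the reduced page in the next section does. The fragment, the ClassTextCollector class, and its attribute names are hypothetical helpers written for this illustration, built only on Python's standard html.parser module.

```python
from html.parser import HTMLParser

# A small fragment modeled on the reduced page shown later in this chapter.
FRAGMENT = """
<h3>Country Data</h3>
<div class="col-md-4 country">
  <h3 class="country-name">Bulgaria</h3>
</div>
<div class="col-md-4 country">
  <h3 class="country-name">Bahrain</h3>
</div>
<section id="footer">Adapted from Scrape This Site</section>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given class.

    Simplification: capture stops at the first closing tag with the same
    name, so an element nested inside one of the same name is not handled.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.capturing = None  # name of the element being captured, if any
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # A class attribute holds a space-separated list of class names.
        classes = (dict(attrs).get("class") or "").split()
        if self.capturing is None and self.wanted_class in classes:
            self.capturing = tag
            self.texts.append("")

    def handle_data(self, data):
        if self.capturing is not None:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == self.capturing:
            self.capturing = None


collector = ClassTextCollector("country-name")
collector.feed(FRAGMENT)
collector.close()

country_names = [text.strip() for text in collector.texts]
print(country_names)
```

The plain h3 ("Country Data") is skipped because it has no class, and the div elements are skipped because country-name is not one of their space-separated class names, even though one of those names is country.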

2.2 Relationships Between Elements

Now that we understand the basic syntax of an HTML element, let’s look at how multiple elements come together to form a webpage.

Consider a much-reduced version of the webpage we viewed earlier, http://www.scrapethissite.com/pages/simple/.

The HTML code that produced this page is below.

<html>
  <body>

    <h3>
      Country Data
    </h3>

    <div class="col-md-4 country">
      <h3 class="country-name">
        Bulgaria
      </h3>
      <div class="country-info">
        <strong>Capital:</strong> <span class="country-capital">Sofia</span><br>
        <strong>Population:</strong> <span class="country-population">7148785</span><br>
        <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">110910.0</span><br>
      </div>
    </div>

    <div class="col-md-4 country">
      <h3 class="country-name">
        Bahrain
      </h3>
      <div class="country-info">
        <strong>Capital:</strong> <span class="country-capital">Manama</span><br>
        <strong>Population:</strong> <span class="country-population">738004</span><br>
        <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">665.0</span><br>
      </div>
    </div>

    <section id="footer">
        <div class="container">
            <div class="row">
                Adapted from
                <a href="https://scrapethissite.com/pages/simple/" class="data-attribution" target="_blank">Scrape This Site</a>
            </div>
        </div>
    </section>

  </body>
</html>

We see some familiar elements and attributes in the code above, and some new ones too. Something else we see in this HTML code is that elements can appear before or after other elements, and also within other elements. This becomes clearer if we rearrange the code above into a tree diagram. We will just look at the element names to save space.

[Figure: HTML document displayed as a tree]

Note that in the diagram the second div element contains all the same elements as the first div, but some elements have been abbreviated with ... for space reasons.
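The same tree can be approximated in text. The following sketch prints only the element names from the HTML code above, indenting each element one step further than the element that contains it. It uses Python's standard html.parser module. The OutlinePrinter class is a hypothetical helper for illustration, and it assumes the page is well formed, with br as the only element that has no closing tag.

```python
from html.parser import HTMLParser

PAGE = """
<html><body>
<h3>Country Data</h3>
<div class="col-md-4 country">
  <h3 class="country-name">Bulgaria</h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Sofia</span><br>
    <strong>Population:</strong> <span class="country-population">7148785</span><br>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">110910.0</span><br>
  </div>
</div>
<div class="col-md-4 country">
  <h3 class="country-name">Bahrain</h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Manama</span><br>
    <strong>Population:</strong> <span class="country-population">738004</span><br>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">665.0</span><br>
  </div>
</div>
<section id="footer"><div class="container"><div class="row">
  Adapted from
  <a href="https://scrapethissite.com/pages/simple/" class="data-attribution" target="_blank">Scrape This Site</a>
</div></div></section>
</body></html>
"""

VOID = {"br"}  # br has no closing tag, so it must not deepen the outline


class OutlinePrinter(HTMLParser):
    """Build an indented outline of element names."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID:
            self.depth += 1  # later elements are nested inside this one

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.depth -= 1  # this element is closed; step back out


outline = OutlinePrinter()
outline.feed(PAGE)
outline.close()
print("\n".join(outline.lines))
```

In the printed outline, elements at the same indentation under the same containing element appear one after another, and each deeper indentation marks an element that sits within the one above it.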

Elements can have one or more of several relationships with another element:

  • Ancestors are elements that contain other elements
    • html is the ancestor of all other elements
    • body is the ancestor of all other elements except html
    • body is the ancestor of three h3 elements
    • section is the ancestor of div, div, and a
  • Descendants are elements contained within other elements
    • All of the strong elements are descendants of div elements, and also of body
    • a is a descendant of section
  • Parents are direct ancestors of other elements: ancestors with no intervening generations
    • html is the parent of body
    • body is the parent of one h3 element, two div elements, and one section element
  • Children are direct descendants of other elements: descendants with no intervening generations
    • a is the child of div
    • Two h3 elements are the children of div elements, and one is the child of body
  • Siblings are elements that share the same parent
    • Within each country-info div, strong, span, br, strong, span, br, strong, span, and br are siblings
    • a has no siblings
    • Each h3 has at least one div sibling
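These relationships can also be computed. The sketch below builds a small tree with parent links from part of the page (the two country blocks are left out to keep it short) and then looks up a few of the relationships listed above. The Node and TreeBuilder classes are hypothetical helpers written for this illustration on top of Python's standard html.parser module, and they assume well-formed HTML.

```python
from html.parser import HTMLParser

# The reduced page without its two country blocks, to keep the example short.
FOOTER_PAGE = """
<html><body>
<h3>Country Data</h3>
<section id="footer"><div class="container"><div class="row">
  Adapted from
  <a href="https://scrapethissite.com/pages/simple/" class="data-attribution" target="_blank">Scrape This Site</a>
</div></div></section>
</body></html>
"""


class Node:
    """One element, linked to its parent and children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def ancestors(self):
        # Walk upward: parent, grandparent, and so on to the root.
        node, found = self.parent, []
        while node is not None:
            found.append(node)
            node = node.parent
        return found

    def descendants(self):
        # Children, then their children, and so on.
        found = []
        for child in self.children:
            found.append(child)
            found.extend(child.descendants())
        return found

    def siblings(self):
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]


class TreeBuilder(HTMLParser):
    VOID = {"br"}  # elements without a closing tag never become a parent

    def __init__(self):
        super().__init__()
        self.current = None  # innermost element that is still open
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.nodes.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.current is not None:
            self.current = self.current.parent


def names(nodes):
    return [n.name for n in nodes]


builder = TreeBuilder()
builder.feed(FOOTER_PAGE)
builder.close()

by_name = {n.name: n for n in builder.nodes}  # for div, keeps the last one seen
a, body, section = by_name["a"], by_name["body"], by_name["section"]

print("ancestors of a:", names(a.ancestors()))
print("parent of a:", a.parent.name)
print("siblings of a:", names(a.siblings()))
print("children of body:", names(body.children))
print("descendants of section:", names(section.descendants()))
```

Because the country blocks are omitted, body has only two children here (h3 and section) rather than the four it has in the full page.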

2.3 Exercises

  1. Review the examples given for each type of relationship, and compare them to the HTML code printed at the beginning of Relationships Between Elements. Make sure you can see the relationships when the code is laid out in this way.

Use the code to answer the following questions, and verify your answers with the tree diagram.

  2. How many ancestors does each of the three h3 elements have?

  3. How many descendants does each of the three h3 elements have?

  4. What is the parent of each h3 element?

  5. How many children does each div have?

  6. How many siblings does each div have?