HTML is the language used to create webpages. When we copy and paste text from a webpage, we are interacting with the result of the HTML code, which tells our internet browsers how to display the content. When we scrape, we download the HTML code, so we need to dive into this language and understand its syntax so that we can extract the information we want. Webpages are designed for the user experience, not data collection, so it can get messy.
Take a look at this page: http://www.scrapethissite.com/pages/simple/
Imagine we want to scrape all of the country names off of the page. Which are the country names? How can we tell? We (hopefully) recognize most of the country names from background knowledge, and from there we can figure out that the country names are the first entries in each block, next to the flag icons.
If we want a computer to scrape this, though, how could it tell which pieces of text are country names, since it does not have the background knowledge?
A computer sees the HTML code, not the output that we are looking at, so we need to dive into the code behind the page.
As a very simple example of an HTML element, take this code:
<a href="https://sscc.wisc.edu/" class="my-link-class"> Click here </a>
which produces a link with the text "Click here".
This line of code consists of one element. Elements are the fundamental unit in HTML, and elements are composed of several pieces:
- Element names, which determine the output of the code, such as a link (a), a paragraph (p), or a level-four header (h4)
- Opening tags and closing tags, which mark the start (<a ... >) and the end of an element (</a>)
- Content, or text, which is outside angled brackets (<>) and is typically what we see displayed on the webpage
- Attributes, which determine additional properties of the elements, such as the link destination (href) or an element's class, which is used for styling pages
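To make these pieces concrete, here is a minimal sketch that feeds the example link to Python's built-in html.parser module and records each piece as the parser meets it. The class name `ElementAnatomy` and the `pieces` list are hypothetical names chosen for illustration; they are not part of the original material.

```python
from html.parser import HTMLParser


class ElementAnatomy(HTMLParser):
    """Record the pieces of each element as the parser encounters them."""

    def __init__(self):
        super().__init__()
        self.pieces = []  # (kind, value) tuples in document order

    def handle_starttag(self, tag, attrs):
        # tag is the element name; attrs is a list of (name, value) pairs
        self.pieces.append(("opening tag", tag))
        for name, value in attrs:
            self.pieces.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        # Text between the tags is the element's content
        if data.strip():
            self.pieces.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.pieces.append(("closing tag", tag))


parser = ElementAnatomy()
parser.feed('<a href="https://sscc.wisc.edu/" class="my-link-class"> Click here </a>')
parser.close()

for kind, value in parser.pieces:
    print(f"{kind}: {value}")
```

Running this should list the opening tag (a), the two attributes (href and class), the content (Click here), and the closing tag, in that order.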
Common HTML elements include:
a, anchor, for links
body, the body of the document
br, line break
section, sections of the document
em, italicized (emphasized) text
h1 through h6, level-one through level-six headers
html, which defines the document as an HTML document
p, paragraph with line breaks before and after
span, inline text
strong, bold text
Each element can take class and id attributes, and different elements can take different additional attributes. (An id is simply a unique identifier for an element.)
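Attributes matter for scraping because they give a program something to match on. The sketch below collects the text of every element that carries a given class, again using Python's built-in html.parser. The HTML fragment is a small stand-in written in the style of the example page, and `ClassTextCollector` is a hypothetical helper. It assumes a matching element does not contain another element with the same name.

```python
from html.parser import HTMLParser

# A small fragment in the style of the example page (an assumption for illustration)
HTML = """
<h3 class="country-name"> Bulgaria </h3>
<span class="country-capital">Sofia</span>
<h3 class="country-name"> Bahrain </h3>
<span class="country-capital">Manama</span>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text of every element whose class list includes target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.open_tag = None  # name of the matching element we are inside, if any
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # An element can have several space-separated classes
        classes = (dict(attrs).get("class") or "").split()
        if self.open_tag is None and self.target_class in classes:
            self.open_tag = tag
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            self.open_tag = None

    def handle_data(self, data):
        # Only keep text that appears inside a matching element
        if self.open_tag is not None:
            self.texts[-1] += data


collector = ClassTextCollector("country-name")
collector.feed(HTML)
collector.close()

names = [text.strip() for text in collector.texts]
print(names)  # ['Bulgaria', 'Bahrain']
```

A person recognizes country names from background knowledge. This program only knows that the text sits inside an element whose class is country-name.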
2.2 Relationships Between Elements
Now that we understand the basic syntax of an HTML element, let’s look at how multiple elements come together to form a webpage.
The webpage in this frame is a much-reduced version of the webpage we viewed earlier, http://www.scrapethissite.com/pages/simple/.
The HTML code that produced this page is below.
<html>
<body>
  <h3> Country Data </h3>
  <div class="col-md-4 country">
    <h3 class="country-name"> Bulgaria </h3>
    <div class="country-info">
      <strong>Capital:</strong> <span class="country-capital">Sofia</span><br>
      <strong>Population:</strong> <span class="country-population">7148785</span><br>
      <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">110910.0</span><br>
    </div>
  </div>
  <div class="col-md-4 country">
    <h3 class="country-name"> Bahrain </h3>
    <div class="country-info">
      <strong>Capital:</strong> <span class="country-capital">Manama</span><br>
      <strong>Population:</strong> <span class="country-population">738004</span><br>
      <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">665.0</span><br>
    </div>
  </div>
  <section id="footer">
    <div class="container">
      <div class="row">
        Adapted from <a href="https://scrapethissite.com/pages/simple/" class="data-attribution" target="_blank">Scrape This Site</a>
      </div>
    </div>
  </section>
</body>
</html>
We see some familiar elements and attributes in the code above, and some new ones too. Something else we see in this HTML code is that elements can appear before or after other elements, and also within other elements. This becomes clearer if we rearrange the code above into a tree diagram. We will just look at the element names to save space.
Note that the second div element contains all the same elements as the first div, but some elements have been abbreviated with ... for space reasons.
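The nesting can also be made visible with a short script. The sketch below prints each element name indented by how deeply it is nested, using Python's built-in html.parser. The HTML here is a shortened copy of the page (one country block and the footer), and `TreePrinter` and `VOID_ELEMENTS` are hypothetical names introduced for illustration. The set of void elements is only a subset of the real list.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a subset of HTML's void elements)
VOID_ELEMENTS = {"br", "img", "hr", "input", "meta", "link"}

# Shortened copy of the example page: one country block and the footer
HTML = """
<html> <body>
<h3> Country Data </h3>
<div class="col-md-4 country">
  <h3 class="country-name"> Bulgaria </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Sofia</span><br>
  </div>
</div>
<section id="footer"> <div class="container"> <div class="row">
  <a href="https://scrapethissite.com/pages/simple/">Scrape This Site</a>
</div> </div> </section>
</body> </html>
"""


class TreePrinter(HTMLParser):
    """Build an indented outline of element names."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        # Void elements such as br cannot contain anything, so depth is unchanged
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:
            self.depth -= 1


printer = TreePrinter()
printer.feed(HTML)
printer.close()
print("\n".join(printer.lines))
```

Each extra level of indentation in the output means the element sits inside the element above it at the previous level.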
Elements can have one or more of several relationships with another element:
- Ancestors are elements that contain other elements
  - html is the ancestor of all other elements
  - body is the ancestor of all other elements except html
  - body is the ancestor of three h3 elements
  - section is the ancestor of the a element
- Descendants are elements contained within other elements
  - All of the strong elements are descendants of div elements, and also of body and html
  - The a element is a descendant of section
- Parents are direct ancestors of other elements: ancestors with no intervening generations
  - html is the parent of body
  - body is the parent of one h3 element, two div elements, and one section element
- Children are direct descendants of other elements: descendants with no intervening generations
  - The a element is the child of a div
  - Two h3 elements are the children of div elements, and one is the child of body
- Siblings are elements that share the same parent
  - The a element has no siblings
  - Each h3 has at least one div sibling
Review the examples given for each type of relationship, and compare them to the HTML code printed at the beginning of Relationships Between Elements. Make sure you can see the relationships when the code is laid out in this way.
Use the code to answer the following questions, and verify your answers with the image.
How many ancestors does each of the three
How many descendants does each of the three
What is the parent of each
How many children does each
How many siblings does each
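Relationships like these can also be checked programmatically. The sketch below builds a simple tree from the reduced page's HTML with Python's built-in html.parser and then asks one element, the a element in the footer, for its ancestors, parent, and siblings. `Node`, `TreeBuilder`, and `VOID_ELEMENTS` are hypothetical names introduced for illustration, and the void-element set is only a subset of the real list.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a subset of HTML's void elements)
VOID_ELEMENTS = {"br", "img", "hr", "input", "meta", "link"}

HTML = """
<html> <body>
<h3> Country Data </h3>
<div class="col-md-4 country">
  <h3 class="country-name"> Bulgaria </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Sofia</span><br>
    <strong>Population:</strong> <span class="country-population">7148785</span><br>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">110910.0</span><br>
  </div>
</div>
<div class="col-md-4 country">
  <h3 class="country-name"> Bahrain </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Manama</span><br>
    <strong>Population:</strong> <span class="country-population">738004</span><br>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">665.0</span><br>
  </div>
</div>
<section id="footer"> <div class="container"> <div class="row">
  Adapted from <a href="https://scrapethissite.com/pages/simple/" class="data-attribution" target="_blank">Scrape This Site</a>
</div> </div> </section>
</body> </html>
"""


class Node:
    """One element in the tree, linked to its parent and children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def ancestors(self):
        """Names of all containing elements, nearest first."""
        names, node = [], self.parent
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def descendants(self):
        """Names of all contained elements, in document order."""
        names = []
        for child in self.children:
            names.append(child.name)
            names.extend(child.descendants())
        return names

    def siblings(self):
        """Names of the other elements that share this element's parent."""
        if self.parent is None:
            return []
        return [n.name for n in self.parent.children if n is not self]


class TreeBuilder(HTMLParser):
    """Turn a stream of opening and closing tags into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.current = None  # the innermost element that is still open
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.nodes.append(node)
        if self.root is None:
            self.root = node
        if tag not in VOID_ELEMENTS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self.current is not None:
            self.current = self.current.parent


builder = TreeBuilder()
builder.feed(HTML)
builder.close()

link = next(n for n in builder.nodes if n.name == "a")
print(link.ancestors())   # ['div', 'div', 'section', 'body', 'html']
print(link.parent.name)   # div
print(link.siblings())    # []
```

The same methods can be called on any other node in `builder.nodes` to check an answer after working it out by hand.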