2 HTML
HTML is the language used to create webpages. When we copy and paste text from a webpage, we are interacting with the result of the HTML code, which tells our internet browsers how to display the content. When we scrape, we download the HTML code, so we need to dive into this language and understand its syntax so that we can extract the information we want. Webpages are designed for the user experience, not data collection, so the HTML behind them can be messy to work with.
Take a look at this page: http://www.scrapethissite.com/pages/simple/
Imagine we want to scrape all of the country names off of the page. Which are the country names? How can we tell? We (hopefully) have background knowledge about most of the country names, and we can figure out from there that the country names are the first entries in each block next to the flag icons.
If we want a computer to scrape this, though, how could it tell which pieces of text are country names, since it does not have the background knowledge?
A computer sees the HTML code, not the output that we are looking at, so we need to dive into the code behind the page.
2.1 Elements
As a very simple example of an HTML element, take this code:
<a href="https://sscc.wisc.edu/" class="my-link-class"> Click here </a>
which produces a link with the text "Click here".
This line of code consists of one element. Elements are the fundamental unit in HTML, and elements are composed of several pieces:
- Element names, which determine the output of the code, such as a link (a), a paragraph (p), or a level-four header (h4)
- Opening tags and closing tags, which mark the start (<a ... >) and the end of an element (</a>)
- Content, or text, which is outside angled brackets (<>) and typically what we see displayed on the webpage
- Attributes, which determine additional properties of the elements, such as the link destination (href) or an element’s class, which is used for styling pages
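To see these pieces the way a program does, here is a minimal sketch that feeds the example element to Python's built-in html.parser module and records each piece as it is encountered. The class ElementPieces and the helper element_pieces are names made up for this illustration, and other scraping tools would expose the same pieces differently.

```python
from html.parser import HTMLParser

# The example element from the text, stored as a plain string.
SNIPPET = '<a href="https://sscc.wisc.edu/" class="my-link-class"> Click here </a>'


class ElementPieces(HTMLParser):
    """Record the pieces of each element the parser encounters."""

    def __init__(self):
        super().__init__()
        self.pieces = []  # (kind, value) pairs in document order

    def handle_starttag(self, tag, attrs):
        # tag is the element name; attrs is a list of (name, value) pairs.
        self.pieces.append(("opening tag", tag))
        for name, value in attrs:
            self.pieces.append(("attribute", (name, value)))

    def handle_data(self, data):
        # Content is the text outside the angled brackets.
        if data.strip():
            self.pieces.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.pieces.append(("closing tag", tag))


def element_pieces(html):
    """Hypothetical helper: return the recorded pieces for an HTML string."""
    parser = ElementPieces()
    parser.feed(html)
    parser.close()
    return parser.pieces


for kind, value in element_pieces(SNIPPET):
    print(kind, value)
# opening tag a
# attribute ('href', 'https://sscc.wisc.edu/')
# attribute ('class', 'my-link-class')
# content Click here
# closing tag a
```

The parser reports the element name, each attribute, the content, and the closing tag as separate events, which mirrors the four pieces listed above.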
Common HTML elements include:
- a, anchor, for links
- body, the body of the document
- br, line break
- div and section, sections of the document
- em, italicized (emphasized) text
- h1, h2, h3, h4, h5, and h6, level-one through level-six headers
- html, which defines the document as an HTML document
- p, paragraph with line breaks before and after
- span, inline text
- strong, bold text
- table, table
Each element can take class and id attributes, and different elements can take different additional attributes. (An id is simply a unique identifier for an element.)
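Because class and id values label elements, they also give a scraper something to search for. The sketch below, using Python's built-in html.parser, collects the text of every element that carries a given class. The HTML fragment is a simplified stand-in modeled on the country blocks of the example page, and ClassTextCollector and text_by_class are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser

# Simplified, made-up fragment modeled on the country blocks of the page.
FRAGMENT = """
<h3>Country Data</h3>
<div class="col-md-4 country">
  <h3 class="country-name">Bulgaria</h3>
  <span class="country-capital">Sofia</span>
</div>
<div class="col-md-4 country">
  <h3 class="country-name">Bahrain</h3>
  <span class="country-capital">Manama</span>
</div>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.capturing_tag = None  # name of the element being captured
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several space-separated class names.
        classes = (dict(attrs).get("class") or "").split()
        if self.capturing_tag is None and self.wanted_class in classes:
            self.capturing_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.capturing_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes the captured element does not contain
        # another element with the same name.
        if tag == self.capturing_tag:
            self.results.append("".join(self.buffer).strip())
            self.capturing_tag = None


def text_by_class(html, wanted_class):
    """Hypothetical helper: text of all elements with the given class."""
    collector = ClassTextCollector(wanted_class)
    collector.feed(html)
    collector.close()
    return collector.results


print(text_by_class(FRAGMENT, "country-name"))     # ['Bulgaria', 'Bahrain']
print(text_by_class(FRAGMENT, "country-capital"))  # ['Sofia', 'Manama']
```

Note that the first h3 ("Country Data") is skipped because it has no class attribute, even though it shares an element name with the country names.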
2.2 Relationships Between Elements
Now that we understand the basic syntax of an HTML element, let’s look at how multiple elements come together to form a webpage.
The webpage in this frame is a much-reduced version of the webpage we viewed earlier, http://www.scrapethissite.com/pages/simple/.
The HTML code that produced this page is below.
<html>
  <body>
    <h3>
      Country Data
    </h3>
    <div class="col-md-4 country">
      <h3 class="country-name">
        Bulgaria
      </h3>
      <div class="country-info">
        <strong>Capital:</strong> <span class="country-capital">Sofia</span><br>
        <strong>Population:</strong> <span class="country-population">7148785</span><br>
        <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">110910.0</span><br>
      </div>
    </div>
    <div class="col-md-4 country">
      <h3 class="country-name">
        Bahrain
      </h3>
      <div class="country-info">
        <strong>Capital:</strong> <span class="country-capital">Manama</span><br>
        <strong>Population:</strong> <span class="country-population">738004</span><br>
        <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">665.0</span><br>
      </div>
    </div>
    <section id="footer">
      <div class="container">
        <div class="row">
          Adapted from
          <a href="https://scrapethissite.com/pages/simple/" class="data-attribution" target="_blank">Scrape This Site</a>
        </div>
      </div>
    </section>
  </body>
</html>
We see some familiar elements and attributes in the code above, and some new ones too. Something else we see in this HTML code is that elements can appear before or after other elements, and also within other elements. This becomes clearer if we rearrange the code above into a tree diagram. We will just look at the element names to save space.
Note that the second div element contains all the same elements as the first div, but some elements have been abbreviated with ... for space reasons.
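One way to see this nesting without a diagram is to print each element name indented by its depth. The following is a rough sketch using Python's built-in html.parser. OutlinePrinter and outline are hypothetical names, and the approach assumes well-formed HTML in which br is the only element without a closing tag, which holds for this page but not for webpages in general.

```python
from html.parser import HTMLParser

# The reduced page from this section, with whitespace compacted.
PAGE = """
<html><body>
<h3>Country Data</h3>
<div class="col-md-4 country">
  <h3 class="country-name">Bulgaria</h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Sofia</span><br>
    <strong>Population:</strong> <span class="country-population">7148785</span><br>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">110910.0</span><br>
  </div>
</div>
<div class="col-md-4 country">
  <h3 class="country-name">Bahrain</h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Manama</span><br>
    <strong>Population:</strong> <span class="country-population">738004</span><br>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">665.0</span><br>
  </div>
</div>
<section id="footer"><div class="container"><div class="row">
  Adapted from
  <a href="https://scrapethissite.com/pages/simple/" class="data-attribution" target="_blank">Scrape This Site</a>
</div></div></section>
</body></html>
"""

# Assumption: br is the only element on this page with no closing tag.
VOID = {"br"}


class OutlinePrinter(HTMLParser):
    """Record element names, indented two spaces per level of nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID:
            self.depth += 1  # following elements are nested inside this one

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.depth -= 1  # step back out to the enclosing element


def outline(html):
    """Hypothetical helper: indented element names, one per line."""
    printer = OutlinePrinter()
    printer.feed(html)
    printer.close()
    return printer.lines


print("\n".join(outline(PAGE)))
```

The output begins with html, then body indented once, then h3 and div indented twice, and so on. Elements at the same indentation under the same enclosing element sit side by side in the page; deeper indentation means an element is nested inside the one above it.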
Elements can have one or more of several relationships with another element:
- Ancestors are elements that contain other elements
  - html is the ancestor of all other elements
  - body is the ancestor of all other elements except html
  - body is the ancestor of three h3 elements
  - section is the ancestor of div, div, and a
- Descendants are elements contained within other elements
  - All of the strong elements are descendants of div elements, and also of body
  - a is a descendant of section
- Parents are direct ancestors of other elements: ancestors with no intervening generations
  - html is the parent of body
  - body is the parent of one h3 element, two div elements, and one section element
- Children are direct descendants of other elements: descendants with no intervening generations
  - a is the child of div
  - Two h3 elements are the children of div elements, and one is the child of body
- Siblings are elements that share the same parent
  - strong, span, br, strong, span, br, strong, span, and br are siblings
  - a has no siblings
  - Each h3 has at least one div sibling
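These relationships can also be checked programmatically. The sketch below builds a small tree from a tags-only skeleton of the page (text and attributes removed, since only element names matter here) and then asks nodes for their ancestors, descendants, and siblings. Node, TreeBuilder, build_tree, and names are illustrative names, not part of any library, and the code assumes well-formed HTML with a single root element.

```python
from html.parser import HTMLParser

# Tags-only skeleton of the reduced page; text and attributes are omitted.
COUNTRY = """
<div><h3></h3><div>
  <strong></strong><span></span><br>
  <strong></strong><span></span><br>
  <strong><sup></sup></strong><span></span><br>
</div></div>
"""
SKELETON = (
    "<html><body><h3></h3>" + COUNTRY + COUNTRY +
    "<section><div><div><a></a></div></div></section></body></html>"
)

VOID = {"br"}  # assumption: br is the only element here with no closing tag


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def ancestors(self):
        # Walk upward: parent, grandparent, and so on up to the root.
        node, found = self.parent, []
        while node is not None:
            found.append(node)
            node = node.parent
        return found

    def descendants(self):
        # Walk downward: each child followed by that child's descendants.
        found = []
        for child in self.children:
            found.append(child)
            found.extend(child.descendants())
        return found

    def siblings(self):
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = None
        self.current = None  # the element we are currently inside

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if self.root is None:
            self.root = node
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID and self.current is not None:
            self.current = self.current.parent


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def names(nodes):
    return [n.name for n in nodes]


root = build_tree(SKELETON)
body = root.children[0]
section = body.children[-1]
a = section.descendants()[-1]

print(names(body.children))          # ['h3', 'div', 'div', 'section']
print(names(section.descendants()))  # ['div', 'div', 'a']
print(names(a.ancestors()))          # ['div', 'div', 'section', 'body', 'html']
print(names(a.siblings()))           # []
```

The same methods can be called on any other node, for example on the h3 or div elements, to check answers to questions about their relationships.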
2.3 Exercises
- Review the examples given for each type of relationship, and compare them to the HTML code printed at the beginning of Relationships Between Elements. Make sure you can see the relationships when the code is laid out in this way.
- Use the code to answer the following questions, and verify your answers with the image.
  - How many ancestors does each of the three h3 elements have?
  - How many descendants does each of the three h3 elements have?
  - What is the parent of each h3 element?
  - How many children does each div have?
  - How many siblings does each div have?