A great deal of data can be found in web pages, and "web scraping" is the process of turning those web pages into usable data sets. Stata's capabilities in this area are limited, but SSCC staff have written several programs that can carry out simple web scraping tasks. This article will introduce you to the readhtml package we've developed. You can get it by starting Stata and typing:
net install readhtml, from(https://ssc.wisc.edu/sscc/stata/)
Alternatively, type:
net from https://ssc.wisc.edu/sscc/stata/
then click on the readhtml link, followed by (click here to install).
(If you get an error trying to use one of the above commands, try replacing https with http.)
The readhtml package is in the early stages of development, so you should check its results carefully, though when something goes wrong it is usually obvious. If you find it does not work properly for a given web page (keeping in mind it only reads tables and lists), please let us know by emailing email@example.com. That said, no scraper can handle every web page; if you need to parse a page readhtml can't handle, the code for the main programs may give you some ideas for how to proceed.
The readhtml package
The readhtml package contains two main programs and two utility programs.
The readhtmltable program reads a web page, identifies any tables it contains, and turns them into a data set. Try scraping the SSCC's training schedule with:
readhtmltable https://ssc.wisc.edu/sscc_jsp/training/
You can tell it to use the first row as variable names with:
readhtmltable https://ssc.wisc.edu/sscc_jsp/training/, varnames
The varnames option can be abbreviated to just v. This gives you a usable data set containing the SSCC's training schedule for the current semester.
The readhtmllist program reads a web page, identifies any lists it contains, and turns them into a data set. Try:
readhtmllist https://ssc.wisc.edu/sscc/ssccnews/
This gives you a data set containing all the issues of SSCC News. Note that each year's issues are in a separate list, so readhtmllist created a variable for each list. In this case it might make sense to combine all the lists, but that won't always be true. The readhtmltable program will do something similar if a page contains multiple tables.
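If you do decide to combine the lists, Stata's stack command will put the variables end to end. A minimal sketch, assuming readhtmllist named the per-year list variables t1 through t3 (the actual names will depend on what it creates for this page):

```stata
* Hypothetical variable names t1, t2, t3: one per year's list.
* stack places them end to end in a single new variable, issue.
stack t1 t2 t3, into(issue) clear

* Lists of different lengths leave blank rows; drop them.
drop if missing(issue)
```

The stack command also creates a _stack variable recording which original list each observation came from, which you could keep as a year indicator.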
If you're interested in the links to the issues as well as their names, you can tell readhtmllist not to remove HTML markup with the html option:
readhtmllist https://ssc.wisc.edu/sscc/ssccnews/, html
Parsing the results and extracting the URL is left to the reader. Note that these are relative links, so to use them you'd have to put https://ssc.wisc.edu/sscc/ssccnews/ before each one.
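One way to do that parsing is with Stata's regular-expression functions. A minimal sketch, assuming the html-tagged list items ended up in a (hypothetical) string variable named item:

```stata
* Pull the value of each href attribute out of the raw HTML.
* (item is a hypothetical variable name; use whatever readhtmllist created.)
gen url = ""
replace url = regexs(1) if regexm(item, `"href="([^"]+)""')

* The links are relative, so prepend the base URL to make them usable.
replace url = "https://ssc.wisc.edu/sscc/ssccnews/" + url if url != ""
```

The compound double quotes (`"..."') let the pattern contain literal quotation marks, which ordinary Stata string quoting cannot.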
The readhtml package also contains two utility programs written for use by the other programs in the package, but you are welcome to use them independently. The striphtml program removes all HTML markup from a string; the html option tells readhtmltable or readhtmllist not to run it. The striphtmlcomments program removes all HTML comments from a string. The readhtmltable and readhtmllist programs use it to remove content the web page's author never meant to be seen, presumably for a reason. For example, the source code for the SSCC training schedule still contains code for some special classes we no longer teach; they're "commented out" so they aren't visible, and you would not want them in a data set of SSCC classes.
Acknowledgment: The readhtml package, and the names of the key programs, were inspired by R's XML package.
Last Revised: 6/26/2018