Data Wrangling in Python
In this chapter you’ll learn how to run Python code using JupyterLab, and some of the core features of the language itself with a focus on the tools you’ll need to do data wrangling.
JupyterLab allows you to combine Python code, the results of running that code, and text discussion in a single easy-to-use file called a Notebook. This makes it a great tool for learning Python, teaching Python, or communicating about work done in Python. It’s also great for initial data wrangling, exploratory analysis, or anything else where you’re frequently going back and forth between code and output, because it puts them close together.
On the other hand, JupyterLab can be cumbersome if your code is long or complicated. It’s not ideal for jobs that will run for a long time or if you need high performance because it adds a layer of complexity that can cause problems.
JupyterLab is actually a web application. When you run JupyterLab it starts a web server on your computer. (This web server is not accessible to anyone else.) It also opens a web browser and points it to the server. The user interface you interact with is just a web page in the browser. You’ll see the server process running on your computer, but you can completely ignore it.
The first version of JupyterLab was called Jupyter Notebook, and you’ll still find plenty of discussion of it. JupyterLab added some nice things like the ability to have multiple Notebooks open in the same browser tab, but all the features of Jupyter Notebook are still there so anything you read about Jupyter Notebook will apply to JupyterLab as well with only minor modifications.
Start JupyterLab, and you’ll see a list of files on the left. This is the content of the folder that JupyterLab started in. JupyterLab can only see files that are in or under the folder where it starts. (Remember it’s running a web server, and for security reasons web servers are designed to only be able to access files in their designated folder.) On a Linux server, be sure to set the working directory to the location of the files you want to work with before starting JupyterLab. On a Windows machine, you can open the properties of the JupyterLab shortcut and change Start in to whatever you need. Use the file list to navigate to the folder where you put the example files.
To create a new Notebook, click the big blue plus sign in the upper left to see the launcher, then click on the kind of Notebook you want. Always choose Python 3 for this book, but kernels are available for R, Stata, Julia, and many other languages. The Notebook will initially be called Untitled.ipynb, so right-click on its tab at the top and choose Rename Notebook to give it a better name. Call this one Python_Fundamentals_Practice.ipynb.
You can close the file list by clicking on the folder icon to the far left, giving you more room to edit Notebooks. Repeat to reopen it.
The fundamental unit of a Notebook is a cell. It will normally contain either Python code or Markdown text. The type of a cell can be found at the top of the Notebook window: it defaults to Code but you can click on that to change it.
Double-click on a cell to edit it, or press Esc to get out of edit mode. When you’re not in edit mode you can press b to add a blank cell, use the arrows to select different cells, press Enter to start editing the current cell, or press m to convert the current cell to Markdown.
You can run a cell by pressing either Shift-Enter or Ctrl-Enter, or by clicking the ‘play’ button at the top. When you run a code cell, the code is run and the output, if any, is placed directly below the cell. This makes it easy to see the results of what you just did and decide what to do next. When you run a Markdown cell, the Markdown text is rendered into formatted text.
Most data wrangling code is linear: you’ll carry out a series of steps, and each step will depend on the steps that came before. For example, step one might be to load a data set, and then step two to change the name of a variable in that data set from x to income. Obviously step two won’t work until after step one is completed successfully. But step two also won’t work if you try to run it twice (there is no longer a variable called x to be renamed). JupyterLab won’t enforce these rules: it will allow you to run any cell at any time. This can lead to confusion and errors.
Frequently you’ll set out to write a code cell that modifies a DataFrame and make a mistake such that the cell will run, but make the wrong changes. It won’t be enough to fix the code in the cell and run it again–you need to recreate the DataFrame it mangled.
The solution to both these problems is to click Run, Run All Cells. This will run all the cells in your Notebook from the beginning, in order. Think of this as the ‘real’ way to run your code. You don’t always need to use it, but any time you find yourself unsure about the current state of your data and wondering if it’s ready for you to proceed or not, do a Run All Cells. When you think you’re done with a Notebook, the ultimate test is to click Kernel, Restart & Run All Cells, or press the double play button (fast forward?) at the top. This will clear out everything from memory and then run your program, ensuring that, for example, your code doesn’t accidentally depend on something a previous version of it put in memory three hours ago that you’ve completely forgotten about.
Markdown is a “mark-up language” like HTML (HyperText Markup Language), but extremely simple and designed so that you can easily type the markup along with the text. For example, if you put a # at the beginning of a line (paragraph), that line will become a level one heading, while if you put ## it will become a level two heading. You don’t need to learn Markdown in order to use JupyterLab, or even to use Markdown cells: you can just type text in a Markdown cell and it will look like ordinary text. But here are some of the most useful Markdown elements:
# Level 1 Heading
## Level 2 Heading
### Level 3 Heading
`code`
Note that this is the backtick character, the angled quote probably in the upper left corner of your keyboard, not the regular single quote. We’ll use code format for anything you type–or might type, like variable names.
You can use three backticks (```) to start a block of text that should not be formatted. The block ends when you type three backticks again. That’s how this book is preventing these Markdown examples from being rendered. You can also use it for big blocks of code.
*italics*
We’ll use italics the first time we define key terms.
**bold**
1. Ordered List Item 1
1. Ordered List Item 2
The 1. (or any other number) just means “this is a numbered list.” Markdown will do the numbering.
- Unordered List Item
- Unordered List Item
To make a link, put the text you want to appear in square brackets immediately followed by the URL in parentheses:
[Data Wrangling in Python](https://sscc.wisc.edu/sscc/pubs/dwp)
Always put a blank line between paragraphs in Markdown, or it may combine them.
When you ‘run’ a Markdown cell, Jupyter Notebook interprets the Markdown and displays the text in the proper format. To edit the cell again, double-click on it.
Make the top cell of your Notebook a Markdown cell by clicking on Code at the top and changing it to Markdown. Type some text, including at least a header, something in bold and/or italics, and some code. Render it by pressing Ctrl-Enter or clicking the ‘play’ button and make sure the format is what you expected.
There are no wrong answers for this exercise, but the Markdown for your solution might look something like this before you run it:
# Markdown Exercise
Use one star for *italics* and two stars for **bold**.
Put code in backticks (left single quotes) like this: `this is code`.
JupyterLab automatically saves Notebooks periodically, but you can click the button that looks like a disk or press Ctrl-s to make sure everything is saved right now. Then click File, Shut Down. If that fails to close the web server running behind the scenes, click on it and press Ctrl-c twice.
While R or Stata were designed specifically for data wrangling and statistical analysis, Python is a general-purpose programming language used for a wide variety of tasks. The Pandas package gives Python the tools needed for data wrangling, but it builds on the foundation provided by Python. Thus before we can start talking about data wrangling we need to talk about some of the core concepts of Python.
Python is an Object Oriented Language, meaning you’ll spend most of your time working with objects. In the programming world, an object is a collection of data and/or functions. Each object is an instance of a class, with the class defining what data and functions the instance will contain. Since you’re using Python for data science rather than general programming, you’ll usually use objects created by others rather than defining your own.
Objects can contain other objects. For example, a DataFrame object, which stores a classic data set with observations as rows and variables as columns, contains one or more Series objects, each storing a column. If you extract a subset of a DataFrame that contains two columns, the result will be another DataFrame. However, if you extract a subset that contains one column, the result will be a Series.
Similar objects will often have the same functions. Both DataFrame and Series have sort_values() functions, for example. That means you can call the sort_values() function of an object without knowing or caring if the object is a DataFrame or a Series. That’s a good thing, because sometimes when you extract a subset from a DataFrame (say, all the columns whose names match a certain pattern) you don’t know which one the result will be.
Computer science tradition says that the first program you write in a new language should print “Hello World” on the screen. Do so by typing the following into a cell (leave it set to the default type, Code) and then press Ctrl-Enter or click the play button at the top:
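The cell in question is presumably the classic one-liner. A minimal sketch:

```python
# Pass the string 'Hello World' to the built-in print() function.
print('Hello World')
```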
This tells Python to do two things:
1. Create a string containing the text ‘Hello World’
2. Pass that string to the print() function as an argument, which causes print() to print it to the screen
In JupyterLab, you can do the same thing by just putting ‘Hello World’ (quotes included) in a cell and running it.
That’s because in JupyterLab, if you reference an object without doing something with it JupyterLab will assume you want to print it (an implicit print). This does not work in other environments. You can only do one implicit print per cell, so if you want to print multiple things use the print() function (explicit print).
Either way, the string ‘Hello World’ is now gone. If you want to use it in the future you need to give it a name.
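A minimal sketch of naming the string, using the variable name greeting from the discussion that follows:

```python
# Store the string in a variable called greeting, then print the variable.
greeting = 'Hello World'
print(greeting)
```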
This tells Python to store the string ‘Hello World’ as a variable called greeting. The act of giving the string a name tells Python you want to keep it, as well as giving you a way to reference it. You then pass the greeting variable to the print() function. Since there are no quotes around greeting, Python understands you want to print the content of the variable greeting rather than the word ‘greeting’.
Create a variable called custom_greeting that includes your name. Print it to have your computer greet you personally.
Packages are collections of useful functions and class definitions. Once you import a package, you can use its functions and create objects that are instances of its classes. For example, the Pandas package contains the definition for the DataFrame class. You import packages with the import command. Since you’ll use the name of the package frequently, you’ll often want to give packages shorter nicknames. It’s very common to call Pandas pd, for example. You do that when you import it:
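The conventional line is import pandas as pd, which needs Pandas to be installed. As a sketch that runs with the standard library alone, the same aliasing syntax is shown here with the statistics module and the nickname stats (an illustrative choice, not something the book uses):

```python
# Import a package under a shorter nickname with "import ... as ...".
import statistics as stats

# The nickname now stands in for the full package name.
average = stats.mean([1, 2, 3])
print(average)

# With Pandas installed, the conventional equivalent would be:
# import pandas as pd
```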
Many packages are made up of smaller modules. If you only need to import one module from a package, you can do that:
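A minimal sketch of this kind of import, assuming the example meant here is date from the standard library’s datetime:

```python
# Import a single name from datetime instead of the whole thing.
from datetime import date

# date can now be used directly, without a datetime. prefix.
new_year = date(2000, 1, 1)
print(new_year.year)
```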
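This imports just date from datetime. (Strictly speaking, datetime is a single module in the standard library and date is a class defined inside it, but the same from ... import syntax also imports one module from a package.) You can even import individual functions.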
Python and Python packages change frequently, and in ways that can break your code. Just finding versions of the packages that you need that are compatible can be a challenge. We’re not going to discuss package management in this book, but once you start writing code that produces results you care about and needs to be reproducible in the long term, we strongly recommend you use conda or another tool for creating and managing Python environments. Using Conda Environments for Python at the SSCC will show you how.
print() is a built-in function. It’s always available.
Package functions are associated with a package and carry out tasks related to the package. For example, the Pandas package contains a function called read_csv() that reads a CSV file and turns it into a DataFrame. To run a package function you refer to package_name.function_name(). So if you import Pandas as pd, you’ll run pd.read_csv().
Object functions are associated with a particular object, and normally act on that object. For example, a DataFrame object has a sort_values() function. If you’ve created a DataFrame called my_data then you can sort it by calling my_data.sort_values().
Objects can also have attributes. For example, the columns attribute of a DataFrame contains information about the columns and functions that act on them. If the DataFrame is called my_data, then you access the columns with my_data.columns. Note that you don’t put parentheses after the name of an attribute.
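The same three kinds of access can be tried without Pandas. The sketch below uses the standard library’s math and datetime as stand-ins, so the names root, my_date, as_text, and the_year are illustrative only:

```python
import math
from datetime import date

# Package function: package_name.function_name()
root = math.sqrt(16)

# Object function: called on a particular object, with parentheses
my_date = date(2000, 1, 1)
as_text = my_date.isoformat()

# Attribute: accessed on an object, with no parentheses
the_year = my_date.year

print(root, as_text, the_year)
```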
In a function call like print(greeting), greeting is an argument that tells the function what to do. If you need to pass in more than one argument, put commas in between them.
The parentheses () at the end of the function have two purposes. First, they tell Python you’re talking about a function rather than a variable or attribute (print is a variable; print() is a function). Second, any arguments you need to pass to the function go in the parentheses. But even if a function requires no arguments it needs parentheses so Python knows it’s a function.
You can pass any number of arguments to the print() function and it will print them. Note that it doesn’t care if you pass in variables containing objects or new objects not stored in a variable. It’s also happy to print things other than strings, but it turns them into strings to do so. More precisely it calls the object’s __str__() function, which should return a new string containing a representation of the object. For complicated objects (like DataFrames) that will often be a description of the object or some useful information about it.
Key word arguments, or kwargs in Python documentation, are parameters that are passed in with a name, or key word. The print() function takes a key word argument called sep that tells it what separator it should put between the unnamed arguments:
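A minimal sketch, assuming the greeting variable from earlier and an illustrative second string:

```python
greeting = 'Hello World'

# sep is passed by name; the two strings before it are unnamed arguments.
print(greeting, 'How are you today?', sep='\n')
```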
This told print() to put a new line (denoted by \n) between each item. If a key word argument is not specified, the default will be used. For sep the default is to put a space in between the items.
Many functions take one or more arguments without a key word and do something obvious with them (like the print() function prints them), then expect key word arguments for the rest.
Consider the problem of greeting several different people. Think of a greeting message that includes a person’s name and has text both before and after it. Create one variable for the person’s name (name), one variable for the text that comes before the name (text_before), and one variable for the text after the name (text_after). Print the three variables in the proper order so you get a coherent greeting.
Now change the content of your name variable to the name of a different person and print a greeting for them. (This is a warm-up for writing loops.)
name = 'Russell'
text_before = 'Hello '
text_after = ', how are you today?'
print(text_before, name, text_after, sep='')
name = 'Jason'
print(text_before, name, text_after, sep='')
Hello Russell, how are you today?
Hello Jason, how are you today?
Note that I made my answer slightly more complicated than it needs to be by choosing a text_after that puts a comma directly after name. Thus I need to include sep='' so print() doesn’t automatically put a space between name and text_after, which means I need to manually add a space at the end of text_before.
Functions with lots of arguments can quickly become long and hard to read. Reorganizing them so that you have one argument per line can make them much more readable:
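As a sketch, here is the print() call from the greeting exercise above rewritten with one argument per line:

```python
name = 'Russell'
text_before = 'Hello '
text_after = ', how are you today?'

# One argument per line, indented inside the parentheses.
print(
    text_before,
    name,
    text_after,
    sep=''
)
```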
When Python reaches the end of a line but can tell that a statement is incomplete, it assumes that the statement continues on the next line. In this case, Python knows the call to print() is incomplete because there’s an open parenthesis but no close parenthesis until several lines later. Indenting the code that is logically inside the parentheses makes the structure easily recognizable.
The same principle can be applied to any kind of code, not just functions, by starting the statement with an open parenthesis:
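A minimal sketch with an ordinary arithmetic expression (the variable name total is illustrative):

```python
# The opening parenthesis tells Python the statement continues
# until the matching closing parenthesis.
total = (
    1
    + 2
    + 3
)
print(total)
```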
One thing you cannot break into multiple lines is a string. The following will not run:
print(
'How are you
today?'
)
However, if you end a string on one line and immediately start another on the next, they will be combined automatically:
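A minimal sketch, with illustrative text. Note the space at the end of the first piece, since nothing is added where the pieces are joined:

```python
# Adjacent string literals are merged into one string before print() runs.
print(
    'How are you '
    'today?'
)
```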
Note that this is different from:
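A sketch of the comma-separated version, again with illustrative text:

```python
# With a comma, print() receives two separate arguments
# and puts the default separator (a space) between them.
print(
    'How are you',
    'today?'
)
```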
In the first version, the strings on each line are combined into a single string before they are passed to print(). Thus print() sees them as a single argument and does not put a space between them.
The ability to break up strings across lines is particularly useful with queries, which we’ll see in the next chapter.
Python has several basic data structures. You’ll rarely use them to store actual data–you’ll use Pandas objects like DataFrames for that. But you will use the basic data structures to tell Pandas what to do.
A list stores a list of items. You’ll use them in Pandas to specify lists of variables, for example. The items go in square brackets with commas in between them:
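For instance, with illustrative names:

```python
# Square brackets, with commas between the items.
my_list = ['Alice', 'Bob', 'Claire']
print(my_list)
```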
The items can be of any type, and do not have to be of the same type:
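A small sketch with made-up values:

```python
# A string, an integer, a float, and a boolean in one list.
mixed_list = ['Alice', 42, 3.14, True]
print(mixed_list)
```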
You can reference a particular item from a list by putting square brackets after the list name containing the number of the item you want. However, item numbers always start with zero, so [1] will get you the second item!
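A minimal sketch, reusing an illustrative list of names:

```python
my_list = ['Alice', 'Bob', 'Claire']

# Item numbers start at 0, so [1] is the second item.
second_item = my_list[1]
print(second_item)
```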
Many other types of objects can use the same square bracket notation to identify subsets, including strings:
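For example, assuming the greeting string from earlier:

```python
greeting = 'Hello World'

# Square brackets pick out a single character of a string.
first_letter = greeting[0]
print(first_letter)
```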
You can use subsets on either side of an equals sign. This allows you to change part of a list:
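A sketch with illustrative names:

```python
my_list = ['Alice', 'Bob', 'Claire']

# Putting the subset on the left of the equals sign replaces that item.
my_list[1] = 'Brian'
print(my_list)
```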
A slice is a range of integers, specified with start:end. You can select multiple items from a list by putting a slice in the square brackets:
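A sketch of slicing, assuming the greeting string from earlier plus an illustrative list:

```python
greeting = 'Hello World'
my_list = ['Alice', 'Bob', 'Claire']

# Items 0 through 6 of the string (item 7 is not included).
first_seven = greeting[0:7]
print(first_seven)

# Items 0 and 1 of the list.
first_two = my_list[0:2]
print(first_two)
```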
Note that the starting item is included in the slice, but the ending item is not: to include greeting[6] in the slice we had to specify that the slice goes from 0 to 7.
Create a list that contains your telephone number as separate digits (e.g. [6, 0, 8,…]). Then create a subset containing just the area code using a slice.
The in operator lets you ask if an item is in a list:
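A minimal sketch with illustrative names:

```python
my_list = ['Alice', 'Bob', 'Claire']

# in gives True if the item is found and False otherwise.
has_bob = 'Bob' in my_list
has_dave = 'Dave' in my_list
print(has_bob, has_dave)
```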
You can create an empty list with empty square brackets:
You can then add items to it (or any list) with the append() function:
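Putting the two steps together, as a sketch with illustrative names:

```python
# Start with an empty list.
new_list = []

# append() adds one item to the end of the list each time it is called.
new_list.append('Alice')
new_list.append('Bob')
print(new_list)
```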
Hardcore Python programmers will warn you not to build big lists this way, as every append() call forces Python to take time to adjust the memory structure of the entire list, but that won’t be an issue the way we’ll use it.
Tuples are like lists, but they go in parentheses rather than square brackets. They mostly act like lists and can be subsetted like lists.
The difference between a list and a tuple is that tuples are immutable, meaning that once you create one you cannot change it. That means you cannot run:
my_tuple[2] = 'Claire'
However, you can change a variable containing a tuple so that it points to a new tuple:
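The sketch below walks through the whole sequence with illustrative names. It uses try and except, which this chapter does not cover, only so the cell keeps running after the error:

```python
# Tuples go in parentheses and can be subsetted like lists.
my_tuple = ('Alice', 'Bob', 'Carol')
print(my_tuple[0])

# Changing an item of a tuple raises a TypeError.
try:
    my_tuple[2] = 'Claire'
except TypeError as error:
    print('Cannot change a tuple:', error)

# Pointing the variable at a brand new tuple is allowed.
my_tuple = ('Alice', 'Bob', 'Claire')
print(my_tuple)
```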
We won’t use tuples nearly as often as lists, but some tasks require them.
A dictionary is a list of pairs of items: a key and a value. In a classic printed dictionary (i.e. a book), the word would be the key and the definition of the word would be the value. However, Pandas frequently uses a dictionary when it just needs pairs of things, like the old name of a variable and the new name for the variable.
Dictionaries go in curly brackets, with a colon between items in a pair and commas between pairs. Keys can be numbers or strings (or other immutable things); values can be anything.
To select a value from a dictionary, put the corresponding key in square brackets.
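A minimal sketch; the keys and values are made up for illustration:

```python
# Curly brackets, key: value pairs, commas between pairs.
my_dictionary = {'x': 'income', 1: 'first pair', 2: 'second pair'}

# Look up a value by putting its key in square brackets.
new_name = my_dictionary['x']
print(new_name)
print(my_dictionary[1])
```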
Note that my_list[1] and my_dictionary[1] mean very different things! my_list[1] selects item number 1 from my_list, which, since Python numbers from 0, is the second item. my_dictionary[1] selects the value from my_dictionary corresponding to the key 1, regardless of where it is in the dictionary.
Use a dictionary to create a contacts list with names and phone numbers for at least three people. (Hint: the name of the person will be the key and their phone number the value.) Demonstrate how to retrieve the phone number of a person.
The workflow of data analysis programs tends to be very simple: execute all the statements in order. But there are times when you need to change that flow.
An if statement allows you to specify that some code should only be executed if a certain condition is true. For example:
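A minimal sketch, with an arbitrary starting value for x:

```python
x = 3
# The indented line runs only when the condition is true.
if x>0:
    print('x is positive.')
```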
Put in different values for x and see what happens.
Note that the print() line is indented. In Python indentation isn’t just a good practice to make your code readable; it’s how Python knows which statements are only to be executed if the condition is true:
x = -1
if x>0:
    print('x is positive.')
    print('That means it is greater than zero.')
print('I am done talking about x now.')
I am done talking about x now.
The third print() is not part of the if block, so it is executed regardless of the value of x.
elif is short for else if. If an if block with one condition is followed by a similar elif block with a second condition, the code in the elif block will be executed if the first condition is false but the second condition is true. You can have any number of elif blocks. An else block at the end will be executed if none of the previous conditions are true:
x = 0
if x>0:
    print('x is positive.')
    print('That means it is greater than zero.')
elif x<0:
    print('x is negative.')
else:
    print('x is zero.')
print('I am done talking about x now.')
x is zero.
I am done talking about x now.
We’ll talk more about conditions in the next chapter in the context of selecting observations from a data set based on conditions, but you can use everything you’ll learn there with if as well.
The poverty level for a family of four in the year 2000 was $17,050. Create a variable called income and store a number in it. Then write a series of if/elif/else blocks that prints ‘poor’ if income is below the poverty level, ‘low income’ if income is above the poverty level but lower than two times the poverty level, and ‘not low income’ if income is above two times the poverty level. Rerun your code with several values of income and make sure you always get the right answer.
A for loop allows you to execute a block of code once for each item in a list. The item is stored in a variable you can use in the code. For example, the following code takes a list of numbers and prints the square of each one:
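A minimal sketch, using the names numbers and i from the explanation that follows and an arbitrary set of values:

```python
numbers = [1, 2, 3, 4]

# The indented block runs once for each item in numbers.
for i in numbers:
    print(i**2)
```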
The statement for i in numbers tells Python to create a variable called i, store the first value from the list numbers in it, and then execute the following block of code. The block is defined by indentation, just like with if. When it reaches the end of the block (one line in this case) Python goes back to the top, puts the second item in the list in i, and repeats the process. It stops when it runs out of items in numbers.
The list to be looped over does not need to be defined ahead of time: it can be created directly in the for statement. The range() function is an easy way to generate sequences of numbers. range(n) produces the numbers from 0 to n-1 (the endpoint is not included, like in a slice); range(n1, n2) produces the numbers from n1 to n2-1. (range() can also make more complicated sequences; see the documentation.) So we could do the same thing with:
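A sketch, assuming the earlier list held the numbers 1 through 4:

```python
# range(1, 5) produces 1, 2, 3, 4; the endpoint 5 is not included.
for i in range(1, 5):
    print(i**2)
```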
In a previous exercise you printed greetings for multiple people by defining name, text_before (text that comes before the name), and text_after (text that comes after the name), printing the three variables, changing name, and printing them again. Now create a list of names and loop over it, printing a greeting for each name in the list.
A list comprehension is a kind of backward for loop that takes a list, transforms each item in the list in some way, and returns them as a new list. Earlier we took a list of numbers and printed the square of each number; if we instead wanted to create a list containing the numbers squared, we could do it with:
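A minimal sketch; the list contents and the name squares are illustrative:

```python
numbers = [1, 2, 3, 4]

# The expression i**2 comes first, followed by the familiar for i in numbers.
squares = [i**2 for i in numbers]
print(squares)
```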
Note how the list comprehension goes in square brackets just like a normal list. Inside the brackets, for i in numbers is familiar and does the same thing as in a for loop, but the code to be executed comes first this time.
Lists are frequently used to store data set column names as strings. Use a list comprehension to create a list containing the variable names x1, x2, x3 through x10 by adding a number i to the string 'x'.
Hint 1: In Python, adding two strings concatenates them (puts the second one after the first one). ‘abc’ + ‘def’ = ‘abcdef’.
Hint 2: To add a number i to a string, convert it to a string with str(i).
You can write your own functions and then use them just like built-in Python functions. Start with def, short for define, give the function a name followed by parentheses (we’ll put arguments in those parentheses shortly), and a colon. Start with a very basic function for greeting people:
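A minimal sketch, assuming the function is called greeting() as in the later discussion:

```python
# def, the function name, empty parentheses, and a colon.
# The indented block is the body of the function.
def greeting():
    print('Hello, how are you today?')

# Call the function by name, with parentheses.
greeting()
```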
To have your function accept arguments, put them in the parentheses after the function name. For example:
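A sketch with a single parameter called name; the message wording is an assumption based on the output shown later:

```python
# name is a parameter: it holds whatever is passed in when the function is called.
def greeting(name):
    print('Hello ' + name + ', how are you today?')

greeting('Russell')
```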
Note that within the function the parameter will be called name, but the thing you pass in can have a different name or no name at all:
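A sketch of both cases. The function is repeated so the cell runs on its own, and the variable name person is illustrative:

```python
def greeting(name):
    print('Hello ' + name + ', how are you today?')

# Pass in a variable with a different name...
person = 'Andy'
greeting(person)

# ...or a string that has no name at all.
greeting('David')
```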
Hello Andy, how are you today?
Hello David, how are you today?
The parameter name you choose can also be used as the key word for a key word argument:
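For example, with the same sketch of greeting() repeated so the cell runs on its own:

```python
def greeting(name):
    print('Hello ' + name + ', how are you today?')

# The parameter name doubles as a key word.
greeting(name='Andy')
```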
A function can (and usually will) return a value that can be used for other purposes. Here’s a version of greeting() that returns the greeting as a string, which we can then print or store for later use:
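A sketch of such a version; the variable name message is illustrative:

```python
# return hands the string back to the caller instead of printing it.
def greeting(name):
    return 'Hello ' + name + ', how are you today?'

# Store the returned string, then print it.
message = greeting('Andy')
print(message)
```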
A function ends as soon as it encounters a return statement:
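A small sketch with a hypothetical function called stop_early():

```python
def stop_early():
    print('This line runs.')
    return 'done'
    print('This line never runs.')  # unreachable: the function already returned

result = stop_early()
print(result)
```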
A function has its own namespace for variables: assigning to a variable inside a function creates a variable local to that function, even if a variable with the same name exists outside it, and variables defined within the function go away when the function ends. (A function can read variables defined outside it, but it’s better not to rely on that.) This allows a function to define and use variables without having to worry about the possibility that the rest of the program was using variables with those names for something else. Ideally, information only comes into the function as parameters and only goes out as returned values. (You can break that rule, if you need to, with global.)
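A minimal sketch, assuming a function called change_x() that assigns to x:

```python
x = 1

def change_x():
    # This creates a new x that exists only inside the function.
    x = 2

change_x()
print(x)
```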
The change_x() function fails to actually change x, because the x in the main program and the x in the function are actually completely different variables.
Define a function that takes a parameter income and returns the string ‘poor’, ‘low income’, or ‘not low income’ just like in the previous exercise. Specifically, return ‘poor’ if income is less than $17,050, the year 2000 poverty level for a family of four; ‘low income’ if income is less than two times the poverty level; and ‘not low income’ if income is greater than two times the poverty level.
Call the function several times with different values of income and make sure it gives the correct answer.
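One possible solution sketch follows. The function name income_category() and the variable poverty_level are hypothetical choices, not names from the book:

```python
def income_category(income):
    poverty_level = 17050  # year 2000 poverty level for a family of four
    if income < poverty_level:
        return 'poor'
    if income < 2 * poverty_level:
        return 'low income'
    return 'not low income'

print(income_category(10000))
print(income_category(20000))
print(income_category(50000))
```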
It would also work to use if, elif, and else exactly like in the previous exercise. But remember that a function stops running after it carries out a return statement, so elif and else are not strictly needed here.
A lambda function is a one-line function without a name. We’ll use them later when we need to embed a particular variable transformation in a string of code. The syntax is:
lambda x: f(x)
where f(x) is some single expression involving x, though you can give x a different name. The result of the expression is the return value of the function.
Usually a lambda function will be defined and used at the same time, but you can store it as a variable and use it later:
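A minimal sketch; the name square and the expression are illustrative:

```python
# Store the lambda function in a variable, then call it like any other function.
square = lambda x: x**2
print(square(3))
```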
Used this way, a lambda function is just an alternative to the standard way of defining a function. You’ll see how they can be more useful later.
1.1.4 Comments
Comments are text included in your code for the benefit of human readers, not for Python. Comments can explain what your code does and why, and if anyone else ever needs to read and understand your code they’ll be very grateful for good comments. But the person who is most likely to benefit from your comments is you, when you have to figure out how your program works months or years after writing it. Sadly, your past self is a poor collaborator who never answers emails.
Notebooks use Markdown cells for comments in the traditional sense. However, you can also use comments to tell the computer not to run certain parts of your code without having to delete it. This is very helpful if you only want to remove code temporarily, or if you’re not sure you really want to remove it.
You can turn a line of text in a code cell into a comment by putting # in front of it:
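A small sketch with made-up code, showing one line switched off by a comment:

```python
total = 1
# total = total + 100
total = total + 1
print(total)
```

Python skips the second line because it starts with #, so only the last addition runs.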
To prevent a whole cell from running, turn it into a Markdown cell.