1  Python Fundamentals

For most of this book you’ll learn Python in the process of learning data wrangling. But you do need to understand some basics before you can do anything useful, and we’ll cover them in this chapter and the next.

1.1 Running Python Using JupyterLab

JupyterLab allows you to combine Python code, the results of running that code, and text discussion in a single easy-to-use file called a Notebook. This makes it a great tool for learning Python, teaching Python, or communicating about work done in Python. It’s also great for initial data wrangling, exploratory analysis, or anything else where you’re frequently going back and forth between code and output, because it puts them close together.

On the other hand, JupyterLab can be cumbersome if your code is long or complicated. It’s not ideal for jobs that will run for a long time or if you need high performance because it adds a layer of complexity that can cause problems.

JupyterLab is actually a web application. When you run JupyterLab it starts a web server on your computer. (This web server is not accessible to anyone else.) It also opens a web browser and points it to the server. The user interface you interact with is just a web page in the browser. You’ll see the server process running on your computer, but you can completely ignore it.

The first version of JupyterLab was called Jupyter Notebook, and you’ll still find plenty of discussion of it. JupyterLab added some nice things like the ability to have multiple Notebooks open in the same browser tab, but all the features of Jupyter Notebook are still there so anything you read about Jupyter Notebook will apply to JupyterLab as well with only minor modifications.

1.1.2 Cells

The fundamental unit of a Notebook is a cell. It will normally contain either Python code or Markdown text. The type of a cell can be found at the top of the Notebook window: it defaults to Code but you can click on that to change it.

Double-click on a cell to edit it, or press Esc to get out of edit mode. When you’re not in edit mode you can press b to add a blank cell, use the arrows to select different cells, press Enter to start editing the current cell, or press m to convert the current cell to Markdown.

You can run a cell by pressing either Shift-Enter or Ctrl-Enter, or by clicking the ‘play’ button at the top. When you run a code cell, the code is run and the output, if any, is placed directly below the cell. This makes it easy to see the results of what you just did and decide what to do next. When you run a Markdown cell, the Markdown text is rendered into formatted text.

Most data wrangling code is linear: you’ll carry out a series of steps, and each step will depend on the steps that came before. For example, step one might be to load a data set, and then step two to change the name of a variable in that data set from x to income. Obviously step two won’t work until after step one is completed successfully. But step two also won’t work if you try to run it twice (there is no longer a variable called x to be renamed). JupyterLab won’t enforce these rules: it will allow you to run any cell at any time. This can lead to confusion and errors.
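Here is a minimal sketch of the run-it-twice problem. It uses a plain Python dictionary as a stand-in for a real data set (dictionaries are covered later in this chapter), and the name data and the numbers in it are made up for illustration:

```python
# Step one: "load" the data. A dictionary stands in for a real data set.
data = {'x': [52000, 48000, 61000]}

# Step two: rename x to income by moving the values to a new key.
data['income'] = data.pop('x')
print(data)

# Running step two a second time fails because x no longer exists.
second_run_failed = False
try:
    data['income'] = data.pop('x')
except KeyError:
    second_run_failed = True
print('Second run failed:', second_run_failed)
```

In a Notebook the second run would simply stop with an error message. The try and except lines are only here so the example can show the failure and keep going.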

Frequently you’ll set out to write a code cell that modifies a DataFrame and make a mistake such that the cell will run, but make the wrong changes. It won’t be enough to fix the code in the cell and run it again–you need to recreate the DataFrame it mangled.

The solution to both these problems is to click Run, Run All Cells. This will run all the cells in your Notebook from the beginning, in order. Think of this as the ‘real’ way to run your code. You don’t always need to use it, but any time you find yourself unsure about the current state of your data and wondering if it’s ready for you to proceed or not, do a Run All Cells. When you think you’re done with a Notebook, the ultimate test is to click Kernel, Restart & Run All Cells, or press the double play button (fast forward?) at the top. This will clear out everything from memory and then run your program, ensuring that, for example, your code doesn’t accidentally depend on something a previous version of it put in memory three hours ago that you’ve completely forgotten about.

1.1.3 Markdown

Markdown is a “mark-up language” like HTML (HyperText Markup Language), but extremely simple and designed so that you can easily type the markup along with the text. For example, if you put a # at the beginning of a line (paragraph), that line will become a level one heading, while if you put ## it will become a level two heading. You don’t need to learn Markdown in order to use JupyterLab, or even to use Markdown cells: you can just type text in a Markdown cell and it will look like ordinary text. But here are some of the most useful Markdown elements:

# Level 1 Heading

## Level 2 Heading

### Level 3 Heading

`code` 

Note that the character on either side of the code is the backtick (the angled quote in the upper left corner of the keyboard), not the regular single quote. We’ll use code format for anything you type–or might type, like variable names.


*italics* 

We’ll use italics the first time we define key terms.


**bold** 

1. Ordered List Item 1
1. Ordered List Item 2

The 1. (or any other number) just means “this is a numbered list.” Markdown will do the numbering.


- Unordered List Item
- Unordered List Item

To make a link, put the text you want to appear in square brackets immediately followed by the URL in parentheses:

[Data Wrangling in Python](https://sscc.wisc.edu/sscc/pubs/dwp)

Always put a blank line between paragraphs in Markdown, or it may combine them.

When you ‘run’ a Markdown cell, JupyterLab interprets the Markdown and displays the text in the proper format. To edit the cell again, double-click on it.

Exercise

Make the top cell of your Notebook a Markdown cell by clicking on Code at the top and changing it to Markdown. Type some text, including at least a header, something in bold and/or italics, and some code. Render it by pressing Ctrl-Enter or clicking the ‘play’ button and make sure the format is what you expected.

1.1.4 Comments

Comments are text included in your code for the benefit of human readers, not for Python. Comments can explain what your code does and why, and if anyone else ever needs to read and understand your code they’ll be very grateful for good comments. But you are the most likely beneficiary of your comments, when you have to figure out how your program works months or years after writing it. Sadly, your past self is a poor collaborator who never answers emails.

Notebooks use Markdown cells for comments in the traditional sense. However, you can also use comments to tell the computer not to run certain parts of your code without having to delete it. This is very helpful if you only want to remove code temporarily, or if you’re not sure you really want to remove it.

You can turn a line of text in a code cell into a comment by putting # in front of it:

#print("Maybe I don't want to print this after all")

Sometimes you’ll run some code and learn from the results, but not need to keep it as part of your actual workflow. In that case you can take that code and put it in a Markdown cell along with an explanation of what you learned. You can tell Markdown not to format the code by putting it in a block that starts with ``` and ends with ```:

print("Maybe I don't want to print this after all")

1.1.5 Closing JupyterLab

JupyterLab automatically saves Notebooks periodically, but you can click the button that looks like a disk to make sure everything is saved right now. Then click File, Shut Down. If that fails to close the web server running behind the scenes, click on it and press Ctrl-c twice.

1.2 Python Core Concepts

While R and Stata were designed specifically for data wrangling and statistical analysis, Python is a general-purpose programming language used for a wide variety of tasks. The Pandas package gives Python the tools needed for data wrangling, but it builds on the foundation provided by Python. Thus before we can start talking about data wrangling we need to talk about some of the core concepts of Python.

1.2.1 Introducing Objects

Python is an Object Oriented Language, meaning you’ll spend most of your time working with objects. In the programming world, an object is a collection of data and/or functions. Each object is an instance of a class, with the class defining what data and functions the instance will contain. Since you’re using Python for data science rather than general programming, you’ll usually use objects created by others rather than defining your own.
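As a small illustration, you can ask Python which class any object is an instance of. This sketch uses the built-in type() and isinstance() functions, and the variable names and values are made up:

```python
# Every value in Python is an object, and type() reports its class.
city = 'Madison'
count = 3
print(type(city))
print(type(count))

# isinstance() checks whether an object is an instance of a given class.
city_is_string = isinstance(city, str)
count_is_string = isinstance(count, str)
print(city_is_string, count_is_string)
```

The first two lines of output name the classes str (string) and int (integer), and the last line prints True False.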

Objects can contain other objects. For example, a DataFrame object, which stores a classic data set with observations as rows and variables as columns, contains one or more Series objects, each storing a column. If you extract a subset of a DataFrame that contains two columns, the result will be another DataFrame. However, if you extract a subset that contains one column, the result will be a Series.

Similar objects will often have the same functions. Both DataFrame and Series have sort_values() functions, for example. That means you can call the sort_values() function of an object without knowing or caring if the object is a DataFrame or Series. That’s a good thing, because sometimes when you extract a subset from a DataFrame (say, all the columns whose names match a certain pattern) you don’t know which one the result will be.
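The same idea can be seen without Pandas. In this sketch a string and a list (both built into Python, with lists covered later in this chapter) stand in for the DataFrame and Series; both happen to have a count() function:

```python
letters_as_string = 'banana'
letters_as_list = ['b', 'a', 'n', 'a', 'n', 'a']

# Both classes provide count(), so the same call works on either object.
string_count = letters_as_string.count('a')
list_count = letters_as_list.count('a')
print(string_count, list_count)
```

Both calls return 3, and the code that calls count() never had to check which kind of object it was working with.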

1.2.2 Making and Storing Objects

Computer science tradition says that the first program you write in a new language should print “Hello World” to the screen. Do so by typing the following into a cell (leave it set to the default type, Code) and then press Ctrl-Enter or click the play button at the top:

print('Hello World')
Hello World

This tells Python to do two things:

  • Create an object containing the string (text) ‘Hello World’
  • Pass this object to the print() function as an argument, which causes print() to print it to the screen

In JupyterLab, you can do the same thing by just putting ‘Hello World’ in a cell and running it:

'Hello World'
'Hello World'

That’s because in JupyterLab, if you reference an object without doing something with it, JupyterLab will assume you want to print it (an implicit print). This does not work when you run Python code as an ordinary script outside of a Notebook. You also only get one implicit print per cell, for the last line of the cell, so if you want to print multiple things use the print() function (explicit print).
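A short sketch of the difference, assuming the code below is run as a single cell in JupyterLab (the variable names are made up):

```python
first = 'Hello'
second = 'World'

# Explicit prints: both values appear in the output.
print(first)
print(second)

# Implicit print: when this cell runs in JupyterLab, only the value of the
# last line (second) is displayed. The bare `first` line produces nothing.
first
second
```

Run as a script outside of a Notebook, the last two lines would produce no output at all.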

Either way, the string ‘Hello World’ is now gone. If you want to use it in the future you need to give it a name.

greeting = 'Hello World'
print(greeting)
Hello World

This tells Python to store the string ‘Hello World’ as a variable called greeting. The act of giving the string a name tells Python you want to keep it, as well as giving you a way to reference it. You then pass the greeting variable to the print() function. Since there are no quotes around greeting, Python understands you want to print the content of the variable greeting rather than the word ‘greeting’.

Exercise

Create a variable called custom_greeting that includes your name. Print it to have your computer greet you personally.

1.2.3 Packages

Packages are collections of useful functions and class definitions. Once you import a package, you can use its functions and create objects that are instances of its classes. For example, the Pandas package contains the definition for the DataFrame class. You import packages with the import command. Since you’ll use the name of the package frequently, you’ll often want to give packages shorter nicknames. It’s very common to call Pandas pd, for example. You do that when you import it:

import pandas as pd

Many packages are made up of smaller pieces, such as modules, classes, and functions. If you only need one of them, you can import just that piece:

from datetime import date

This imports just the date class from the datetime module, which is part of Python’s standard library.
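To see how the two styles of import differ in use, here is a small sketch. The nickname dt and the variable names are our own choices, not something Python requires:

```python
import datetime as dt          # import everything under a nickname
from datetime import date      # import just one name

# With the nickname, you reach date through dt...
new_year_long = dt.date(2025, 1, 1)
# ...while the from-import lets you use the name directly.
new_year_short = date(2025, 1, 1)

print(new_year_long == new_year_short)
```

Both lines create the same date, so this prints True. Which style to use is mostly a question of how often you’ll type the name and how likely it is to be confused with something else.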

The Anaconda distribution of Python includes most of the packages you are likely to need for data wrangling. If you need to install new packages, conda and pip are the most common tools, though we won’t discuss their usage. SSCC members using SSCC’s servers should read Installing Python Packages on Winstat or (for Linux servers) the section on installing Python packages in the Guide to Research Computing at the SSCC.

1.2.4 Functions and Arguments

print() is a built-in function. It’s always available.

Package functions are associated with a package and carry out tasks related to the package. For example, the Pandas package contains a function called read_csv() that reads a CSV file and turns it into a DataFrame. To run a package function you refer to package_name.function_name(). So if you import Pandas as pd, you’ll refer to pd.read_csv().

Object functions are associated with a particular object, and normally act on that object. For example, a DataFrame object has a sort_values() function. If you’ve created a DataFrame called my_data then you can sort it by calling my_data.sort_values().

Objects can also have attributes. For example, the columns attribute of a DataFrame contains information about the columns and functions that act on them. If the DataFrame is called my_data, then you access the columns with my_data.columns. Note that you don’t put parentheses after the name of an attribute.
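The three forms look like this in practice. Since we haven’t loaded any data yet, this sketch uses Python’s standard library instead of Pandas, with math playing the role of the package and a string and a date playing the role of the objects:

```python
import math
from datetime import date

# Package function: package_name.function_name()
root = math.sqrt(16)

# Object function: called on a particular object, and acts on that object.
name = 'fred'
shouted = name.upper()

# Attribute: no parentheses, because nothing is being called.
day = date(2025, 1, 15)
year = day.year

print(root, shouted, year)
```

This prints 4.0 FRED 2025.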

In a function call like print(greeting), greeting is an argument that tells the function what to do. If you need to pass in more than one argument, put commas in between them.

The parentheses () at the end of the function have two purposes. First, they tell Python to call the function rather than just refer to it, the way you would refer to a variable or attribute (print by itself is just the name of the function; print() runs it). Second, any arguments you need to pass to the function go in the parentheses. But even if a function requires no arguments it needs parentheses so Python knows you want to call it.
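A quick sketch of what happens with and without the parentheses, using the upper() function of a string (the variable names are made up):

```python
name = 'fred'

# Without parentheses you get the function itself, which is an object too.
the_function = name.upper
print(callable(the_function))

# With parentheses the function is actually called and returns its result.
the_result = name.upper()
print(the_result)
```

The first print shows True, because the_function holds something that can be called. The second shows FRED. Forgetting the parentheses is a common mistake, and the symptom is usually output that describes a function rather than the result you expected.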

You can pass any number of arguments to the print() function and it will print them. Note that it doesn’t care if you pass in variables containing objects or new objects not stored in a variable. It’s also happy to print things other than strings, but it turns them into strings to do so. More precisely, it converts each object the same way the built-in str() function does, which returns a new string containing a representation of the object. For complicated objects (like DataFrames) that will often be a description of the object or some useful information about it.

print(greeting, 'how are you today?', 12345)
Hello World how are you today? 12345
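You can call str() yourself to see the string that print() would display. A small sketch, with made-up values:

```python
from datetime import date

number_text = str(12345)
date_text = str(date(2025, 1, 15))

# Both results are ordinary strings now.
print(number_text)
print(date_text)
print(type(number_text), type(date_text))
```

The number becomes the string '12345', and the date object chooses to represent itself as '2025-01-15'.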

Key word arguments, or kwargs in Python documentation, are parameters that are passed in with a name, or key word. The print() function takes a key word argument called sep that tells it what separator it should put between the unnamed arguments:

print(greeting, 'how are you today?', 12345, sep='\n')
Hello World
how are you today?
12345

This told print() to put a new line (denoted by \n) between each item. If a key word argument is not specified, the default will be used. For sep the default is to put a space in between the items.

Many functions follow the pattern of taking one or more unnamed arguments and doing something obvious with them (like the print() function printing them), then using key word arguments for everything else.
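The built-in sorted() function follows the same pattern, as this sketch shows. It sorts a list (lists are covered in the next section) and the scores are made up:

```python
scores = [72, 95, 88]

# sorted() takes the list as an unnamed argument. reverse is a key word
# argument whose default is False, meaning smallest first.
ascending = sorted(scores)
descending = sorted(scores, reverse=True)

# print() has other key word arguments too: end sets what is printed
# after the last item (the default is a new line).
print(ascending, descending, sep=' | ', end=' <- done\n')
```

This prints [72, 88, 95] | [95, 88, 72] <- done.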

Exercise

Consider the problem of greeting several different people. Think of a greeting message that includes a person’s name and has text both before and after it. Create one variable for the person’s name, one variable for the text that comes before the name, and one variable for the text after the name. Print the three variables in the proper order so you get a coherent greeting.

Now change the value of your name variable to the name of a different person and print a greeting for them. (This is a warm-up for eventually writing loops.)

1.2.5 Line Breaks and Readability

Functions with lots of arguments can quickly become long and hard to read. Reorganizing them so that you have one argument per line can make them much more readable:

print(
    greeting,
    'how are you today?',
    12345,
    sep='\n'
)
Hello World
how are you today?
12345

When Python reaches the end of a line but can tell that a statement is incomplete, it assumes that the statement continues on the next line. In this case, Python knows the call to print() is incomplete because there’s an open parenthesis, (, but no close parenthesis, ), until several lines later. Indenting the code that is logically (if not physically) inside the parentheses makes the structure easily recognizable.

The same principle can be applied to any expression, not just function calls, by wrapping the expression in parentheses:

(
    print(
        greeting,
        'how are you today?',
        12345,
        sep='\n'
    )
)
Hello World
how are you today?
12345
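Wrapping in parentheses is most useful for long calculations and for chains of functions, where each step can get its own line. A minimal sketch with made-up values:

```python
# Arithmetic split across lines inside parentheses.
total = (
    100
    + 20
    + 3
)

# A chain of string functions, one step per line.
cleaned = (
    '  Hello World  '
    .strip()
    .lower()
    .replace('world', 'there')
)

print(total, cleaned)
```

This prints 123 hello there. Without the outer parentheses, Python would treat the end of the first line as the end of the statement.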

1.3 Basic Data Structures

Python has several basic data structures. You’ll rarely use them to store actual data–you’ll use Pandas objects like DataFrames for that. But you will use the basic data structures to tell Pandas what to do.

1.3.1 Lists

A list stores a list of items. You’ll use them in Pandas to specify lists of variables, for example. The items go in square brackets with commas in between them:

my_list = [1, 2, 3]
my_list
[1, 2, 3]

The items can be of any type, and do not have to be of the same type:

my_list = [1, 2, 'Fred']
my_list
[1, 2, 'Fred']

You can reference a particular item from a list by putting square brackets after the list name containing the number of the item you want. However, item numbers always start with zero, so [1] will get you the second item!

print(my_list[0])
print(my_list[1])
print(my_list[2])
1
2
Fred

Many other types of objects can use the same square bracket notation to identify subsets, including strings:

greeting = 'Hello World'
print(greeting[0], greeting[6])
H W

You can use subsets on either side of an equals sign. This allows you to change part of a list:

my_list[2] = 'Claire'
my_list
[1, 2, 'Claire']

A slice is a range of integers, specified with start:end. You can select multiple items from a list (or multiple characters from a string) by putting a slice in the square brackets:

greeting[0:7]
'Hello W'

Note that the starting item is included in the slice, but the ending item is not: to include greeting[6] in the slice we had to specify that the slice goes from 0 to 7.
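A few more slices may help make the rule concrete. This sketch also leaves out the start or the end of a slice, which Python allows but which we have not discussed above:

```python
greeting = 'Hello World'
my_list = [1, 2, 'Fred']

first_word = greeting[:5]     # start omitted: begin at item 0
second_word = greeting[6:]    # end omitted: continue through the last item
middle = greeting[3:8]        # items 3, 4, 5, 6, and 7
first_two = my_list[0:2]      # slices work the same way on lists

print(first_word, second_word, middle, first_two, sep='\n')
print(len(middle))            # end minus start: 8 - 3 = 5
```

Because the ending item is excluded, the number of items in a slice is just the end minus the start.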

Exercise

Create a list that contains your telephone number as separate digits (e.g. [6, 0, 8,…]). Then create a subset containing just the area code using a slice.

1.3.2 Tuples

Tuples are like lists, but they go in parentheses rather than square brackets. They mostly act like lists and can be subsetted like lists.

my_tuple = (1, 2, 'Fred')
my_tuple[0]
1

The difference between a list and a tuple is that tuples are immutable, meaning that once you create one you cannot change it. That means you cannot run:

my_tuple[2] = 'Claire'
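If you try it, Python stops with a TypeError. This sketch catches the error so you can see the message and confirm the tuple was left alone (the try and except lines are not something you would normally need in a Notebook):

```python
my_tuple = (1, 2, 'Fred')

assignment_failed = False
try:
    my_tuple[2] = 'Claire'
except TypeError as error:
    assignment_failed = True
    print('Python refused:', error)

# The tuple is unchanged.
print(my_tuple)
```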

However, you can change a variable containing a tuple so that it points to a new tuple:

my_tuple = (1, 2, 'Claire')
my_tuple
(1, 2, 'Claire')

We won’t use tuples nearly as often as lists, but some tasks require them.

1.3.3 Dictionaries

A dictionary is a collection of pairs of items: a key and a value. In a classic printed dictionary (i.e. a book), the word would be the key and the definition of the word would be the value. However, Pandas frequently uses a dictionary when it just needs pairs of things, like the old name of a variable and the new name for the variable.

Dictionaries go in curly brackets, with a colon between items in a pair and commas between pairs. Keys can be numbers or strings (or other “immutable” things); values can be anything.

my_dictionary = {1: 'a', 3: 'b', 'Fred': 5}
my_dictionary
{1: 'a', 3: 'b', 'Fred': 5}
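The rule that keys must be immutable is one place where the difference between tuples and lists matters. A small sketch, with made-up data about office hours:

```python
# A tuple is immutable, so it can serve as a key.
office_hours = {('Monday', 9): 'open', ('Sunday', 9): 'closed'}
print(office_hours)

# A list is mutable, so Python rejects it as a key.
list_key_rejected = False
try:
    bad_dictionary = {['Monday', 9]: 'open'}
except TypeError as error:
    list_key_rejected = True
    print('Python refused:', error)
```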

To select a value from a dictionary, put the corresponding key in square brackets.

print(my_dictionary[1])
print(my_dictionary['Fred'])
a
5

Note that my_list[1] and my_dictionary[1] mean very different things! my_list[1] selects item number 1 from my_list, which, since Python numbers from 0, is the second item. my_dictionary[1] selects the value from my_dictionary corresponding to the key 1, regardless of where it is in the dictionary.
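The contrast is easy to see side by side. This sketch reuses the list and dictionary from above and catches the error raised by a missing key so the example can run to the end:

```python
my_list = [1, 2, 'Claire']
my_dictionary = {1: 'a', 3: 'b', 'Fred': 5}

from_list = my_list[1]              # position 1, i.e. the second item
from_dictionary = my_dictionary[1]  # the value stored under the key 1
print(from_list, from_dictionary)

# Position 2 exists in the list, but there is no key 2 in the dictionary.
key_missing = False
try:
    my_dictionary[2]
except KeyError:
    key_missing = True
print(my_list[2], key_missing)
```

The first print shows 2 a and the second shows Claire True: a list always has an item at each position up to its length, but a dictionary only has the keys you put in it.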

Exercise

Use a dictionary to create a contacts list with names and phone numbers for at least three people. (Hint: the name of the person will be the key and their phone number the value.) Demonstrate how to retrieve the phone number of a person.