Data Wrangling in Python

Author

Russell Dimond

Published

May 24, 2024

Introduction

Most data sets need to be transformed in some way before they can be analyzed, a process that’s come to be known as data wrangling. Data Wrangling in Python will introduce you to the key concepts, tools, and skills of data wrangling, implementing them in Python using primarily the Pandas package.

What to Expect From This Book

By the time you finish this book you’ll have a solid set of data wrangling skills. Because this book is the curriculum for a workshop, some topics were omitted due to time constraints, notably working with text data and date/time data.

You’ll also know a good bit of useful Python, but nowhere near everything there is to know. Unlike R or Stata, Python is a general-purpose programming language used for a wide variety of tasks. It was developed by computer scientists after the computer science community had several decades of experience creating computer languages. Thus it tends to be highly abstract, and it can be challenging for new Python users to understand what it is actually doing and why. We’ll do our best to help you understand Python the language as well as how to use it to wrangle data, but the focus is on the latter.

No prior knowledge of either data wrangling or Python is needed to benefit from this book, as we’ll start from the beginning with both. You’ll need to learn both even if your eventual goal is to learn machine learning or other advanced techniques.

Setting Up

This book is intended to be used with the Anaconda distribution of Python, which includes JupyterLab. Members of the Social Science Computing Cooperative can log into Winstat, where Anaconda is pre-installed and ready for your use.

This book was written using Python 3.11.5 and Pandas 2.1.1. Future versions of either may break some of the example code in this book (it’s happened before). You could solve that problem by setting up a Conda environment, but that’s not a great place to start your adventure with Python. You can probably work around any issues.
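If you’d like to see how your own setup compares with those versions, a short check along the following lines may help. This is only a sketch: the function names installed_pandas_version and describe_setup are made up for illustration, and it uses only the Python standard library, so it should run even if Pandas is not installed yet.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# The Pandas version the book says it was written with.
BOOK_PANDAS = "2.1.1"


def installed_pandas_version():
    """Return the installed Pandas version string, or None if Pandas is missing."""
    try:
        return version("pandas")
    except PackageNotFoundError:
        return None


def describe_setup():
    """Build a two-line report comparing this setup to the book's versions."""
    info = sys.version_info
    python_now = f"{info.major}.{info.minor}.{info.micro}"
    pandas_now = installed_pandas_version()

    lines = [f"Python: {python_now} (book used 3.11.5)"]
    if pandas_now is None:
        lines.append("Pandas: not installed")
    else:
        lines.append(f"Pandas: {pandas_now} (book used {BOOK_PANDAS})")
    return "\n".join(lines)


print(describe_setup())
```

A difference in versions does not necessarily mean anything will break; it just tells you where to look first if an example behaves differently for you.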

The example files for the class can be found at https://sscc.wisc.edu/sscc/pubs/dwp/dwp.zip. They download as a single Zip file, which you’ll need to unzip before using the files. The Zip file contains a folder called dwp (Data Wrangling in Python) that holds all the individual files. Winstat users should put the unzipped folder on their U: drive; if you’re using your own computer, you can put it wherever is convenient.
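Most people will unzip the file with their operating system’s own tools, and that works fine. If you’re curious how it could be done from Python instead, here is a minimal sketch using the standard library’s zipfile module. The function name unzip_examples is hypothetical, and the sketch assumes you have already downloaded dwp.zip into the folder where you run the code.

```python
import zipfile
from pathlib import Path


def unzip_examples(zip_path, dest_dir):
    """Extract every file in zip_path into dest_dir and return the sorted member names."""
    zip_path = Path(zip_path)
    dest_dir = Path(dest_dir)
    # Create the destination folder if it does not exist yet.
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())


# Assumption: dwp.zip sits in the current folder. Do nothing if it is not there.
if Path("dwp.zip").exists():
    extracted = unzip_examples("dwp.zip", ".")
    print(f"Extracted {len(extracted)} items")
```

Extracting into "." puts the dwp folder next to the Zip file; pass a different destination if you want it somewhere else, such as your U: drive on Winstat.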

The example files include the Notebooks used to generate this book, and you could choose to read the notebooks instead of reading the book on the web. Print-outs of DataFrames don’t fit very well in the book format, but the notebooks don’t have the convenient chapter navigation.

Getting the Most Out of This Book

To get the most out of this book you need to be an active participant. Open JupyterLab, make Notebooks, and type in and run the example code yourself (resist the temptation to copy and paste). This will help you retain more, and ensure you get all the details right—Python is always happy to tell you when you’re wrong. Do the exercises: some of them are straightforward applications of what you just learned; others will require more creativity. Data wrangling in Python is not something you read and understand—it’s a skill you must practice.