Data Visualization in R with ggplot2
This article will teach you how to use data visualizations to understand and communicate your data with ggplot2 (hereafter just “ggplot”).
This article is organized by the numbers and kinds of variables we would like to plot. After discussing the basic building blocks of ggplot, we will plot univariate, bivariate, and multivariate data. Then, we will discuss some of ggplot’s options for customizing plots’ appearance, and we will finish with a brief look at saving plots for use in other applications. You are strongly encouraged to follow along by running the code on your own computer.
Plotting is especially useful in the early stages of data analysis, as you seek to understand your data, and in the later stages, as you visually assess model assumptions (see Regression Diagnostics with R) and plot predicted values from fitted models (see Plotting Predicted Values: Margins Plots).
While plotting with ggplot, a cheatsheet you will come back to again and again is the Data Visualization Cheatsheet, which serves as a quick reference guide to ggplot syntax and options. It is available on RStudio’s website alongside other cheatsheets. The R Graph Gallery is also a great resource that shows you what is possible with ggplot, and it provides example code. You will undoubtedly also make use of countless other websites and Stack Exchange discussions you find when you Google “how to change axis font size ggplot.”
We will plot two datasets: a sample from the 2000 American Community Survey, and a subsample of this dataset. To load them into R, either download them from the links in the previous sentence and then load them with readRDS(), or load them directly from their URLs with the code below.
acs <- readRDS(url("https://sscc.wisc.edu/sscc/pubs/dvr/acs.rds"))
acs_small <- readRDS(url("https://sscc.wisc.edu/sscc/pubs/dvr/acs_small.rds"))
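If you download the files instead, readRDS() takes the path to the saved file. The sketch below uses a small made-up data frame written to a temporary file as a stand-in for a downloaded .rds file; the object names and values are illustrative assumptions, not part of the ACS data.

```r
# Hypothetical stand-in for a downloaded file: write a small data frame
# to a temporary .rds file, then read it back
toy <- data.frame(age = c(20, 19, 50), income = c(10000, 5300, 32500))
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)

# readRDS() returns the stored object, so assign the result to a name
toy_loaded <- readRDS(path)

# Check that the round trip preserved the data
identical(toy, toy_loaded)
```

With a real download, replace path with the location where you saved the file.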
Browse the structure of the data with str():
## 'data.frame': 27410 obs. of 9 variables:
##  $ household    : int 37 37 37 241 242 377 418 465 465 484 ...
##  $ person       : int 1 2 3 1 1 1 1 1 2 1 ...
##  $ age          : int 20 19 19 50 29 69 59 55 47 33 ...
##  $ maritalStatus: chr "Never married" "Never married" "Never married" "Never married" ...
##  $ income       : int 10000 5300 4700 32500 30000 51900 12200 0 2600 16800 ...
##  $ female       : int 1 1 1 1 1 1 1 0 1 0 ...
##  $ hispanic     : int 0 0 0 0 0 0 0 0 0 0 ...
##  $ race         : chr "White" "White" "Black" "White" ...
##  $ edu          : Factor w/ 5 levels "Less than High School",..: 3 3 3 5 4 1 1 3 1 2 ...
When plotting, character vectors (here, race) are treated as factors and ordered alphabetically. For unordered categorical variables, such as state names, this default may be fine. For categorical variables with a natural order, such as edu (education), we can specify an order with fct_relevel() from the forcats package. This has already been done with this dataset. See our chapter on working with factors for more.
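As a minimal sketch of this reordering, the example below builds a small made-up education vector and uses fct_relevel() to replace the default alphabetical order with a substantive one. The values and level names are illustrative assumptions, not the actual coding of edu in the ACS data, and the code assumes the forcats package is installed.

```r
library(forcats)

# Hypothetical education values; not the actual ACS coding
edu_chr <- c("Bachelors", "High School", "Less than High School",
             "High School", "Advanced Degree")

# Default: factor() sorts the levels alphabetically
edu_default <- factor(edu_chr)
levels(edu_default)

# Specify a natural order instead, lowest to highest
edu_ordered <- fct_relevel(
  edu_default,
  "Less than High School", "High School", "Bachelors", "Advanced Degree"
)
levels(edu_ordered)
```

Only the order of the levels changes; the values themselves are untouched. Plots that use the reordered factor will arrange its categories in the new order.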
Data manipulation (calculating means, filtering observations, etc.) is typically handled outside the ggplot() call, so the examples below will make use of dplyr’s data manipulation functions and the base R pipe operator (|>) to prepare the data and pass it to ggplot. For a review, read our chapter on First Steps with Dataframes from Data Wrangling with R.
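The pattern looks roughly like the sketch below, which filters and summarizes a toy data frame with dplyr and then pipes the result into ggplot(). The data values are made up (only the column names mirror the acs dataset), and the geom_col() layer is used purely for illustration. The code assumes dplyr and ggplot2 are installed and that you are running R 4.1 or later, which introduced the |> operator.

```r
library(dplyr)
library(ggplot2)

# Toy data with made-up values; column names mirror the acs dataset
toy <- data.frame(
  age    = c(20, 35, 50, 65, 15, 42),
  income = c(10000, 42000, 55000, 30000, 0, 61000),
  edu    = c("High School", "Bachelors", "Bachelors",
             "High School", "High School", "Bachelors")
)

# Step 1: prepare the data outside of ggplot()
income_by_edu <- toy |>
  filter(age >= 18) |>                     # keep adults only
  group_by(edu) |>
  summarize(mean_income = mean(income))    # one row per education level

# Step 2: pass the prepared data to ggplot() as its first argument
p <- income_by_edu |>
  ggplot(aes(x = edu, y = mean_income)) +
  geom_col()

p  # printing the object draws the plot
```

The two steps can also be written as one continuous pipeline. Note that data preparation steps are joined with |>, while plot layers are added with +.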