Visualizing Regression Results in R
1 Setup
This article will teach you how to use ggpredict()
and plot()
to visualize the marginal effects of one or more variables of interest in linear and logistic regression models. You will learn how to specify predictor values and how to fix covariates at specific values, in addition to options for customizing plots.
Marginal means are predicted outcomes given certain constraints, and a marginal effect is the predicted change in the outcome after varying a variable of interest while holding others constant.
As our models grow in complexity and dimensionality, we face increasing difficulty in interpreting coefficients. Visualizing margins helps us better understand and communicate our model results.
To begin, load the ggeffects
and ggplot2
libraries. ggeffects
has the ggpredict()
function, which we will use to calculate margins, and ggplot2
has other plotting functions.
library(ggeffects)
library(ggplot2)
Load a sample of the 2019 American Community Survey. The variables in this dataset are described in Regression Diagnostics with R.
acs <- readRDS(url("https://sscc.wisc.edu/sscc/pubs/data/RegDiag/acs2019sample.rds"))
Fit a linear model predicting income
from age
, education
, the interaction of the two, hours_worked
per week, and sex
.
mod <- lm(income ~ age * education + hours_worked + sex, acs)
Now, suppose we want to understand the marginal effect of age
for females with a bachelor’s degree who work 40 hours per week.
View the model estimates with summary()
:
summary(mod)
##
## Call:
## lm(formula = income ~ age * education + hours_worked + sex, data = acs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104651 -20303 -5457 10053 575865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14651.75 7303.23 -2.006 0.044933 *
## age 321.88 188.29 1.709 0.087479 .
## educationHigh school -2538.18 8806.58 -0.288 0.773204
## educationSome college -11156.06 9145.48 -1.220 0.222629
## educationAssociate's degree 6956.78 12208.20 0.570 0.568829
## educationBachelor's degree -2617.29 10355.82 -0.253 0.800491
## educationAdvanced degree 43542.41 15015.65 2.900 0.003764 **
## hours_worked 1061.12 72.49 14.637 < 2e-16 ***
## sexFemale -15115.84 1966.42 -7.687 2.08e-14 ***
## age:educationHigh school 174.06 213.52 0.815 0.415029
## age:educationSome college 505.64 223.04 2.267 0.023466 *
## age:educationAssociate's degree 154.85 280.84 0.551 0.581415
## age:educationBachelor's degree 813.06 246.26 3.302 0.000973 ***
## age:educationAdvanced degree 329.19 316.10 1.041 0.297781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49320 on 2747 degrees of freedom
## (2239 observations deleted due to missingness)
## Multiple R-squared: 0.2391, Adjusted R-squared: 0.2355
## F-statistic: 66.39 on 13 and 2747 DF, p-value: < 2.2e-16
Our model estimates give us the coefficient estimates and significance tests we have been trained to interpret and report, but they are less useful when we want to calculate expected outcomes.
ggpredict()
with plot()
provides a visual aid for select terms. The two are complementary; neither can fully replace the other.
ggpredict(mod,
terms = "age",
condition = c(education = "Bachelor's degree",
hours_worked = 40,
sex = "Female")) |>
plot()
This plot helps us visualize the marginal effect of age on income when we hold education
, hours_worked
, and sex
at specific values. The expected increase in income with age appears to be quite substantial.
2 Plotting Margins
We will continue to plot margins from mod
, our regression model fit to the acs
dataset.
We will use two functions to create margins plots: ggpredict()
and plot()
. ggeffects
has an additional method for plot()
to create margins plots with ggplot2
.
Two arguments of ggpredict()
that we will use are model
and terms
. model
is just the name of our fitted model, mod
.
terms
takes a character vector of up to four predictor names, but here we will only use three. The three terms are mapped to the ggplot aesthetics of x
, group
, and facet
, in that order.
x
is mapped to the x-axisgroup
s have different lines, colors, and/or shapesfacet
s are separate sub-plots
The three terms take different types of variables.
x
can be continuous or categoricalgroup
andfacet
must be categorical, or they will be made categorical- See Margins at Specific Values to specify how continuous variables are handled (e.g., by using specific values, mean ± SD, etc.)
2.1 Terms
Our model has two continuous (age
, hours_worked
) and two categorical variables (education
, sex
). We can create a range of plots with different numbers and orders of these variables.
One continuous:
ggpredict(mod, terms = c("age")) |> plot()
One categorical:
ggpredict(mod, terms = c("education")) |> plot()
Two continuous:
ggpredict(mod, terms = c("age", "hours_worked")) |> plot()
Two categorical:
ggpredict(mod, terms = c("education", "sex")) |> plot(colors = "bw")
One continuous then one categorical:
ggpredict(mod, terms = c("age", "sex")) |> plot()
One categorical then one continuous. Compare this plot to the previous plot, noticing how age
is turned into a categorical variable since it is the group
term.
ggpredict(mod, terms = c("sex", "age")) |> plot()
One continuous then two categorical:
ggpredict(mod, terms = c("age", "sex", "education")) |> plot()
3 Margins at Specific Values
3.1 Of Predictors
ggpredict()
will by default plot margins for all values of our x
term and certain values of our group
and facet
terms. We may want to reduce the range of continuous variables or examine margins for only some of the values a categorical variable can take.
To plot margins at specific values of terms, specify the terms as var [values]
, where var
is the variable name and values
can take on one of several forms:
- Comma-separated values:
[20, 40, 60, 80]
- For factors, specify the levels of interest without quotes:
[High school, Bachelor's degree]
- For factors, specify the levels of interest without quotes:
- Sequence with
:
:[30:40]
- Optionally, add a step size:
[20:80, by=20]
- Optionally, add a step size:
- Mean ± one standard deviation (default for continuous variables when they are
group
orfacet
):[meansd]
- First quartile, median, and third quartile:
[quart2]
See more options at ?values_at
.
We can plot margins for values of 20, 40, 60, and 80 for age
and high school and bachelor’s degree for education
:
ggpredict(mod, terms = c("age [20:80, by=20]", "education [High school, Bachelor's degree]")) |> plot()
3.2 Of Covariates
Recall that margin effects are predicted changes in the outcome while holding all else constant. We should ask, held constant at what value? ggpredict()
by default will set numeric terms to their means and factor terms to their reference level.
To change these, use the condition
argument of ggpredict()
. Supply it with a vector of var = value
pairs.
The plot below shows the marginal effect of age
when education
is high school and hours_worked
is 40.
ggpredict(mod,
terms = c("age"),
condition = c(education = "High school",
hours_worked = 40)) |>
plot()
The other term in our model is sex
, a factor. We did not specify a condition, so it was fixed at its reference level (male).
We can confirm this by adding a condition to fix sex
to its reference level, and we can see the plot is exactly the same:
ggpredict(mod,
terms = c("age"),
condition = c(education = "High school",
hours_worked = 40,
sex = "Male")) |>
plot()
4 Customizing Plot Appearance
The plot()
method from ggeffects
includes some options for customization. For further fine-tuning, we can modify our margins plots with ggplot functions like labs()
and theme()
since plot()
uses ggplot2
.
4.1 Confidence Intervals
For continuous variables, modify the ci.style
argument to get confidence intervals of four types: ribbon
(default), errorbar
, dash
, or dot
.
library(gridExtra)
grid.arrange(ggpredict(mod, terms = c("age")) |> plot(ci.style = "ribbon"),
ggpredict(mod, terms = c("age")) |> plot(ci.style = "errorbar"),
ggpredict(mod, terms = c("age")) |> plot(ci.style = "dash"),
ggpredict(mod, terms = c("age")) |> plot(ci.style = "dot"),
nrow = 2)
Alternatively, confidence intervals can be removed by setting ci = F
. This works for both continuous and categorical variables.
ggpredict(mod, terms = c("age")) |> plot(ci = F)
4.2 Values
If we have a simple plot of points with error bars, we might want to add text with predicted values.
We first need to find the names of the variables in the dataframe produced by ggpredict()
ggpredict(mod, terms = c("education")) |> as.data.frame()
## x predicted std.error conf.low conf.high group
## 1 Less than high school 42277.53 4263.354 33917.83 50637.24 1
## 2 High school 47572.19 1860.959 43923.17 51221.21 1
## 3 Some college 53875.13 2175.254 49609.83 58140.43 1
## 4 Associate's degree 56202.58 2863.881 50587.00 61818.16 1
## 5 Bachelor's degree 76247.93 2390.404 71560.76 80935.10 1
## 6 Advanced degree 100633.57 3580.973 93611.90 107655.25 1
The predicted
column contains the predicted outcome values.
Load the ggrepel
library, which has functions for intelligently positioned labels. Use geom_text_repel()
and specify that the label
aesthetic should be predicted
rounded to the nearest integer. Increase point.padding
to 5 to make the text a little further from the points.
library(ggrepel)
ggpredict(mod, terms = c("education")) |> plot() +
geom_text_repel(aes(label = round(predicted)), point.padding = 10)
4.3 Colors
The group
aesthetic is assigned colors, which we can change with the colors
argument of plot()
. Options include
bw
for black and whitegs
for gray scale- Other presets, which you can find along the y-axis of the plot produced by
show_pals()
- Be sure to verify the palette has at least as many colors as the
group
has levels.
- Be sure to verify the palette has at least as many colors as the
- A vector of color names or codes:
colors = c("red", "green", "#0000ff)
- Get color codes from this web app
The default is set1
from show_pals()
.
Use another preset from show_pals()
, “ipsum”:
ggpredict(mod, terms = c("age", "education")) |> plot(colors = "ipsum")
Manually specify two colors, since sex
has three levels:
ggpredict(mod, terms = c("age", "sex")) |> plot(colors = c("#555599", "#DD4444"))
4.4 Themes
plot()
uses its own custom ggplot theme.
Add use.theme = F
to get ggplot’s default theme.
ggpredict(mod, terms = c("age")) |> plot(use.theme = F)
Then, add theme()
with options to customize the theme:
ggpredict(mod, terms = c("age")) |> plot(use.theme = F) +
theme(panel.border = element_rect(color = "black", fill = NA),
panel.grid = element_line(color = "#EEEEEE"),
panel.background = element_rect(fill = NA))
Rr use one of ggplot’s preset themes, theme_minimal()
:
ggpredict(mod, terms = c("age")) |>
plot(use.theme = F) +
theme_minimal()
4.5 Labels
4.5.1 Titles and Axes
Labels can be modified with labs()
. Provide a series of aesthetic = text
pairs.
ggpredict(mod, terms = c("age", "sex")) |>
plot() +
labs(title = NULL,
x = "Age",
y = "Income (in US dollars)",
caption = "For individuals with less than high school education working 38 hours per week.\nData source: 2019 American Community Survey.")
4.5.2 Legend Titles
Legends are trickier to modify because a legend’s corresponding aesthetic depends on the theme and variable types.
By default, group
is mapped to color
, so changing the label for color
changes the legend title.
ggpredict(mod, terms = c("age", "education")) |>
plot() +
labs(color = "Education")
If you create a black-and-white plot (colors = "bw"
) with a continuous x
variable, however, group
uses linetype
, so the linetype
label must be adjusted.
ggpredict(mod, terms = c("age", "education")) |>
plot(colors = "bw") +
labs(linetype = "Education")
And if you create a black-and-white plot with a categorical x
variable, the shape
aesthetic is used for group
.
ggpredict(mod, terms = c("education", "sex")) |>
plot(colors = "bw") +
labs(shape = "Sex")
4.5.3 Legend Labels
Likewise, to change the level labels in the legend:
- Use the
scale_*_manual()
function with the corresponding aesthetic in place of*
- e.g.,
scale_color_manual()
for color
- e.g.,
- Set the
values
to a numeric vector of one to number of levels- e.g.,
1:2
for the two-levelsex
,1:6
for the three-leveleducation
- e.g.,
- Set
labels
to a character vector of the level names in order- e.g.,
c("M", "F")
forsex
- e.g.,
ggpredict(mod, terms = c("age", "sex")) |>
plot() +
labs(color = "Sex") +
scale_color_manual(values = 1:2, labels = c("M", "F"))
ggpredict(mod, terms = c("age", "sex")) |>
plot(colors = "bw") +
labs(linetype = "Sex") +
scale_linetype_manual(values = 1:2, labels = c("M", "F"))
ggpredict(mod, terms = c("education", "sex")) |>
plot(colors = "bw") +
labs(shape = "Sex") +
scale_shape_manual(values = 1:2, labels = c("M", "F"))
5 Interactions
Margins plots are especially useful when we have an interaction term with at least one categorical variable.
The coefficient estimates in interactions with categorical variables are adjustments to the main effects.
The utility of plotting models with interactions is that these plots can tell us where we might or might not see differences, and they can guide our modeling process by helping us decide how to relevel our categorical variables.
This section assumes you are familiar with releveling categorical variables. Learn more in the chapter on categorical variables in Data Wrangling with R.
5.1 Categorical * Continuous
Fit a model predicting income
from sex
, age
, and their interaction.
mod <- lm(income ~ sex * age, acs)
Plot the predicted values by age
and sex
.
ggpredict(mod, terms = c("age", "sex")) |>
plot()
We can see that the slope of age for males is much steeper than the slope for females. We can ask two questions about their slopes:
- Are the slopes statistically significantly different from zero?
- Are the slopes statistically significantly different from each other?
When we have an interaction with a binary variable, the model estimates can only partially answer the first question, though they can help us answer the second question.
The first question can be answered with the coefficient estimate for age
. This will tell us whether the slope of age
for the reference level of sex
is different from zero.
The second question can be answered with the coefficient estimate for the interaction between age
and sex
. This will tell us whether the slope of age
is different between the two levels of sex
.
Without releveling, the model estimates cannot tell us whether the slope of the non-reference level of sex
(female) is statistically significantly different from zero.
Look at the model estimates. Note the reference level of sex
is male.
summary(mod)
##
## Call:
## lm(formula = income ~ sex * age, data = acs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63968 -27314 -10707 11290 632470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28096.31 2917.48 9.630 < 2e-16 ***
## sexFemale -1922.65 4193.36 -0.458 0.647
## age 419.90 55.02 7.632 2.85e-14 ***
## sexFemale:age -304.77 77.60 -3.927 8.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50290 on 4201 degrees of freedom
## (795 observations deleted due to missingness)
## Multiple R-squared: 0.04078, Adjusted R-squared: 0.0401
## F-statistic: 59.53 on 3 and 4201 DF, p-value: < 2.2e-16
These estimates tell us, among other things, that:
- The slope of
age
is estimated as 419.9, and it is statistically significant with p < .001. - The slope of
age
for females is estimated to be -304.77 different from the slope ofage
for males, and this difference is statistically significant with p < .001.
What we do not know, however, is whether the slope of age
for females is statistically significantly different from zero. If we add together the age
slope for males and the interaction of age
and sex
, we end up with an estimated slope of 115.13. This is closer to zero, but we do not have a significance test until we relevel our sex
variable.
We can perform this test by releveling sex
to make female the reference category.
(It should be clarified that this is the same model as before. The model fit is exactly the same, just like it would be if we centered or rescaled variables. We only applied a linear transformation to a predictor by shifting its zero point.)
library(forcats)
lm(income ~ fct_relevel(sex, "Female") * age, acs) |>
summary()
##
## Call:
## lm(formula = income ~ fct_relevel(sex, "Female") * age, data = acs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63968 -27314 -10707 11290 632470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26173.67 3012.07 8.690 < 2e-16 ***
## fct_relevel(sex, "Female")Male 1922.65 4193.36 0.458 0.6466
## age 115.13 54.73 2.104 0.0355 *
## fct_relevel(sex, "Female")Male:age 304.77 77.60 3.927 8.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50290 on 4201 degrees of freedom
## (795 observations deleted due to missingness)
## Multiple R-squared: 0.04078, Adjusted R-squared: 0.0401
## F-statistic: 59.53 on 3 and 4201 DF, p-value: < 2.2e-16
This formulation of the model tells us that the slope of age
for females is statistically significantly different from zero. However, the p-value is just below .05, so we should be suspicious of this finding. Furthermore, we have not even checked our model assumptions!
5.2 Categorical * Categorical
Now, suppose we want to know about expected differences in income by two categorical variables: sex and level of education. Specifically, we want to investigate the marginal effect of high school education versus some college education for males and females.
Fit a model predicting income
from sex
, education
, and their interaction.
mod <- lm(income ~ sex * education, acs)
Plot the predicted values of income by sex
and education
.
ggpredict(mod,
terms = c("sex", "education")) |>
plot()
We can see that for males, the expected difference in income between high school education and some college education is statistically significant. Looking at the confidence intervals for these two levels of education (shown as green and blue in the plot), we see that the confidence interval for one level of education does not include the point estimate for the other level of education.
For females, however, we see that these two points have much closer predicted probabilities, and the confidence interval for either one includes the point of the other.
We can formally test these differences by releveling our predictors.
First, test the difference between the two levels of education for males. Leave sex
at its reference level (male), but change the reference level for education to high school.
lm(income ~ sex * fct_relevel(education, "High school"), acs) |>
summary()
##
## Call:
## lm(formula = income ~ sex * fct_relevel(education, "High school"),
## data = acs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98731 -20717 -7448 11083 585769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39227 1690 23.213 < 2e-16 ***
## sexFemale -17512 2514 -6.966 3.77e-12 ***
## fct_relevel(education, "High school")Less than high school -23304 3315 -7.030 2.40e-12 ***
## fct_relevel(education, "High school")Some college 8689 2791 3.113 0.00186 **
## fct_relevel(education, "High school")Associate's degree 14421 3670 3.930 8.63e-05 ***
## fct_relevel(education, "High school")Bachelor's degree 32752 3164 10.352 < 2e-16 ***
## fct_relevel(education, "High school")Advanced degree 63704 4115 15.481 < 2e-16 ***
## sexFemale:fct_relevel(education, "High school")Less than high school 11060 4996 2.214 0.02691 *
## sexFemale:fct_relevel(education, "High school")Some college -4793 4049 -1.184 0.23666
## sexFemale:fct_relevel(education, "High school")Associate's degree -1358 5150 -0.264 0.79203
## sexFemale:fct_relevel(education, "High school")Bachelor's degree -5702 4411 -1.293 0.19620
## sexFemale:fct_relevel(education, "High school")Advanced degree -13964 5680 -2.458 0.01400 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46860 on 4193 degrees of freedom
## (795 observations deleted due to missingness)
## Multiple R-squared: 0.1685, Adjusted R-squared: 0.1663
## F-statistic: 77.25 on 11 and 4193 DF, p-value: < 2.2e-16
The coefficient for some college is statistically significant, confirming what we saw in the plot.
Now, also change the reference level for sex
.
lm(income ~ fct_relevel(sex, "Female") * fct_relevel(education, "High school"), acs) |>
summary()
##
## Call:
## lm(formula = income ~ fct_relevel(sex, "Female") * fct_relevel(education,
## "High school"), data = acs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98731 -20717 -7448 11083 585769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21715 1861 11.668 < 2e-16 ***
## fct_relevel(sex, "Female")Male 17512 2514 6.966 3.77e-12 ***
## fct_relevel(education, "High school")Less than high school -12244 3738 -3.276 0.001062 **
## fct_relevel(education, "High school")Some college 3897 2934 1.328 0.184179
## fct_relevel(education, "High school")Associate's degree 13063 3613 3.616 0.000303 ***
## fct_relevel(education, "High school")Bachelor's degree 27050 3074 8.801 < 2e-16 ***
## fct_relevel(education, "High school")Advanced degree 49740 3916 12.702 < 2e-16 ***
## fct_relevel(sex, "Female")Male:fct_relevel(education, "High school")Less than high school -11060 4996 -2.214 0.026906 *
## fct_relevel(sex, "Female")Male:fct_relevel(education, "High school")Some college 4793 4049 1.184 0.236657
## fct_relevel(sex, "Female")Male:fct_relevel(education, "High school")Associate's degree 1358 5150 0.264 0.792026
## fct_relevel(sex, "Female")Male:fct_relevel(education, "High school")Bachelor's degree 5702 4411 1.293 0.196195
## fct_relevel(sex, "Female")Male:fct_relevel(education, "High school")Advanced degree 13964 5680 2.458 0.014004 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46860 on 4193 degrees of freedom
## (795 observations deleted due to missingness)
## Multiple R-squared: 0.1685, Adjusted R-squared: 0.1663
## F-statistic: 77.25 on 11 and 4193 DF, p-value: < 2.2e-16
For females, the coefficient estimate for some college is not statistically significant at p = .18, again confirming what we saw in the plot.
This approach of first examining margins plots helps guide our modeling, and it prevents us from making incorrect inferences about effects across the levels of categorical variables. If we did not visualize the results and relevel sex
, we may have incorrectly assumed that there was a significant difference between two levels of education, but in fact this was only true for males and not females!
6 Logistic Regression
The examples on this page so far have all used linear regression, but ggpredict()
can help us visualize results from many kinds of models, including logistic regression.
As an example, we can first add a binary variable indicating whether somebody has a higher education degree, and then fit a model predicting this variable.
library(dplyr)
acs <-
acs |>
mutate(higher_ed_degree = as.numeric(education %in% c("Bachelor's degree", "Advanced degree")))
mod <-
glm(higher_ed_degree ~ age * sex + commute_time + weeks_worked + hours_worked,
acs,
family = binomial)
Neither the ggpredict()
nor the plot()
functions need anything special. We can continue to use them just as we did for linear regression. ggpredict()
will calculate the predicted probabilities associated with the terms we provided.
ggpredict(mod, terms = c("age", "sex")) |>
plot()
7 Saving Plots
Use ggsave()
to save plots. This function saves the most recently created plot, and it supports several file extensions. See Using R Plots in Documents for more on ggsave()
.
If you are using the default theme from ggeffects
, add bg = "white"
. Otherwise, the background will be transparent.
ggsave("plot.png", width = 6, height = 4, bg = "white")
8 Exercises
Continue to use the ACS dataset, or pick one of the following datasets from ggplot2
:
msleep
, mammals’ sleep statistics and other characteristicsmidwest
, data on Midwest counties from the 2000 Censusdiamonds
, attributes of diamonds
You may also pick any other dataset to which you have access.
Fit a model with at least three predictors. Do not concern yourself with whether the model makes sense or violates assumptions.
Create two margins plots, where one has at least two terms.
Plot margins at specific values of your terms.
Plot margins under custom constraints on the covariates.
Change the colors of your plot.
Change the labels of the axes, title, and legend.
Save your plot.