Visualizing Regression Results in R
This article will teach you how to use ggpredict() and plot() to visualize the marginal effects of one or more variables of interest in linear and logistic regression models. You will learn how to specify predictor values and how to fix covariates at specific values, as well as options for customizing plots.
Marginal means are predicted outcomes given certain constraints, and a marginal effect is the predicted change in the outcome after varying a variable of interest while holding others constant.
As our models grow in complexity and dimensionality, we face increasing difficulty in interpreting coefficients. Visualizing margins helps us better understand and communicate our model results.
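To make the definition concrete before we turn to the survey data, here is a minimal base-R sketch (using R's built-in mtcars data, not the data loaded below) showing that in an additive linear model the marginal effect of a predictor is simply the predicted change in the outcome as that predictor varies, with the other predictors held constant:

```r
# Marginal effect of weight (wt) on mpg, holding horsepower (hp) at its mean
m <- lm(mpg ~ wt + hp, data = mtcars)

# Predicted outcomes at two values of wt, hp held constant
p <- predict(m, newdata = data.frame(wt = c(3, 4), hp = mean(mtcars$hp)))

# In this additive model, the predicted change equals the wt coefficient
diff(p)
coef(m)["wt"]
```

In models with interactions, like the one we fit below, the marginal effect of a variable depends on the values of the variables it interacts with, which is exactly why plots help.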
To begin, load the ggeffects and ggplot2 packages. ggeffects has the ggpredict() function, which we will use to calculate margins, and ggplot2 has additional plotting functions.
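A minimal setup, assuming both packages are already installed (install them first with install.packages() if needed):

```r
library(ggeffects)  # ggpredict() for calculating margins
library(ggplot2)    # plotting functions for customizing margins plots
```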
Load a sample of the 2019 American Community Survey. The variables in this dataset are described in Regression Diagnostics with R.
acs <- readRDS(url("https://sscc.wisc.edu/sscc/pubs/data/RegDiag/acs2019sample.rds"))
Fit a linear model predicting income from age, education, the interaction of the two, hours_worked per week, and sex.
mod <- lm(income ~ age * education + hours_worked + sex, acs)
Now, suppose we want to understand the marginal effect of age for females with a bachelor's degree who work 40 hours per week.
View the model estimates with summary(mod).
## 
## Call:
## lm(formula = income ~ age * education + hours_worked + sex, data = acs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -104651  -20303   -5457   10053  575865 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -14651.75    7303.23  -2.006 0.044933 *  
## age                                321.88     188.29   1.709 0.087479 .  
## educationHigh school             -2538.18    8806.58  -0.288 0.773204    
## educationSome college           -11156.06    9145.48  -1.220 0.222629    
## educationAssociate's degree       6956.78   12208.20   0.570 0.568829    
## educationBachelor's degree       -2617.29   10355.82  -0.253 0.800491    
## educationAdvanced degree         43542.41   15015.65   2.900 0.003764 ** 
## hours_worked                      1061.12      72.49  14.637  < 2e-16 ***
## sexFemale                       -15115.84    1966.42  -7.687 2.08e-14 ***
## age:educationHigh school           174.06     213.52   0.815 0.415029    
## age:educationSome college          505.64     223.04   2.267 0.023466 *  
## age:educationAssociate's degree    154.85     280.84   0.551 0.581415    
## age:educationBachelor's degree     813.06     246.26   3.302 0.000973 ***
## age:educationAdvanced degree       329.19     316.10   1.041 0.297781    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49320 on 2747 degrees of freedom
##   (2239 observations deleted due to missingness)
## Multiple R-squared:  0.2391, Adjusted R-squared:  0.2355 
## F-statistic: 66.39 on 13 and 2747 DF,  p-value: < 2.2e-16
Our model estimates give us the coefficient estimates and significance tests we have been trained to interpret and report, but they are less useful when we want to calculate expected outcomes.
ggpredict() with plot() provides a visual aid for selected terms. The two approaches are complementary; neither can fully replace the other.
ggpredict(mod, terms = "age", condition = c(education = "Bachelor's degree", hours_worked = 40, sex = "Female")) |> plot()
This plot helps us visualize the marginal effect of age on income when we hold education, hours_worked, and sex at specific values. The expected increase in income with age appears to be quite substantial.
We will continue to plot margins from mod, our regression model fit to the acs data.
We will use two functions to create margins plots: ggpredict() and plot(). ggeffects has an additional method for plot() that creates margins plots with ggplot2.
Two arguments of ggpredict() that we will use are model and terms. model is just the name of our fitted model, and terms takes a character vector of up to four predictor names, but here we will use at most three. The three terms are mapped to the ggplot aesthetics of x, group, and facet, in that order.
- x is mapped to the x-axis
- groups have different lines, colors, and/or shapes
- facets are separate sub-plots
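For example, supplying three terms maps the first to the x-axis, the second to group, and the third to facet. A sketch using mod from above (the particular terms here are an illustrative choice; any three predictors would work):

```r
# age on the x-axis, separate lines/colors by sex, one panel per education level
ggpredict(mod, terms = c("age", "sex", "education")) |> plot()
```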
The three terms take different types of variables.
- x can be continuous or categorical
- group and facet must be categorical, or they will be made categorical
- See Margins at Specific Values to specify how continuous variables are handled (e.g., by using specific values, mean ± SD, etc.)
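For instance, a continuous term can be held to hand-picked values with ggeffects' bracket syntax inside the term string (the ages below are arbitrary illustrative choices):

```r
# Predictions at ages 25, 45, and 65 only, with separate lines by sex
ggpredict(mod, terms = c("age [25, 45, 65]", "sex")) |> plot()
```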
Our model has two continuous variables (age and hours_worked) and two categorical variables (education and sex). We can create a range of plots with different numbers and orders of these variables.
ggpredict(mod, terms = c("age")) |> plot()
ggpredict(mod, terms = c("education")) |> plot()
ggpredict(mod, terms = c("age", "hours_worked")) |> plot()
ggpredict(mod, terms = c("education", "sex")) |> plot(colors = "bw")
One continuous then one categorical:
ggpredict(mod, terms = c("age", "sex")) |> plot()
One categorical then one continuous. Compare this plot to the previous plot, noticing how age is turned into a categorical variable since it is the group term.
ggpredict(mod, terms = c("sex", "age")) |> plot()