# 1 Introduction

This book uses Stata. An R version of this book is available at Regression Diagnostics with R.

Regression diagnostics are a critical step in the modeling process.

Diagnostics for regression models are tools that assess a model’s compliance with its assumptions and investigate whether any single observation or group of observations is not well represented by the model. These tools allow researchers to evaluate whether a model appropriately represents the data of their study.

In this book we separate diagnostics from the other parts of model selection to provide a focus on this important topic. This separation is not meant to imply that these tools are used separately from other regression modeling tools.

## 1.1 How to Use This Guide

This guide is intended to be “complete but not comprehensive.” It is “complete” in that it covers the major assumptions of regression, visual and statistical diagnostic tests (where applicable), and corrective actions. It is “not comprehensive” because it covers only some of the available diagnostic tests and corrective actions, and it gives limited attention to diagnostics for generalized linear models. At a minimum, you should run the tests we cover in this book.

When you are fitting and selecting a regression model, follow these steps:

1. Review its assumptions. The assumptions of some common models are listed in the next chapter.

2. Test each assumption, and apply corrections if needed. Chapters 3-8 go through diagnostic tests. The examples are all general linear models, but the tests can be extended to suit other models.

3. Repeat step 2. After you have applied any corrections or changed your model in any way, you must re-check each assumption.
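
The steps above can be sketched in Stata as follows. This is a minimal illustration rather than a recommended analysis: it assumes Stata's bundled `auto` dataset in place of the book's ACS sample, it checks only one assumption (constant error variance), and the log transformation of the outcome is a hypothetical correction chosen only to show that the model changes between checks.

```stata
* Illustrative only: Stata's bundled auto dataset stands in for your own data.
sysuse auto, clear

* Steps 1 and 2: fit a candidate model, then check an assumption.
regress price weight mpg

* Visual check: residuals plotted against fitted values.
rvfplot, yline(0)

* Statistical check: Breusch-Pagan / Cook-Weisberg test for heteroskedasticity.
estat hettest

* A possible correction (an assumption made for this sketch, not a rule):
* model the log of the outcome instead.
generate ln_price = ln(price)
regress ln_price weight mpg

* Step 3: the model has changed, so re-check the assumption.
rvfplot, yline(0)
estat hettest
```

In practice, the re-check in step 3 should cover every assumption of the model, not only the one that prompted the correction.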

Some diagnostic tests are statistical, and others are visual. Statistical tests are more objective, while visual tests are more informative. As with any statistical test, a large assumption violation can be statistically non-significant in a small sample, and a trivial violation can be statistically significant in a large sample. Visual tests are subjective, but they provide more information about the nature and magnitude of an assumption violation, and they can suggest possible corrective actions. We highly recommend running both types of tests where applicable.
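
To see why sample size matters, consider the simulated sketch below. The data, seed, and variable names (`x`, `y`) are all made up for illustration: the error standard deviation rises only from 1.00 to 1.05 across the range of `x`, a violation that is hard to see in a plot, yet with 100,000 observations the statistical test is likely to flag it.

```stata
* Simulated data: a very mild violation of constant variance in a large sample.
clear
set seed 12345
set obs 100000

generate x = runiform()

* The error standard deviation rises only slightly with x (from 1.00 to 1.05).
generate y = 1 + 2*x + rnormal(0, 1 + 0.05*x)

regress y x

* Visual test: the change in spread across fitted values is barely visible.
rvfplot, yline(0)

* Statistical test: in a sample this large, even this small departure
* from constant variance is likely to be statistically significant.
estat hettest
```

Reading the plot alongside the test result helps you judge whether a statistically significant violation is large enough to warrant corrective action.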

## 1.2 Why Run Diagnostics?

You should not consider your model complete unless you have checked your assumptions through visual and/or statistical tests. If you do not do this, you cannot trust your results.

## 1.3 Example Dataset

In each chapter, we will fit models and assess diagnostics using a sample from the 2019 American Community Survey (ACS). The sample contains 5000 individuals from Wisconsin.

This dataset contains 5000 observations of 15 variables. The variables have been renamed and in some cases recoded. The original names are in parentheses.

• household (SERIALNO): housing unit or group quarters serial number
• person (SPORDER): person number
• state (ST): state; all 55 (Wisconsin) in this sample
• age (AGEP): age in years, top-coded at 99
• other_language (LANX): indicator for whether another language is spoken at home
• english (ENG): self-rated ability to speak English, if another language is spoken
• commute_time (JWMNP): travel time to work in minutes, top-coded at 200
• marital_status (MAR): marital status
• education (SCHL): educational attainment, collapsed into categories
• sex (SEX): sex (male or female)
• hours_worked (WKHP): usual hours worked per week in the past 12 months, top-coded at 99
• weeks_worked (WKWN): weeks worked in the past 12 months (naturally top-coded at 52)
• race (RAC1P): race, with some categories collapsed
• hispanic (HISP): Hispanic origin, with categories collapsed to create a binary indicator
• income (PINCP): total income in current dollars, rounded, bottom-coded at -19998, top-coded at 4209995

The full dataset and documentation are also available.

## 1.4 Resources

In addition to this book, we recommend consulting the resources below. These books are all accessible online via the UW-Madison Libraries.