Regression Diagnostics with Stata
This book uses Stata. An R version of this book is available at Regression Diagnostics with R.
Regression diagnostics are a critical step in the modeling process.
Diagnostics for regression models are tools that assess a model’s compliance with its assumptions and investigate whether any single observation or group of observations is not well represented by the model. These tools allow researchers to evaluate whether a model appropriately represents the data of their study.
In this book we separate diagnostics from the other parts of model selection to provide a focus on this important topic. This separation is not meant to imply that these tools are used separately from other regression modeling tools.
1.1 How to Use This Guide
This guide is intended to be “complete but not comprehensive.” It is “complete” in that it covers the major assumptions of regression, visual and statistical diagnostic tests (where applicable), and corrective actions. It is “not comprehensive” because it covers only some of the available diagnostic tests and corrective actions, and it gives limited attention to diagnostics for generalized linear models. At a minimum, you should run the tests covered in this book.
When you are fitting and selecting a regression model:
1. Review its assumptions. Some common models’ assumptions are listed in the next chapter.
2. Test each assumption, and apply corrections if needed. Chapters 3-8 go through diagnostic tests. The examples are all general linear models, but the tests can be extended to suit other models.
3. Repeat step 2. After you have applied any corrections or changed your model in any way, you must re-check each assumption.
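The three steps above can be sketched in Stata as follows. This is a minimal illustration rather than the book's own example. It uses Stata's built-in auto dataset instead of the ACS sample, it lets one visual check (`rvfplot`) and one statistical check (`estat hettest`) stand in for the fuller set of diagnostics covered in later chapters, and it assumes, purely for the sake of the example, that log-transforming the outcome is the correction being tried.

```stata
* Illustration only: Stata's built-in auto data, not the ACS sample
sysuse auto, clear

* Fit a candidate model
regress price mpg weight

* Test an assumption (here, constant error variance)
* Visual check: residuals versus fitted values
rvfplot, yline(0)
* Statistical check: Breusch-Pagan / Cook-Weisberg test
estat hettest

* Apply a possible correction (assumed here: log-transform the outcome)
generate log_price = ln(price)
regress log_price mpg weight

* Re-check the same assumption on the revised model
rvfplot, yline(0)
estat hettest
```

Every diagnostic is rerun after the model changes, because a correction aimed at one assumption can alter how well the model meets the others.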
Some diagnostic tests are statistical, and others are visual. Statistical tests are more objective, while visual tests are more informative. As with any statistical test, very large effects can be statistically non-significant in small samples, and very small effects can be statistically significant in large samples. Visual tests are subjective but provide more information about the nature and magnitude of an assumption violation, and they can suggest possible corrective actions. Running both types of tests, where applicable, is highly recommended.
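A small simulation can make the sample-size point concrete. The sketch below is not from the book: the data are simulated, and the variable names, seed, sample sizes, and the size of the violation are all assumptions chosen for illustration. The error standard deviation rises slightly with `x`, so the constant-variance assumption is mildly violated to the same degree in both fits.

```stata
* Simulated data: the error SD rises slightly with x (a mild violation)
clear
set seed 12345
set obs 5000
generate x = runiform()
generate y = 1 + 2*x + rnormal(0, 1 + 0.2*x)

* Large sample: all 5000 observations
regress y x
estat hettest
rvfplot, yline(0)

* Small sample: the first 50 observations only
regress y x in 1/50
estat hettest
rvfplot, yline(0)
```

With data like these, the statistical test will often reject in the large sample and fail to reject in the small one, even though the violation is the same size in both. The residual plots let you judge that size directly. Exact p-values depend on the seed.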
1.2 Why Run Diagnostics?
You should not consider your model complete unless you have checked your assumptions through visual and/or statistical tests. If you do not do this, you cannot trust your results.
1.3 Example Dataset
In each chapter, we will fit models and assess diagnostics using a sample from the 2019 American Community Survey (ACS). The sample contains 5000 individuals from Wisconsin.
The sample dataset and its codebook are available for download.
This dataset contains 5000 observations of 15 variables. The variables have been renamed and in some cases recoded. The original names are in parentheses.
- household (SERIALNO): housing unit or group quarters serial number
- person (SPORDER): person number
- state (ST): state; all 55 (Wisconsin) in this sample
- age (AGEP): age in years, top-coded at 99
- other_language (LANX): indicator of whether another language is spoken at home
- english (ENG): self-rated ability to speak English, if another language is spoken
- commute_time (JWMNP): travel time to work in minutes, top-coded at 200
- marital_status (MAR): marital status
- education (SCHL): educational attainment, collapsed into categories
- sex (SEX): sex (male or female)
- hours_worked (WKHP): usual hours worked per week in the past 12 months, top-coded at 99
- weeks_worked (WKWN): weeks worked in the past 12 months (naturally) top-coded at 52
- race (RAC1P): race, with some categories collapsed
- hispanic (HISP): Hispanic origin, with categories collapsed to create a binary indicator
- income (PINCP): total income in current dollars, rounded, bottom-coded at -19998, top-coded at 4209995
The full dataset and documentation are also available.
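After downloading the sample, a few quick checks can confirm that it matches the description above. This is a minimal sketch under stated assumptions: the filename `acs_sample.dta` is hypothetical (substitute the path to your downloaded file), and the file is assumed to be in Stata format with the variable names listed above.

```stata
* Hypothetical filename: replace with the path to the downloaded sample
use "acs_sample.dta", clear

* Confirm the size described above: 5000 observations of 15 variables
describe, short
assert _N == 5000
assert c(k) == 15

* Inspect the ranges of the top-coded and bottom-coded variables
summarize age commute_time hours_worked weeks_worked income

* Every observation should be from Wisconsin
tabulate state, missing
```

If either `assert` fails, Stata stops with an error, which signals that the file in memory is not the sample described here.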
In addition to this book, we recommend consulting the resources below. These books are all accessible online via the UW-Madison Libraries.
- Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. Wiley. https://doi.org/10.1002/0471725153
- Best, H., & Wolf, C. (Eds.) (2014). The SAGE handbook of regression analysis and causal inference. SAGE. https://doi.org/10.4135/9781446288146
- Fox, J. D. (2020). Regression diagnostics: An introduction (2nd ed.). SAGE. https://doi.org/10.4135/9781071878651
- Osborne, J. (2013). Best practices in data cleaning: A complete guide to everything you need to do before and after collecting your data. SAGE. https://doi.org/10.4135/9781452269948
- Osborne, J. (2015). Best practices in logistic regression. SAGE. https://doi.org/10.4135/9781483399041
- Osborne, J. (2017). Regression & linear modeling: Best practices and modern methods. SAGE. https://doi.org/10.4135/9781071802724