2 Model Assumptions

2.1 The General Linear Model

The general linear model is fit with R’s lm() and Stata’s regress; a small R example follows the definitions below. It can be written as

\(Y = X \beta + \epsilon\)

where

  • \(Y\) is a vector of dependent variable (outcome) values
  • \(X\) is a matrix of independent variable (predictor) values
  • \(\beta\) is a vector of regression parameter coefficients (including the intercept)
  • \(\epsilon\) is a vector of errors, the sum of all other influences on \(Y\) after accounting for \(X\)
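
A minimal sketch of fitting such a model in R with lm(), using a small simulated data set (the data frame d and the variables x1, x2, and y are hypothetical):

    # Simulate data consistent with Y = X beta + error
    set.seed(1)
    d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    d$y <- 1 + 2 * d$x1 - 0.5 * d$x2 + rnorm(100)

    fit <- lm(y ~ x1 + x2, data = d)  # the intercept is included by default
    summary(fit)                      # coefficients, standard errors, R-squared

The Stata equivalent of the lm() call is regress y x1 x2.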

This model makes several assumptions about the error term \(\epsilon\):

  1. Linearity: \(E(Y) = X \beta\). The expected (mean) outcome is a linear function of the predictors. This implies that the expected error, conditional on \(X\), is zero: \(E(\epsilon | X) = 0\).

  2. Normality: \(\epsilon_i \sim \mathcal{N}(0, \sigma^{2}_\epsilon)\). Errors follow a normal distribution.

  3. Homoscedasticity: \(Var(\epsilon | X) = \sigma^{2}_\epsilon\). Errors, conditional on \(X\), have constant variance.

  4. Independence: \(Cov(\epsilon_i , \epsilon_j | X) = 0\) for \(i \neq j\). Errors do not covary with one another after conditioning on \(X\).
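
Several of these error assumptions can be checked visually. A minimal sketch, continuing the hypothetical fit from above:

    par(mfrow = c(2, 2))  # arrange the four default diagnostic plots in a grid
    plot(fit)             # residuals vs. fitted (linearity, homoscedasticity),
                          # normal Q-Q (normality), scale-location
                          # (homoscedasticity), residuals vs. leverage (influence)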

Other assumptions include:

  1. No multicollinearity: Independent variables cannot be linearly predicted from one another. Perfect collinearity makes the coefficients inestimable, and near-collinearity inflates their standard errors (see the checks sketched after this list).

  2. No outlier effects: The model represents the data well, and no observations disproportionately influence the model fit.

  3. No measurement error: Predictors are measured perfectly.

  4. No specification error: The statistical model matches the data-generating process, both in its functional form and in the variables it includes.
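
A sketch of two common checks for these assumptions, again using the hypothetical fit from above (vif() requires the car package to be installed):

    car::vif(fit)        # variance inflation factors; values well above ~5-10
                         # are often read as signs of multicollinearity
    cooks.distance(fit)  # Cook's distance; large values flag observations that
                         # disproportionately influence the model fit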

2.2 Generalizations and Extensions

Generalized linear models are generalizations of the general linear model. In the general linear model, the outcome variable (response) itself is assumed to be a linear function of the predictors. Generalized linear models instead assume, through the use of a link function, that a transformation of the expected outcome is a linear function of the predictors. In these models, the errors are not assumed to be normally distributed or to have constant variance. Common generalized linear models include logistic, ordinal, and Poisson regression.
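
A minimal sketch of two such models fit with R’s glm(), using hypothetical simulated outcomes (passed is binary, visits is a count):

    d2 <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    d2$passed <- rbinom(100, 1, plogis(0.5 * d2$x1))  # hypothetical binary outcome
    d2$visits <- rpois(100, exp(0.2 * d2$x2))         # hypothetical count outcome

    logit_fit <- glm(passed ~ x1 + x2, data = d2, family = binomial)  # logit link
    pois_fit  <- glm(visits ~ x1 + x2, data = d2, family = poisson)   # log link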

There is also an ever-growing list of extensions to the general linear model that help us address problems like endogeneity, dependence, multicollinearity, and measurement error. They also come in generalized forms, such as logistic survey models.

Some extensions can help deal with violations of linearity (e.g., instrumental variables estimation, which targets endogenous predictors) or independence (panel, time series, survey, and mixed effects models). Others, like structural equation modeling, can help address multiple problems at the same time, including multicollinearity and measurement error.

Be aware that while generalizations and extensions of the general linear model relax some assumptions, they often impose additional ones. For example, mixed effects models account for dependence in the data, but they make additional (and testable!) assumptions about the relationships among predictors, residuals, and random effects within and across levels.
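
As one illustration, a minimal sketch of a random intercept mixed effects model with the lme4 package (assumed installed; the data frame d3 and its grouping structure are hypothetical):

    library(lme4)
    d3 <- data.frame(group = factor(rep(1:10, each = 10)), x1 = rnorm(100))
    d3$y <- 1 + 0.5 * d3$x1 + rnorm(10)[d3$group] + rnorm(100)  # group-level noise

    mixed_fit <- lmer(y ~ x1 + (1 | group), data = d3)  # random intercept per group
    summary(mixed_fit)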