4  Regression

All files for this page's example analysis are available at this link.


4.1 Example Model

The dataset for this example has 1000 observations. 8.8% of x1 is missing. x1 is missing completely at random (MCAR); its missingness is unrelated to any observed or unobserved values.

Below is a table of estimates from a model predicting y with (1) the complete data and (2) the incomplete data, analyzed with listwise deletion.

                    Complete Data                  Listwise Deletion
  Parameter     Estimate   SE   95% CI          Estimate   SE   95% CI
  Intercept       -0.02   0.03  -0.08 – 0.04      -0.02   0.03  -0.08 – 0.04
  x1               0.96   0.03   0.90 – 1.03       0.97   0.03   0.90 – 1.03
  x2               1.01   0.03   0.95 – 1.07       1.01   0.03   0.94 – 1.07
  Observations     1000                             912

As expected, the listwise-deletion estimates are unbiased under MCAR.

4.2 Blimp Input

The Blimp script for this analysis is as follows:

DATA: 01_regression.csv;
FIXED: x2;
MODEL: y ~ x1 x2; # regress y on x1 and x2
SEED: 123;
BURN: 5000;
ITERATIONS: 5000;

Selected sections of the script are discussed below.

4.2.1 Basic Syntax

Each line in a Blimp script follows this pattern:

COMMAND: value;

The commands are not case-sensitive, but capitalizing them improves readability. Line breaks are ignored, and multiple values can be supplied so long as each ends with a semicolon (e.g., multiple formulas, as in the mediation example):

COMMAND:
value;
COMMAND:
value;
value;
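
For instance, the MODEL command can take several formulas at once; a hypothetical mediation-style sketch with variables x, m, and y:

MODEL:
m ~ x;
y ~ m x;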

Comments begin with the # symbol; anything after a # on the same line is ignored.

4.2.2 Missing Data

By default, Blimp interprets only NA as missing. Our data file already codes missing values as NA, so nothing more needs to be specified.
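
If a data file instead uses a numeric code for missing values, the code can be declared with the MISSING command (a sketch assuming a hypothetical code of 999):

MISSING: 999;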

Any exogenous variable with no missing data can be listed after the FIXED command, as we have done here for x2. This speeds up computation: Blimp imputes predictors with a round-robin technique in which each predictor is regressed on all the others, and specifying a variable as fully observed via FIXED skips the model predicting that variable.

4.2.3 Model Syntax

Regression formulas are specified as

outcome ~ predictor1 predictor2;

where anything to the left of ~ is the outcome and anything to the right is a predictor; multiple predictors are separated by spaces.
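
Blimp also accepts product (interaction) terms written with an asterisk; a hypothetical example:

outcome ~ predictor1 predictor2 predictor1*predictor2;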

4.2.4 Random Seeds

Blimp requires a random seed. Setting one ensures that the results are the same each time we run the script, as long as nothing else has changed. Any number can be used as a random seed: 123 (if you're not feeling creative), 1, your zip code, today's date, your birthday, your favorite number, etc.

4.2.5 Iterations

We must also specify the number of burn-in (BURN) and post-burn-in iterations (ITERATIONS).

MCMC chains start at random values for each parameter, and we discard the early iterations to remove the influence of those starting values. This discarded period is called the burn-in or “warm-up.”1 We then keep the post-burn-in iterations to form the posterior distribution. The default number of chains is two, which is sufficient but can be modified with the CHAINS command. Each chain discards BURN iterations, and then ITERATIONS is split among the chains. Each chain therefore runs a total of \(BURN + \frac{ITERATIONS}{Number\: of\: chains}\) iterations. In our example, with 5000 for each command, each chain completes 7500 iterations, of which we keep only the last 2500.
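
For instance, raising the number of chains to four while keeping our other settings gives each chain \(5000 + \frac{5000}{4} = 6250\) iterations, of which the last 1250 are kept (a sketch; only the CHAINS line is new):

BURN: 5000;        # each chain discards its first 5000 iterations
ITERATIONS: 5000;  # 5000 kept iterations in total, split across chains
CHAINS: 4;         # four chains instead of the default two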

As a starting point, try 5000 burn-in and 5000 post-burn-in iterations, then increase these numbers depending on convergence.

4.3 Blimp Output

Running the Blimp script produces this output file:


---------------------------------------------------------------------------

                                   Blimp
                                  3.2.10

             Blimp was developed with funding from Institute of
           Education Sciences awards R305D150056 and R305D190002.

            Craig K. Enders, P.I. Email: cenders@psych.ucla.edu
           Brian T. Keller, Co-P.I. Email: btkeller@missouri.edu
                 Han Du, Co-P.I. Email: hdu@psych.ucla.edu
                 Roy Levy, Co-P.I. Email: roy.levy@asu.edu

              Programming and Blimp Studio by Brian T. Keller

                   There is no expressed license given.

---------------------------------------------------------------------------


ALGORITHMIC OPTIONS SPECIFIED:

  Imputation method:                 Fully Bayesian model-based
  MCMC algorithm:                    Full conditional Metropolis sampler with
                                       Auto-Derived Conditional Distributions
  Between-cluster imputation model:  Not applicable, single-level imputation
  Prior for random effect variances: Not applicable, single-level imputation
  Prior for residual variances:      Zero sum of squares, df = -2 (PRIOR2)
  Prior for predictor variances:     Unit sum of squares, df = 2 (XPRIOR1)
  Chain Starting Values:             Random starting values



BURN-IN POTENTIAL SCALE REDUCTION (PSR) OUTPUT:

  NOTE: Split chain PSR is being used. This splits each chain's
        iterations to create twice as many chains.

  Comparing iterations across 2 chains     Highest PSR   Parameter #  
                          126 to 250             1.025            10  
                          251 to 500             1.014             4  
                          376 to 750             1.017            10  
                          501 to 1000            1.007             1  
                          626 to 1250            1.006             8  
                          751 to 1500            1.004            10  
                          876 to 1750            1.008            11  
                         1001 to 2000            1.002            11  
                         1126 to 2250            1.003            11  
                         1251 to 2500            1.003            11  
                         1376 to 2750            1.001             3  
                         1501 to 3000            1.001             7  
                         1626 to 3250            1.002             5  
                         1751 to 3500            1.002            11  
                         1876 to 3750            1.002            11  
                         2001 to 4000            1.001            11  
                         2126 to 4250            1.002            10  
                         2251 to 4500            1.001             5  
                         2376 to 4750            1.001             7  
                         2501 to 5000            1.001             7  


METROPOLIS-HASTINGS ACCEPTANCE RATES:

  Chain 1:

    Variable                            Type    Probability   Target Value 
    x1                            imputation          0.499          0.500 

  NOTE: Suppressing printing of 1 chains.
        Use keyword 'tuneinfo' in options to override.


DATA INFORMATION:

  Sample Size:              1000
  Missing Data Rates:

                        y = 00.00
                       x1 = 08.80



MODEL INFORMATION:

  NUMBER OF PARAMETERS
    Outcome Models:         4
    Predictor Models:       3

  PREDICTORS
    Fixed variables:        x2
    Incomplete continuous:  x1

  MODELS
    [1]  y ~ Intercept x1 x2


WARNING MESSAGES:

  No warning messages.


MODEL FIT:


  INFORMATION CRITERIA

    Marginal Likelihood
      DIC2                  2797.690
      WAIC                  2814.756

    Conditional Likelihood
      DIC2                  2797.690
      WAIC                  2814.756


CORRELATIONS AMONG RESIDUALS:

  No residual correlations.



OUTCOME MODEL ESTIMATES:

  Summaries based on 5000 iterations using 2 chains.


Outcome Variable:  y          

Parameters                           Median     StdDev       2.5%      97.5%        PSR      N_Eff 
                                -------------------------------------------------------------------
Variances:                      
  Residual Var.                       0.938      0.044      0.859      1.029      1.002   3527.214 

Coefficients:                   
  Intercept                          -0.021      0.031     -0.084      0.040      1.001   4155.296 
  x1                                  0.964      0.034      0.898      1.031      1.001   3914.456 
  x2                                  1.009      0.031      0.951      1.069      1.001   3898.620 

Standardized Coefficients:      
  x1                                  0.509      0.015      0.479      0.538      1.002   3884.470 
  x2                                  0.572      0.014      0.543      0.599      1.001   3662.815 

Proportion Variance Explained   
  by Coefficients                     0.721      0.013      0.696      0.744      1.003   3567.152 
  by Residual Variation               0.279      0.013      0.256      0.304      1.003   3567.152 

                                -------------------------------------------------------------------




PREDICTOR MODEL ESTIMATES:

  Summaries based on 5000 iterations using 2 chains.


Missing predictor:  x1         

Parameters                           Median     StdDev       2.5%      97.5%        PSR      N_Eff 
                                -------------------------------------------------------------------

Grand Mean                           -0.013      0.031     -0.072      0.049      1.000   4300.645 

Level 1:                                                                                           
  x2                                  0.216      0.029      0.159      0.273      1.001   3883.537 
  Residual Var.                       0.888      0.041      0.812      0.974      1.000   3451.772 
                                                                                                   
                                -------------------------------------------------------------------

Selected sections of the output are discussed below.

4.3.1 Outcome Models

First off, how well did our model recover the true values? The Outcome Variable: y section contains a table of estimates. For each parameter, we get a measure of central tendency (median) and measures of spread (SD and 95% credible interval upper and lower limits). We also get PSR and N_Eff statistics; see the next section on convergence.

The intercept was simulated to be about zero, and the coefficients of x1 and x2 to be about one. These values are close to the parameter medians and well within their credible intervals. We would conclude that there is a 95% probability that the true effect of x1 on y is between 0.898 and 1.031.

Importantly, our model used all 1000 observations, even though x1 had missing values. At each iteration, plausible values for x1 were imputed, the focal model was estimated, estimates were saved, new values were imputed, and so on. Just like that, with such simple syntax, we fit a Bayesian regression model to incomplete data!

4.3.2 Convergence

How do we know if our model converged? What is convergence, exactly?

Summary:

  • If the potential scale reduction factor for any parameter is greater than 1.05, increase the burn-in iterations.
  • If the number of effective samples for any parameter is less than 100, increase the post-burn-in iterations.
  • If convergence issues persist, revise the model.

4.3.2.1 Mixing and Stationarity

We say that a model has converged if its MCMC chains are mixed and stationary. Mixing means that the chains are exploring the same parameter space (i.e., looking around the same values), and stationarity means the chains have settled in that parameter space (i.e., not wandering around).

It will be helpful to see what the different combinations of mixed and not-mixed, stationary and not-stationary look like:

[Figure: simulated trace plots of the four combinations of mixing and stationarity.]

These examples are simulated to be clear-cut, but real trace plots from real models with real data are less clear.

Inspection of [trace] plots is a notoriously unreliable method of assessing convergence and in addition is unwieldy when monitoring a large number of quantities of interest.2

Instead, we can use potential scale reduction (PSR) factors.

4.3.2.2 Potential Scale Reduction

The PSR is essentially the ratio of the total (pooled across chains) variance to the within-chain variance. Each chain is first split in two, so we get twice as many half-chains as chains. The PSR’s minimum is 1, attained when there is no between-chain variance, so a PSR near 1 indicates that we have achieved both mixing and stationarity. Poor mixing inflates the variance between chains, and non-stationarity inflates the variance between a single chain’s two halves; either failure increases the numerator, which is why this single statistic can diagnose both problems.
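
Concretely, with \(m\) half-chains of \(n\) kept iterations each, the standard split-chain estimator (following Gelman et al.; Blimp's exact implementation may differ in small details) is

\[
\widehat{PSR} = \sqrt{\frac{\frac{n-1}{n}W + \frac{1}{n}B}{W}},
\]

where \(W\) is the average of the within-half-chain variances and \(B\) is \(n\) times the variance of the half-chain means.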

The practical meaning of the PSR is indicated by its name: it estimates how much the spread of a parameter’s posterior would shrink if the chains could continue forever. In other words, would we get a more accurate posterior distribution if we let the chains run longer?

The beginning of the Blimp output reports PSRs for the burn-in period. The burn-in iterations are divided into progressively larger windows, and for each window the highest PSR is reported along with the number of the parameter it belongs to. The identities of these numbered parameters can be revealed by adding to our script:

OPTIONS: labels;

Further down in the output, each parameter’s PSR is reported for the post-burn-in iterations.

What we want to see is that all PSRs are below 1.05 by the end of the burn-in period. If not, increase the number of burn-in iterations to allow the chains to find and settle in a common posterior distribution.
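
If the burn-in PSRs have not fallen below 1.05, we might, for instance, double the burn-in in our script and rerun:

BURN: 10000;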

4.3.2.3 Number of Effective Samples

Another convergence diagnostic is the number of effective samples, found in the rightmost column of the tables of parameter estimates (N_Eff). Iterations in MCMC chains are necessarily autocorrelated, and the effective sample size estimates how many independent draws the chain is worth for that parameter after correcting for this dependence. To get a reliable estimate of the posterior variance, we want N_Eff to be at least 100 for every parameter. If it is under that, increase the number of post-burn-in iterations.
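
Conceptually, N_Eff deflates the total number of kept draws by the chain's autocorrelation (again following Gelman et al.; in practice the infinite sum is truncated):

\[
N_{\mathrm{eff}} = \frac{mn}{1 + 2\sum_{t=1}^{\infty} \hat{\rho}_t},
\]

where \(mn\) is the total number of post-burn-in draws across all chains and \(\hat{\rho}_t\) is the lag-\(t\) autocorrelation of the parameter's chain.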

4.3.2.4 Continuing Convergence Issues

If the PSRs and/or numbers of effective samples still fail their cutoffs (PSR < 1.05; N_Eff ≥ 100) after increasing the numbers of iterations to many tens or hundreds of thousands, revisit the model form. You may need to simplify the model (e.g., drop predictors from the focal model, omit auxiliary models, remove random slopes, simplify selection models).

4.3.3 Fit Statistics

The output reports the DIC and the WAIC, Bayesian analogs of the AIC. These statistics, like the AIC and BIC, are meaningless on their own, but they can be used for model comparison, where lower values indicate better fit.
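
For example, to judge whether x2 improves fit, we could fit a reduced model to the same data and prefer whichever model yields the lower WAIC. A sketch of the comparison script (identical to ours except for the MODEL line, with FIXED dropped since x2 no longer appears):

DATA: 01_regression.csv;
MODEL: y ~ x1;  # reduced model omitting x2
SEED: 123;
BURN: 5000;
ITERATIONS: 5000;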

With unclustered data, as here, the marginal and conditional WAIC are identical. With clustered data, the marginal WAIC speaks to generalization to new clusters, and the conditional WAIC to new individuals within the observed clusters.3 Model comparisons should take both into account, as changes in fit may show up at either level.4

4.3.4 Predictor Models

Blimp regresses each incomplete predictor (any predictor not listed after FIXED) on all the other predictors at each iteration, and these model estimates are also returned. Here, x1 was regressed on x2. These models do not need to be interpreted, but they give insight into the relationships among our predictors, and their estimates informed the posterior distributions of our focal model.

4.4 Exercise

Fit a regression model with the dataset exercise_01.csv. (All files are available at this link.) Regress y on x1 and x2. You have reason to believe y is missing completely at random. The coefficients on both predictors should be around 1.


  1. Gelman et al., 2021, p. 282↩︎

  2. Gelman et al., 2021, p. 285↩︎

  3. Merkle, E. C., Furr, D., & Rabe-Hesketh, S. (2019). Bayesian comparison of latent variable models: Conditional versus marginal likelihoods. Psychometrika, 84(3), 802–829. https://doi.org/10.1007/s11336-019-09679-0↩︎

  4. Gelman, A., Vehtari, A., Simpson, D., Margossian, C. C., Carpenter, B., Yao, Y., Kennedy, L., Gabry, J., Bürkner, P.-C., & Modrák, M. (2020). Bayesian workflow. arXiv. https://doi.org/10.48550/arXiv.2011.01808↩︎