9  Preparing Your Data

The examples and exercises in the previous chapters all used cleaned, prepared datasets. In real-world analyses, you will almost always need to wrangle your data before you analyze it. For guidance on that process, see our online data wrangling curriculum.

In addition to general data wrangling (creating new variables, recoding existing variables, cleaning values, etc.), Blimp also requires that our datasets have only numeric variables and use a single missing code. The missing code can be some number we identify with the MISSING command in our script, such as -99, or the string “NA”, which is R’s default when saving a dataframe to CSV.

If our dataset has any character/string data, even in variables we are not using in a model, Blimp will return an error like this one:

ERROR: Line 2, column 1 is non-numeric.
       Only numeric data is allowed.
       Parsed data: "Wisconsin".
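
Because the error reports only the first offending value, it can help to list every non-numeric column before exporting. The following is a minimal sketch in base R; the toy data frame and its column names are hypothetical stand-ins for your own data.

```r
# Hypothetical toy data: 'state' is a character column that Blimp would reject.
dat <- data.frame(
  state  = c("Wisconsin", "Iowa", "Wisconsin"),
  income = c(52.1, NA, 48.3),
  stringsAsFactors = FALSE
)

# Names of all columns that are not numeric (character, factor, logical, ...).
non_numeric <- names(dat)[!vapply(dat, is.numeric, logical(1))]
print(non_numeric)
```

Any column named in `non_numeric` needs to be converted or dropped before the file is passed to Blimp.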

Below are examples of R and Stata code that convert character and categorical variables to numeric and apply a missing data code that Blimp accepts. These examples assume you have already otherwise cleaned your data. After exporting the dataset, you can fit a model to it with Blimp.

9.1 R

  1. Change character variables to factors.

    library(dplyr)
    # across() applies as.factor to every character column
    dat <- 
        dat |> 
        mutate(across(where(is.character), as.factor))
  2. Change factor variables to numeric.

    # as.numeric() replaces each factor level with its integer code
    dat <- 
        dat |> 
        mutate(across(where(is.factor), as.numeric))
  3. Export to CSV, keeping the default of write.csv() to export missing as “NA”.

    write.csv(dat, "dat.csv", row.names = FALSE)
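
Converting factors to numbers discards the original labels, so it may be worth saving the code-to-label mapping before step 2. The sketch below runs the three steps on a hypothetical toy data frame, stores the mapping in a list named `codebook` (a name chosen here for illustration), and writes to a temporary directory rather than your working directory.

```r
library(dplyr)

# Hypothetical toy data with one character column and some missing values.
dat <- data.frame(
  state = c("Wisconsin", "Iowa", "Wisconsin", NA),
  score = c(10, 12, NA, 9),
  stringsAsFactors = FALSE
)

# Step 1: character -> factor.
dat <- dat |> mutate(across(where(is.character), as.factor))

# Save the code-to-label mapping for each factor before the labels are lost.
# Levels are sorted alphabetically by default, so "Iowa" = 1, "Wisconsin" = 2.
codebook <- lapply(Filter(is.factor, dat), function(f) {
  data.frame(code = seq_along(levels(f)), label = levels(f))
})

# Step 2: factor -> numeric codes.
dat <- dat |> mutate(across(where(is.factor), as.numeric))

# Confirm that every column is now numeric, as Blimp requires.
stopifnot(all(vapply(dat, is.numeric, logical(1))))

# Step 3: export; missing values are written as NA by default.
out_path <- file.path(tempdir(), "dat.csv")
write.csv(dat, out_path, row.names = FALSE)
```

Note that a character column holding numbers (for example "3.5") would also be turned into level codes by these steps, not into the numbers themselves. Convert such columns directly with as.numeric() before step 1.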

Exporting the dataset is not necessary if you are using rblimp, since its data argument accepts a dataset that is already in memory.

9.2 Stata

  1. Change string variables to labeled numeric variables.

    encode stringvar, gen(stringvar_numeric)
    drop stringvar
  2. Change all missing values into some value, like -99.

    mvencode _all, mv(-99)
  3. Export to CSV without value labels.

    export delimited using dat.csv, nolabel