11 First Steps with Dataframes
The examples on this page require an example file, which you can download by clicking on the link below. (If the data file opens up in your browser instead of downloading, right-click on the link and select “Save link as…”)
The majority of your statistical work will revolve around data sets. For statistical work, a data set is a rectangular (matrix-like) arrangement of measurements (variables, columns) collected on units of observation that are arranged in rows. This structure is crucial to keep the correct measurements (data values) connected with each observation.
The typical data analysis project begins by reading a data set into R from some external file.
In R this structure is called a dataframe. A dataframe is an ordered list of vectors, where the vectors are all the same length. A dataframe also always has column names (variable names) and row names (often just the observation number as a character value). When indexing the elements of a dataframe we always have all three methods available: by position, by name, or by condition.
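As a quick sketch of these three indexing methods on a small, made-up dataframe (not the data used in this chapter):

```r
# A tiny example dataframe (made up for illustration)
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

df[2, ]          # by position: the second row
df[["a"]]        # by name: column "a" as a vector
df[df$a > 1, ]   # by condition: rows where a is greater than 1
```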
A variation on the dataframe is the tidyverse tibble class. Tibbles have their own methods for some generic functions (like print()). They also have data.frame as a secondary class, so any tibble can be used as a dataframe.
The following examples use a subset of the American Community Survey (ACS) from 2000. This is the same data set used in the Data Wrangling in Stata curriculum. To follow along, download the dataset here.
When you start working with a data set, especially if it was created by somebody else (that includes past-you!), resist the temptation to start running models immediately. First, take time to understand the data. What information does it contain? What is the structure of the data set? What is the data type of each column? Is there anything strange in the data set? It’s better to find out now, and not when you’re in the middle of modeling!
11.1 Start Your Script
Now, create a new script with File - New File - New R Script or by clicking on the New File icon in the toolbar. Save this script with a sensible name, such as “01_cleaning.R”. We can imagine a series of scripts we might run after this one, such as “02_descriptive_statistics.R”, “03_regression.R”, “04_plots.R”, and so on.
The first few lines of a script should load libraries and read in our data. The dplyr package includes a wide range of functions for manipulating dataframes that are essential for basic and advanced data wrangling, and forcats has several functions we will use for working with categorical variables.
library(dplyr)
library(forcats)
acs <- read.csv("2000_acs_sample.csv")
11.1.1 Piping
As of version 4.1.0, R has a native pipe operator, |>. The pipe uses the result of an expression as the first argument of the following expression.
We can write our code without pipes in a nested manner:
sqrt(mean(seq(1, 5, 1)))
[1] 1.732051
Or we can use pipes to take the seq(1, 5, 1), give it to mean() to calculate the average, and then pass that average to sqrt() to calculate the square root:
seq(1, 5, 1) |> mean() |> sqrt()
[1] 1.732051
Using pipes makes writing and reading code easier. As we write strings of functions, we are less likely to misplace or miscount parentheses. As we read our own and others’ code, we see the operations in the order they are performed. The nested example above starts with sqrt(), which is the last operation we carry out, while the piped code reveals that the first thing we do is create a vector of one through five.
11.2 Look at the Data
A dataframe consists of rows called observations and columns called variables. The data recorded for an individual observation are stored as values in the corresponding variable.
| | Variable | Variable | Variable | … |
|---|---|---|---|---|
| Observation | Value | Value | Value | |
| Observation | Value | Value | Value | |
| Observation | Value | Value | Value | |
| … | | | | |
If you have a dataset already in this format, you are in luck. However, we might run into datasets that need a little work before we can use them. A single row might have multiple observations, or a single variable might be spread across multiple columns. Organizing, or tidying, datasets is the focus of the remainder of this book.
Now that we have the acs dataset loaded, a first step in looking at our data is checking its dimensions, row names (if it has any), and column names.
nrow(acs)
[1] 28172
rownames(acs)[1:10] # just look at the first 10
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
ncol(acs)
[1] 16
colnames(acs)
[1] "year" "datanum" "serial" "hhwt" "gq" "us2000c_serialno"
[7] "pernum" "perwt" "us2000c_pnum" "us2000c_sex" "us2000c_age" "us2000c_hispan"
[13] "us2000c_race1" "us2000c_marstat" "us2000c_educ" "us2000c_inctot"
dim(acs) # nrow() and ncol() together
[1] 28172 16
Next, the summary() function provides simple summary statistics for numeric vectors, and str() will, in the case of dataframes, tell us the data type and the first few values of each column.
summary(acs)
year datanum serial hhwt gq us2000c_serialno pernum
Min. :2000 Min. :4 Min. : 37 Min. :100 Length:28172 Min. : 92 Min. : 1.000
1st Qu.:2000 1st Qu.:4 1st Qu.: 323671 1st Qu.:100 Class :character 1st Qu.:2395745 1st Qu.: 1.000
Median :2000 Median :4 Median : 617477 Median :100 Mode :character Median :4905730 Median : 2.000
Mean :2000 Mean :4 Mean : 624234 Mean :100 Mean :4951676 Mean : 2.208
3rd Qu.:2000 3rd Qu.:4 3rd Qu.: 937528 3rd Qu.:100 3rd Qu.:7444248 3rd Qu.: 3.000
Max. :2000 Max. :4 Max. :1236779 Max. :100 Max. :9999402 Max. :16.000
perwt us2000c_pnum us2000c_sex us2000c_age us2000c_hispan us2000c_race1 us2000c_marstat
Min. :100 Min. : 1.000 Min. :1.000 Min. : 0.00 Min. : 1.00 Min. :1.000 Min. :1.000
1st Qu.:100 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.: 17.00 1st Qu.: 1.00 1st Qu.:1.000 1st Qu.:1.000
Median :100 Median : 2.000 Median :2.000 Median : 35.00 Median : 1.00 Median :1.000 Median :3.000
Mean :100 Mean : 2.208 Mean :1.512 Mean : 35.92 Mean : 1.77 Mean :1.935 Mean :2.973
3rd Qu.:100 3rd Qu.: 3.000 3rd Qu.:2.000 3rd Qu.: 51.00 3rd Qu.: 1.00 3rd Qu.:1.000 3rd Qu.:5.000
Max. :100 Max. :16.000 Max. :2.000 Max. :933.00 Max. :24.00 Max. :9.000 Max. :5.000
us2000c_educ us2000c_inctot
Min. : 0.000 Length:28172
1st Qu.: 4.000 Class :character
Median : 9.000 Mode :character
Mean : 7.871
3rd Qu.:11.000
Max. :16.000
str(acs)
'data.frame': 28172 obs. of 16 variables:
$ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
$ datanum : int 4 4 4 4 4 4 4 4 4 4 ...
$ serial : int 37 37 37 241 242 296 377 418 465 465 ...
$ hhwt : int 100 100 100 100 100 100 100 100 100 100 ...
$ gq : chr "Households under 1970 definition" "Households under 1970 definition" "Households under 1970 definition" "Households under 1970 definition" ...
$ us2000c_serialno: int 365663 365663 365663 2894822 2896802 3608029 4720742 5176658 5856346 5856346 ...
$ pernum : int 1 2 3 1 1 1 1 1 1 2 ...
$ perwt : int 100 100 100 100 100 100 100 100 100 100 ...
$ us2000c_pnum : int 1 2 3 1 1 1 1 1 1 2 ...
$ us2000c_sex : int 2 2 2 2 2 2 2 2 1 2 ...
$ us2000c_age : int 20 19 19 50 29 20 69 59 55 47 ...
$ us2000c_hispan : int 1 1 1 1 1 1 1 1 1 1 ...
$ us2000c_race1 : int 1 1 2 1 1 6 1 1 2 2 ...
$ us2000c_marstat : int 5 5 5 5 5 5 5 2 4 5 ...
$ us2000c_educ : int 11 11 11 14 13 9 1 8 12 1 ...
$ us2000c_inctot : chr "10000" "5300" "4700" "32500" ...
We can learn a few things about the data:
- it seems like year, hhwt, and perwt are always the same values (note that the summary statistics are all a single number)
- several variables, such as us2000c_sex and us2000c_race1, are integers, but their names suggest categorical variables; here we should refer to the 2000 ACS codebook to recode these variables
- gq and us2000c_inctot are character vectors, while all the others are numeric
- us2000c_age has a maximum value of 933, which sounds impossibly high if age is in years
To check that year and the other variables always take the same value, use the table() function, which reveals that the value 2000 occurs 28172 times.
table(acs$year)
2000
28172
Alternatively, use the unique() function in conjunction with the length() function to find the number of unique values in a vector. One unique value means that every value is identical.
unique(acs$year) # 2000 is the only value
[1] 2000
acs$year |> unique() |> length() # 1 unique value
[1] 1
We may choose to drop these columns later on, and this can be done with subset(acs, select = -year) or select(acs, -year) (see the chapters on Subsetting).
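As a sketch on a small, made-up dataframe (not the acs data), both approaches drop a constant column:

```r
library(dplyr)

# Hypothetical dataframe with a constant "year" column
df <- data.frame(year = c(2000, 2000), age = c(20, 50))

base_way  <- subset(df, select = -year)  # base R
dplyr_way <- select(df, -year)           # dplyr
```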
11.3 Renaming Variables
11.3.1 Rename Individual Columns
The rename() function in dplyr allows for easy renaming of individual columns. The pattern is new_name = old_name. We can change pernum to “person” and serial to “household”.
acs <-
acs |>
rename(person = pernum,
household = serial)
11.3.2 Rename Multiple Columns
Several columns have the prefix “us2000c_”, which is a bit redundant since the data is all from the US and from the year 2000. Instead of renaming them one-by-one, we can rename several columns at once with rename_with() and the sub() function. In Substitution, we learned that sub() allows us to substitute one character string for another. Substituting “us2000c_” for “” (nothing) will delete this prefix.

sub() takes three arguments. It looks through the third and replaces the first occurrence of the first with the second. Here, we go through the column names of acs and replace the first occurrence of “us2000c_” with “”. Within rename_with(), we can reference the column names as .x.
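Before renaming anything, we can try sub() on a couple of sample names:

```r
# sub() replaces the first occurrence of the pattern in each element
sub("us2000c_", "", c("us2000c_sex", "us2000c_age"))
# [1] "sex" "age"
```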
colnames(acs)
[1] "year" "datanum" "household" "hhwt" "gq" "us2000c_serialno"
[7] "person" "perwt" "us2000c_pnum" "us2000c_sex" "us2000c_age" "us2000c_hispan"
[13] "us2000c_race1" "us2000c_marstat" "us2000c_educ" "us2000c_inctot"
acs <-
acs |>
rename_with(~ sub("us2000c_", "", .x))
colnames(acs)
[1] "year" "datanum" "household" "hhwt" "gq" "serialno" "person" "perwt" "pnum"
[10] "sex" "age" "hispan" "race1" "marstat" "educ" "inctot"
11.4 Creating Variables
mutate() is a variable creation and replacement function. If a new variable name is supplied, a new variable is created. If an existing variable name is supplied, that variable is replaced without any warning.
11.4.1 Numeric
One way we can create a numeric variable is by multiplying a single existing column by a constant. Multiplying age (assumed to be in years) by 12 results in age in months. The variable age_months does not currently exist in acs, so a new variable is created.
acs <-
acs |>
mutate(age_months = age * 12)
Variables can also be created from multiple existing columns, through addition, multiplication, logarithms, averages, minimums, or any combination of functions and operators.
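For instance, on a small, made-up dataframe (hypothetical column names, not from the ACS), we might combine two columns with mutate():

```r
library(dplyr)

# Hypothetical income components
df <- data.frame(wages = c(100, 200), interest = c(10, 0))

df <- df |>
  mutate(total = wages + interest,  # sum of two existing columns
         log_total = log1p(total))  # a function of the column just created
```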
11.4.2 Character
In our dataset, the identifier is currently spread out across two variables: household and person. We can put these two together with paste() so that we have a single variable that uniquely identifies observations.
acs <-
acs |>
mutate(id = paste(household, person, sep = "_"))
11.4.3 Categorical
Categorical variables can be of type numeric or factor. Numeric categorical variables work when we have only two categories, and we code them as 0 and 1. These are also called dummy or indicator variables. Factor categorical variables work with any number of categories.
We can create a new indicator variable called female that contains 0 for male and 1 for female. The sex column contains 1s and 2s. According to the codebook, these correspond to male and female, respectively. If sex == 2, we can assign the value 1, and if not, 0.
acs <-
acs |>
mutate(female = ifelse(sex == 2, 1, 0))
We could also have recoded our original sex variable to take the values Male and Female by supplying character values in ifelse()’s second and third arguments. Here we assign the result to sex, a column that already exists, so the original column is replaced.
acs <-
acs |>
mutate(sex = ifelse(sex == 2, "Female", "Male"))
Let’s recode the gq column too. First, look at the values this variable can take.
unique(acs$gq)
[1] "Households under 1970 definition" "Other group quarters"
[3] "Group quarters--Institutions" "Additional households under 1990 definition"
Perhaps we only want two categories: Households and Group Quarters. We can use fct_collapse() to reduce the number of categories:
acs <-
acs |>
mutate(gq_recode = fct_collapse(gq,
"Households" = c("Households under 1970 definition",
"Additional households under 1990 definition"),
"Group Quarters" = c("Group quarters--Institutions",
"Other group quarters")))
We can also create indicators for continuous variables using other logical comparisons. Let’s create an indicator called adult that has a 1 if an individual is at least 18 years old, and a 0 if not.
acs <-
acs |>
mutate(adult = ifelse(age >= 18, 1, 0))
All of our categorical variables so far have had only two categories. A multiple-level categorical variable can be created with case_when(). Each argument within case_when() follows the pattern condition ~ value_if_TRUE. Any values that return FALSE for every condition are assigned a value of NA.
x <- 1:5
case_when(x < 3 ~ "Less than 3",
x == 3 ~ "Equal to 3",
x > 3 ~ "Greater than 3")
[1] "Less than 3" "Less than 3" "Equal to 3" "Greater than 3" "Greater than 3"
Using the definitions from the 2000 ACS codebook (available here), we can recode educ into a categorical variable with the levels less than high school (education codes 1-8), high school (9), some college (10-12), bachelors (13), and advanced degree (14-16). A value of 0 for education means not applicable and is only used for individuals less than three years old. If we do not include 0 in any of our case_when() statements, it will be assigned a value of NA.

We can then use fct_relevel() within the same mutate() call to specify an order for our categorical variable, since the default is alphabetical order.
acs <-
acs |>
mutate(educ_categories = case_when(educ >= 1 & educ <= 8 ~ "Less than High School",
educ == 9 ~ "High School",
educ >= 10 & educ <= 12 ~ "Some College",
educ == 13 ~ "Bachelors",
educ >= 14 ~ "Advanced Degree"),
educ_categories = fct_relevel(educ_categories,
"Less than High School",
"High School",
"Some College",
"Bachelors",
"Advanced Degree"))
We can confirm the recoding of educ was successful by examining the output of table():
table(acs$educ, acs$educ_categories, useNA = "ifany")
Less than High School High School Some College Bachelors Advanced Degree <NA>
0 0 0 0 0 0 1126
1 1317 0 0 0 0 0
2 2508 0 0 0 0 0
3 1304 0 0 0 0 0
4 1648 0 0 0 0 0
5 923 0 0 0 0 0
6 1059 0 0 0 0 0
7 906 0 0 0 0 0
8 889 0 0 0 0 0
9 0 5959 0 0 0 0
10 0 0 1578 0 0 0
11 0 0 3191 0 0 0
12 0 0 1221 0 0 0
13 0 0 0 2960 0 0
14 0 0 0 0 1068 0
15 0 0 0 0 347 0
16 0 0 0 0 168 0
We can imagine using this educ_categories variable as a predictor in a statistical model or for creating bar graphs of income by educational attainment.
11.5 Changing Values
In addition to modifying whole variables, we can also use mutate() to change some values within a variable.
11.5.1 Change Values to Missing
In our exploration of the data, recall that inctot is a character vector, but the first few values shown by str() appear to be numbers.

Open the data set with View(acs) to see why it is a character vector. Some of the values are BBBBBBB. The Census uses this code for missing data. We can recode the B’s as missing values with the na_if() function while leaving the other values as they are.
acs <-
  acs |>
  mutate(inctot = na_if(inctot, "BBBBBBB"))
At this step, the column is still a character vector, so we need to convert it into a numeric vector.
typeof(acs$inctot)
[1] "character"
acs <-
acs |>
mutate(inctot = as.numeric(inctot))
Another approach we could take if we knew our missing code in advance is to specify this when reading in the data. If you take this approach, you will need to re-run the above code of renaming columns in order to follow along for the remainder of this chapter.
acs <- read.csv("2000_acs_sample.csv", na.strings = "BBBBBBB")
11.5.1.1 Quantify Missing Data
We should now check how much data is missing from the dataframe.
We can calculate how much data is missing from acs as a whole. To do so, first use is.na() as a test of whether the data is missing. This will turn the entire dataframe into TRUE and FALSE values, where TRUE means the data is missing. Then, take the sum or the mean. In doing so, TRUE and FALSE will be coerced into 1 and 0, respectively.
acs |> is.na() |> sum()
[1] 7283
acs |> is.na() |> mean()
[1] 0.01175087
A total of 7283 values are missing, 1.2% of our dataset.
To calculate missingness by individual columns, first turn the dataframe into logical values with is.na() as above, and then take column sums or means.
acs |> is.na() |> colSums()
year datanum household hhwt gq serialno person
0 0 0 0 0 0 0
perwt pnum sex age hispan race1 marstat
0 0 0 0 0 0 0
educ inctot age_months id female gq_recode adult
0 6157 0 0 0 0 0
educ_categories
1126
acs |> is.na() |> colMeans()
year datanum household hhwt gq serialno person
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
perwt pnum sex age hispan race1 marstat
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
educ inctot age_months id female gq_recode adult
0.00000000 0.21855033 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
educ_categories
0.03996876
We now see that almost all of the missing values are in inctot, where 21.9% of the values are missing; the rest are in educ_categories, the NAs created when we recoded educ.
If you work with missing data and imputation, be sure to check out this excellent resource by Stef van Buuren on using the mice package.
11.5.2 Change Values to Other Values
Earlier we saw that age had a maximum value of 933. If we assume this variable is in years, this value seems far too high. Use table() to see the values age takes.
table(acs$age)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
371 391 364 423 390 388 376 392 426 450 444 403 461 437 441 396 400 367 375 407 415 362 363 412 345 352 376 378 374 397
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
441 397 373 402 387 440 466 454 462 435 466 473 454 421 431 410 440 425 395 363 361 365 357 351 285 279 275 288 266 225
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
234 226 227 230 236 185 195 187 182 203 175 187 188 186 183 155 150 138 141 121 118 109 109 98 90 74 71 67 54 39
92 93 933
1 149 1
Only one observation has a value of 933, and the next highest value is 93. We could take at least three approaches to deal with this number: we might decide it should be 93 and whoever entered the data made a typo; we could change the value to missing because we are not certain what it should have been; or we could drop this case altogether.
(Note: The original ACS dataset did not have this value of 933 for age. The value was intentionally edited from 93 to 933 for this exercise.)
If we take the third approach, we can drop this particular row by taking a subset, either with subset(acs, subset = !age %in% 933) or filter(acs, !age %in% 933) (see the chapters on Subsetting for more). If we want to make it missing, we can use the na_if() approach above. If we think, or rather know, it should be 93, we can use mutate() in conjunction with ifelse().

The arguments of ifelse() are a condition, a value if TRUE, and a value if FALSE. The code below checks whether a value of age is 933. If it is, it changes it to 93. If not, it keeps the observation’s original value of age.
acs <-
acs |>
mutate(age = ifelse(age %in% 933, 93, age))
11.6 Save Your Dataframe and Script
Now that we have cleaned up the ACS data set, it is a good idea to end the script by saving the cleaned data set.
saveRDS(acs, "acs_cleaned.rds")
By saving the resulting data set, you can now begin the next script (“02_…”) with acs <- readRDS("acs_cleaned.rds"). This first script is your record of how you made changes to the raw data. It serves as a record to future-you, to remind you of what you did, and to colleagues and journal reviewers who have questions.
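A minimal round-trip sketch (using a temporary file rather than the project folder) shows that readRDS() restores exactly what saveRDS() wrote:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
path <- tempfile(fileext = ".rds")

saveRDS(df, path)     # end of the cleaning script
df2 <- readRDS(path)  # start of the next script

identical(df, df2)    # TRUE: the round trip is lossless
```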
11.7 Exercises
1. Start a script that loads dplyr and the sleep dataset.
2. Read the documentation at help(sleep).
3. Examine the data. What type is each column? How are the data distributed? Is any data missing?
4. Add a new column that says “One” if group is 1, and “Two” if group is 2.
5. Replace extra with NA if it is below zero.
6. Multiply extra by 60 so that it is minutes rather than hours.
7. Change the name of extra to extra_minutes.
8. Save the dataset as an RDS file, and save your script.