# 9 Categorical

We can divide data into two general categories: continuous and categorical. Continuous data is numeric, has a natural order, and can potentially take on an infinite number of values. Examples include age, income, and health care expenditures. In contrast, categorical data takes on a limited number of values and may or may not have a natural order. Examples without a natural order include race, state of residence, and political affiliation. Examples with a natural order include Likert scale items (e.g., disagree, neutral, agree), socioeconomic status, and educational attainment.

The distinction between continuous and categorical variables is fundamental to how we use them in an analysis. For example, in a regression model, continuous variables give us slopes while categorical variables give us intercepts.

In R, categorical data is managed as *factors*. We specify which variables are factors when we create and store them, and then they are treated as categorical variables in a model without any additional specification. This contrasts with other software like Stata, SAS, and SPSS, where we specify which variables are categorical in our model syntax.
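As a rough illustration of slopes versus intercepts, the sketch below fits a linear model to R’s built-in `mtcars` data. The dataset and the choice of variables are assumptions made only for this example. `wt` is continuous and gets a single slope, while `cyl`, once stored as a factor, gets one intercept adjustment per non-reference level without any extra model syntax.

```r
# Illustrative sketch; mtcars and these variables are chosen only as an example.
cars <- mtcars
cars$cyl <- factor(cars$cyl)   # store number of cylinders as a factor

fit <- lm(mpg ~ wt + cyl, data = cars)

# One slope for the continuous predictor (wt), plus one intercept
# adjustment for each non-reference level of the factor (cyl6, cyl8).
names(coef(fit))
```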

## 9.1 Factor Structure

Understanding how R represents factors will help us understand how to manipulate them.

Let’s begin with a simple character vector, `x`, that contains the letters a, b, c, and d.

`x <- c("a", "b", "c", "b", "d", "a", "a")`

When we print `x`, R quotes the values:

`x`

`[1] "a" "b" "c" "b" "d" "a" "a"`

Now, make a factor called `y` from `x` with the `factor()` function. `y` prints differently than `x`. It does not have quotes around its values, and it contains a line about something called “levels,” which contains the unique values of `y`.

```
y <- factor(x)
y
```

```
[1] a b c b d a a
Levels: a b c d
```

A factor is composed of two parts: an integer vector and an ordered label vector. To see the underlying integer vector of `y`, we can coerce it to numeric with `as.numeric()`:

`as.numeric(y)`

`[1] 1 2 3 2 4 1 1`

And we can see the ordered labels with the `levels()` function:

`levels(y)`

`[1] "a" "b" "c" "d"`

The label “a” is the first level, “b” the second, “c” the third, and “d” the fourth. Factor levels are ordered alphabetically by default. We could make a key as follows:

| Integer | Label |
|---|---|
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |

Now, our `y` factor is actually an integer vector, but when we print it, R shows the corresponding labels.

Compare the output from `y` and `as.numeric(y)` once again, and refer to the table above.

```
y
[1] a b c b d a a
Levels: a b c d
as.numeric(y)
[1] 1 2 3 2 4 1 1
```
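The two parts of a factor can also be pulled apart and put back together by hand. A minimal sketch, where the names `codes` and `lbls` are introduced only for illustration:

```r
x <- c("a", "b", "c", "b", "d", "a", "a")
y <- factor(x)

codes <- as.integer(y)   # underlying integer vector
lbls  <- levels(y)       # ordered label vector

# Indexing the labels by the integer codes recovers the values R prints.
lbls[codes]
identical(lbls[codes], as.character(y))
```

This lookup is essentially what R does when it prints a factor: each integer is used as a position in the label vector.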

## 9.2 Factor Defaults

When we include a character variable when plotting or modeling in R, R treats it as a factor, and its default is to

- sort levels alphabetically by their label
- leave the existing labels as-is
- include all unique values

Why are these problems?

To see why this is not ideal, let’s create some survey response data, where individuals could indicate whether they agreed, neither agreed nor disagreed, or disagreed with a statement, or if the statement did not apply to them (“not applicable”). Individuals could also choose to not answer the question (“no answer”).

```
answers <- c("AGREE",
             "NEITHER",
             "DISAGREE",
             "NOT APPLICABLE",
             "NO ANSWER")
# Draw 100 responses at random, with replacement
responses <- sample(answers, 100, replace = TRUE)
```

Now, make a plot of counts by response.

```
library(ggplot2)
ggplot() + geom_bar(aes(responses))
```

(See more on using `ggplot2` in Data Visualization in R with ggplot2.)

We can see R’s defaults in action:

- The variable was ordered alphabetically, but it would make more sense to have “neither” between “agree” and “disagree.”
- The labels were used as-is, but we might choose to not use all-uppercase labels to improve the appearance of the plot (i.e., “Agree” rather than “AGREE”).
- All values in our data were included, but we might consider combining the two categories of “no answer” and “not applicable” to a single “missing.”
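The first of these defaults can be checked without a plot. A minimal sketch, reusing the five possible answers from above; the name `default_levels` is introduced only for this example:

```r
answers <- c("AGREE", "NEITHER", "DISAGREE", "NOT APPLICABLE", "NO ANSWER")

# factor() sorts the unique values to build the levels,
# regardless of the order in which they appear in the data.
default_levels <- levels(factor(answers))
default_levels

# The default levels match a plain sort of the unique values.
identical(default_levels, sort(unique(answers)))
```

Because the levels come from sorting, “NEITHER” lands after “DISAGREE” rather than between “AGREE” and “DISAGREE”.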

We will learn how to manipulate these three aspects of factors in R: order (releveling), labels (recoding), and number of categories (collapsing). For all of these operations, we will be making use of the `forcats` library, which makes it easy to manipulate factors.

## 9.3 Releveling

When we relevel a factor, we change the order of its levels. If we give `fct_relevel()` the name of a factor and the name of a level, it moves that level to the first position and leaves the others in their current order.

First, load the `forcats` library.

`library(forcats)`

Using `y` from earlier, create `y_relevel`, which has “b” instead of “a” as its first level.

`y_relevel <- fct_relevel(y, "b")`

The levels of `y_relevel` are as follows:

| Integer | Label |
|---|---|
| 1 | b |
| 2 | a |
| 3 | c |
| 4 | d |

Compare the output from `y` and `y_relevel` when printing them as is, printing their underlying integers, and printing their levels.

```
y
[1] a b c b d a a
Levels: a b c d
y_relevel
[1] a b c b d a a
Levels: b a c d
as.numeric(y)
[1] 1 2 3 2 4 1 1
as.numeric(y_relevel)
[1] 2 1 3 1 4 2 2
levels(y)
[1] "a" "b" "c" "d"
levels(y_relevel)
[1] "b" "a" "c" "d"
```

The data still prints the same, but the order of the levels has changed, and this is reflected in the integer output. In `y_relevel`, “b” is 1 and “a” is 2.
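For comparison, the sketch below shows two related tools that the text above does not cover: base R’s `relevel()`, which moves one level to the front, and the `after` argument of `fct_relevel()`, which places the named level somewhere other than first.

```r
library(forcats)

y <- factor(c("a", "b", "c", "b", "d", "a", "a"))

# Base R: relevel() moves the level named in `ref` to the first position.
levels(relevel(y, ref = "b"))

# forcats: fct_relevel() does the same by default ...
levels(fct_relevel(y, "b"))

# ... and with after = Inf it moves the named level to the end instead.
levels(fct_relevel(y, "a", after = Inf))
```

Either function changes only the order of the levels; the values themselves print the same as before.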

### 9.3.1 Changing the Reference Level

We can use R’s built-in `chickwts` dataset, which has a factor variable called `feed`.

`str(chickwts)`

```
'data.frame': 71 obs. of 2 variables:
$ weight: num 179 160 136 227 217 168 108 124 143 140 ...
$ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
```

The first, or reference, level of `feed` is “casein”:

`levels(chickwts$feed)`

`[1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"`

If we make box plots of weight by feed, we see that “casein” is the first level on the x-axis:

`plot(weight ~ feed, chickwts)`

And if we predict weight by feed in a linear model, we will get this output:

`lm(weight ~ feed, chickwts)`

```
Call:
lm(formula = weight ~ feed, data = chickwts)
Coefficients:
(Intercept) feedhorsebean feedlinseed feedmeatmeal feedsoybean feedsunflower
323.583 -163.383 -104.833 -46.674 -77.155 5.333
```

Our reference level, casein, is omitted since it is represented by the intercept. The other feeds’ coefficients are comparisons to the reference category, or adjustments to the intercept. In R model output, the coefficients for the levels of a factor variable are labeled with the variable name followed by the level label, so the coefficient for the `horsebean` level of the `feed` variable is called `feedhorsebean`.

If we want to change the ordering of a categorical variable in a plot or change the reference level in a statistical model, we should relevel our factor. We can make another variable in `chickwts` called `feed2` where “soybean” is the reference level:

`chickwts$feed2 <- fct_relevel(chickwts$feed, "soybean")`

Now, if we plot weight by feed with `feed2`, “soybean” is the first value on the x-axis, followed by “casein” and all the other levels.

`plot(weight ~ feed2, chickwts)`

And if we fit our model again with `feed2`, we see that the intercept has changed since it now represents the expected value of weight when the feed is set to “soybean.” If we want to calculate the expected value for casein, we would add its coefficient (77) to the intercept (246), resulting in an estimate of 323. This matches the intercept in our previous model!

`lm(weight ~ feed2, chickwts)`

```
Call:
lm(formula = weight ~ feed2, data = chickwts)
Coefficients:
(Intercept) feed2casein feed2horsebean feed2linseed feed2meatmeal
246.43 77.15 -86.23 -27.68 30.48
feed2sunflower
82.49
```
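The arithmetic above can be checked directly from the fitted models. A minimal sketch, where `chicks`, `fit1`, `fit2`, `casein_fit1`, and `casein_fit2` are names introduced only for this example:

```r
library(forcats)

chicks <- chickwts   # work on a copy of the built-in data
chicks$feed2 <- fct_relevel(chicks$feed, "soybean")

fit1 <- lm(weight ~ feed,  data = chicks)   # reference level: casein
fit2 <- lm(weight ~ feed2, data = chicks)   # reference level: soybean

# Expected weight for casein under each parameterization
casein_fit1 <- unname(coef(fit1)["(Intercept)"])
casein_fit2 <- unname(coef(fit2)["(Intercept)"] + coef(fit2)["feed2casein"])

all.equal(casein_fit1, casein_fit2)

# Both equal the observed mean weight of the casein group.
all.equal(casein_fit1, mean(chicks$weight[chicks$feed == "casein"]))
```

Releveling changes which group the intercept describes, but not the fitted values: the two models are the same model written with different reference categories.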

### 9.3.2 Changing All the Levels

So far we have only specified one level with `fct_relevel()`, but we can specify as many levels as we want, up to the number of levels in our factor. Anything we do not name will follow those we do name.

This is useful when our factor levels have a natural order, like responses on a Likert scale:

```
survey <- c("Strongly agree",
"Agree",
"Neither agree nor disagree",
"Disagree",
"Strongly disagree")
survey <- factor(survey)
levels(survey)
[1] "Agree" "Disagree" "Neither agree nor disagree"
[4] "Strongly agree" "Strongly disagree"
survey <- fct_relevel(survey,
"Strongly agree",
"Agree",
"Neither agree nor disagree",
"Disagree",
"Strongly disagree")
levels(survey)
[1] "Strongly agree" "Agree" "Neither agree nor disagree"
[4] "Disagree" "Strongly disagree"
```
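When the desired order is already known, it can also be set at the moment the factor is created, using the `levels` argument of base R’s `factor()`. This is an alternative to the `fct_relevel()` call above, not a replacement for it. The names `responses`, `survey_levels`, and `survey2` are introduced only for this sketch.

```r
responses <- c("Disagree", "Strongly agree", "Agree",
               "Neither agree nor disagree", "Strongly disagree", "Agree")

# Assumed for this sketch: the full set of valid answers, in the desired order.
survey_levels <- c("Strongly agree",
                   "Agree",
                   "Neither agree nor disagree",
                   "Disagree",
                   "Strongly disagree")

survey2 <- factor(responses, levels = survey_levels)

levels(survey2)       # same order as survey_levels, not alphabetical
as.integer(survey2)   # integer codes follow that order
```

One caution with this approach: any value in the data that is not listed in `levels` is silently converted to `NA`, so it is worth checking for unexpected missing values afterward.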

## 9.4 Recoding

Recoding is changing the labels on our factors. To do this, we can supply `fct_recode()` with our factor and a series of `new_label = current_label` pairs. Anything we do not name will be left in its existing state.

Using `y` again, create `y_recode`, which has full fruit names instead of just their first letters:

```
y_recode <- fct_recode(y,
"apple" = "a",
"banana" = "b",
"cantaloupe" = "c",
"durian" = "d")
```

The levels of `y_recode` are as follows:

| Integer | Label |
|---|---|
| 1 | apple |
| 2 | banana |
| 3 | cantaloupe |
| 4 | durian |

Only the labels of the data are changed when recoding. The order and the underlying numeric values remain the same.

```
y
[1] a b c b d a a
Levels: a b c d
y_recode
[1] apple banana cantaloupe banana durian apple apple
Levels: apple banana cantaloupe durian
as.numeric(y)
[1] 1 2 3 2 4 1 1
as.numeric(y_recode)
[1] 1 2 3 2 4 1 1
levels(y)
[1] "a" "b" "c" "d"
levels(y_recode)
[1] "apple" "banana" "cantaloupe" "durian"
```

With `chickwts`, we can change how one or more levels are labeled. Perhaps two of the feeds were produced by two companies (A and B), and we can indicate that by changing the labels:

```
chickwts$feed3 <- fct_recode(chickwts$feed,
"Company A" = "soybean",
"Company B" = "meatmeal")
levels(chickwts$feed)
[1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"
levels(chickwts$feed3)
[1] "casein" "horsebean" "linseed" "Company B" "Company A" "sunflower"
```

Use the `table()` function after recoding to verify that the recoding worked as intended:

`table(chickwts$feed, chickwts$feed3)`

```
casein horsebean linseed Company B Company A sunflower
casein 12 0 0 0 0 0
horsebean 0 10 0 0 0 0
linseed 0 0 12 0 0 0
meatmeal 0 0 0 11 0 0
soybean 0 0 0 0 14 0
sunflower 0 0 0 0 0 12
```

The first argument of `table()` (`feed`) is the rows, and the second (`feed3`) is the columns. We can see that wherever we had “meatmeal” or “soybean” in `feed`, we have “Company B” or “Company A” in `feed3`, respectively.

### 9.4.1 Make Levels Missing

Missing values can be coded in a variety of ways. To turn a level of a factor into missing, recode it as `NULL`.

All the corresponding values are converted to `NA`, and the level we made missing is removed from the `levels()` output.

```
y_miss <- fct_recode(y, NULL = "b")
y_miss
[1] a <NA> c <NA> d a a
Levels: a c d
as.numeric(y_miss)
[1] 1 NA 2 NA 3 1 1
levels(y_miss)
[1] "a" "c" "d"
```
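As with other recoding, it is worth confirming how many values became missing. A minimal sketch; note that `table()` omits `NA` unless asked to show it, which is done here with `useNA = "ifany"`:

```r
library(forcats)

y <- factor(c("a", "b", "c", "b", "d", "a", "a"))
y_miss <- fct_recode(y, NULL = "b")

# Number of values that became missing
sum(is.na(y_miss))

# Cross-tabulate old against new, keeping the NA column visible
table(y, y_miss, useNA = "ifany")
```

The count of missing values should equal the number of times “b” appeared in `y`.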

### 9.4.2 Numeric Categorical Variables

Sometimes, variables appear to be continuous, numeric variables, but they are actually categorical variables.

An example of this is if we have Likert scale data. Responses may range 1-5 and represent level of agreement. Some may argue that we can treat such a variable as continuous, but for now we will force it to be categorical.

`survey <- sample(1:5, 100, replace = TRUE)`

If we were to pass `survey` to `fct_recode()`, we would get an error:

```
survey <- fct_recode(survey,
"Strongly agree" = 1,
"Agree" = 2,
"Neither agree nor disagree" = 3,
"Disagree" = 4,
"Strongly disagree" = 5)
```

```
Error in `fct_recode()`:
! `.f` must be a factor or character vector, not an integer vector
```

This is because `fct_recode()` (as well as all the other `fct_*()` functions) requires a character or factor vector. If we want to manipulate a numeric vector, first coerce it to a character, and then recode it. We need to be sure to quote the right half of each of our recoding pairs, since `survey`’s values are now character (e.g., `"1"`) rather than numeric (`1`).

```
survey <- fct_recode(as.character(survey),
"Strongly agree" = "1",
"Agree" = "2",
"Neither agree nor disagree" = "3",
"Disagree" = "4",
"Strongly disagree" = "5")
```
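Base R offers another route for numeric codes: `factor()` accepts the values through `levels` and their names through `labels`, matched by position. The sketch below is an alternative to the `fct_recode()` call above; the seed and the names `survey_num`, `agreement`, and `survey_fct` are assumptions made for this example.

```r
set.seed(1)   # arbitrary seed, only so the sketch is reproducible
survey_num <- sample(1:5, 100, replace = TRUE)

agreement <- c("Strongly agree",
               "Agree",
               "Neither agree nor disagree",
               "Disagree",
               "Strongly disagree")

# levels lists the values found in the data; labels names them, in the same order.
survey_fct <- factor(survey_num, levels = 1:5, labels = agreement)

# The integer codes are unchanged, so the recoding can be checked directly.
identical(as.integer(survey_fct), survey_num)
table(survey_num, survey_fct)
```

Because the levels are given as 1 through 5, the resulting factor keeps the scale’s natural order rather than sorting the labels alphabetically.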

## 9.5 Collapsing

Another factor manipulation is reducing the number of levels, called collapsing. Sometimes we have a factor with many levels, but very few observations exist at some levels. This can cause problems in estimation, especially in logistic regression models. We can combine levels with few observations together.

We can do this with `fct_collapse()`, and our arguments follow the pattern `new_level = c(current_level_1, current_level_2, ...)`.

We can combine levels from `y` so that “a” is in a new level called “vowel”, and the other three letters (“b”, “c”, and “d”) are combined into a group called “consonant.”

```
y_collapse <- fct_collapse(y,
"vowel" = "a",
"consonant" = c("b", "c", "d"))
```

Our levels for `y_collapse` are now as follows:

| Integer | Label |
|---|---|
| 1 | vowel |
| 2 | consonant |

This process is irreversible since we have lost data. We could change “vowel” back into “a” with `fct_recode()` since this level of `y_collapse` is only composed of one level from `y`. However, because “consonant” is a composite of multiple levels, it is impossible to separate this level back into “b”, “c”, and “d”. Therefore, it is a good idea to create a new object or variable when collapsing a factor.

Compare the output and structure of `y` and `y_collapse` to see how the data has changed:

```
y
[1] a b c b d a a
Levels: a b c d
y_collapse
[1] vowel consonant consonant consonant consonant vowel vowel
Levels: vowel consonant
as.numeric(y)
[1] 1 2 3 2 4 1 1
as.numeric(y_collapse)
[1] 1 2 2 2 2 1 1
levels(y)
[1] "a" "b" "c" "d"
levels(y_collapse)
[1] "vowel" "consonant"
```

After collapsing, we should compare the original and new factors to verify the coding worked as intended:

`table(y, y_collapse)`

```
y_collapse
y vowel consonant
a 3 0
b 0 2
c 0 1
d 0 1
```

All the values of “a” in `y` correspond to “vowel” in `y_collapse`, while “b”, “c”, and “d” correspond to “consonant.”
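The same pattern applies to the survey responses from Section 9.2, where “no answer” and “not applicable” could be combined into one category. A minimal sketch under a few assumptions: the seed is arbitrary, the levels are set explicitly so that all five exist even if one is never sampled, and the label “MISSING” and the name `responses_collapsed` are chosen only for illustration.

```r
library(forcats)

answers <- c("AGREE", "NEITHER", "DISAGREE", "NOT APPLICABLE", "NO ANSWER")

set.seed(1)   # arbitrary seed, only so the sketch is reproducible
responses <- factor(sample(answers, 100, replace = TRUE), levels = answers)

# Levels not named in the call are left as they are.
responses_collapsed <- fct_collapse(responses,
                                    "MISSING" = c("NOT APPLICABLE", "NO ANSWER"))

# Verify: both original levels should map only to "MISSING".
table(responses, responses_collapsed)
```

The result is stored in a new object, so the original five-level factor is still available if the distinction is needed later.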

### 9.5.1 Empty Levels

If we take a subset of our data, the levels data for factor variables remains unchanged, even if we have excluded all observations at a certain level.

Take a subset of the first five rows of `chickwts` and store them as `chickwts_subset`.

`chickwts_subset <- chickwts[1:5, ]`

The levels data for `chickwts_subset` still includes all six types of feed, even though we only have “horsebean” in our subset.

```
chickwts_subset$feed
[1] horsebean horsebean horsebean horsebean horsebean
Levels: casein horsebean linseed meatmeal soybean sunflower
as.numeric(chickwts_subset$feed)
[1] 2 2 2 2 2
levels(chickwts_subset$feed)
[1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"
```

This can have some annoying consequences for some plots, where levels with no observations are still included:

`plot(weight ~ feed, chickwts_subset)`

To remove these extra levels, simply use `factor()` to “reset” the factor and drop unused levels:

`chickwts_subset$feed <- factor(chickwts_subset$feed)`

Now verify that these levels were removed:

```
chickwts_subset$feed
[1] horsebean horsebean horsebean horsebean horsebean
Levels: horsebean
as.numeric(chickwts_subset$feed)
[1] 1 1 1 1 1
levels(chickwts_subset$feed)
[1] "horsebean"
plot(weight ~ feed, chickwts_subset)
```
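Two other functions do the same job and may read more clearly in a script: base R’s `droplevels()`, which works on a single factor or on every factor column of a data frame, and `fct_drop()` from `forcats`. A short sketch, starting again from a fresh subset:

```r
library(forcats)

chickwts_subset <- chickwts[1:5, ]

# Base R: droplevels() on a single factor ...
levels(droplevels(chickwts_subset$feed))

# ... or on a whole data frame, which resets every factor column at once.
levels(droplevels(chickwts_subset)$feed)

# forcats: fct_drop() drops unused levels from a single factor.
levels(fct_drop(chickwts_subset$feed))
```

All three calls return a factor whose only level is “horsebean”, matching the result of `factor()` above.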

## 9.6 Cheat Sheet

A simple cheat sheet for factor manipulation is given below. It has examples of `fct_relevel()`, `fct_recode()`, and `fct_collapse()` with the `y` vector, showing the integer vector and integer-label mapping after each operation.
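A minimal sketch of those three operations in code follows. The helper `show_factor()` is hypothetical, written only for this summary and not part of `forcats`; it prints the integer vector and the integer-label key of whatever factor it is given.

```r
library(forcats)

y <- factor(c("a", "b", "c", "b", "d", "a", "a"))

# Hypothetical helper: print a factor's integer vector and its key.
show_factor <- function(f) {
  cat("Integers:", as.integer(f), "\n")
  print(data.frame(Integer = seq_along(levels(f)), Label = levels(f)))
  invisible(f)
}

# Original factor
show_factor(y)

# Releveling: changes the order of the levels, so the integers change
y_relevel <- fct_relevel(y, "b")
show_factor(y_relevel)

# Recoding: changes the labels only, so the integers stay the same
y_recode <- fct_recode(y,
                       "apple" = "a",
                       "banana" = "b",
                       "cantaloupe" = "c",
                       "durian" = "d")
show_factor(y_recode)

# Collapsing: reduces the number of levels
y_collapse <- fct_collapse(y,
                           "vowel" = "a",
                           "consonant" = c("b", "c", "d"))
show_factor(y_collapse)
```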

## 9.7 Exercises

1. Releveling: Using the `iris` dataset, plot counts by factor level with `plot(iris$Species)`. Now, relevel `Species` so that `versicolor` is the reference (first) category. Plot it again. What do you notice?

2. Recoding: In the `mtcars` data, all the variables are numeric. Convert `vs` to a factor, where 0 has the label “V-shaped” and 1 has the label “Straight”.

3. Collapsing: `mtcars$cyl` has three different values: 4, 6, and 8. Convert it into a two-level factor, where 4 and 6 share the label “Few” and 8 has the label “Many”.