19 Reshaping

Reshaping involves changing how data is organized by moving information between rows and columns. The reason we need to reshape our data and move flexibly between wide and long formats is that certain data wrangling operations and model specifications are easier done in either format.

The tidyr package has functions for reshaping data, so load that in addition to dplyr for data manipulation functions, as well as a fresh airquality dataset.

library(tidyr)
library(dplyr)

air <- airquality

19.1 Data “Shapes”

Dataframes can be organized in different ways for different purposes. Dataframes often come in less-than-ideal formats, especially when you are using secondary data. It is important to know how to rearrange the data to prepare it for tables or for plotting with ggplot2. Data comes in two primary shapes: wide and long.

19.1.1 Wide Data

Data is wide when a row has more than one observation, and the units of observation (e.g., individuals, countries, households) are on one row each.

You might run into this format if you work with survey or assessment data, or if you have ever downloaded data from Qualtrics. In the example below, each row corresponds to a single person, and each column is a different observation for that person.

ID Income2000 Income2001 Income2002
001 50000 52000 56000
002 0 30000 31000
003 6800 6400 6850

19.1.2 Long Data

Data is long when a row has only one observation, but the units of observation are repeated down a column.

Longitudinal data is often in the long format. You might have a column where ID numbers are repeated, a column marking when each data point was observed, and another column with observed values.

Here is the same dataset as above, now arranged into a long format:

ID Year Income
001 2000 50000
001 2001 52000
001 2002 56000
002 2000 0
002 2001 30000
002 2002 31000
003 2000 6800
003 2001 6400
003 2002 6850

19.1.3 The Shape of airquality

Is the air dataframe in wide or long format?

head(air)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Our answer to that question depends on what variables we are interested in and how we conceive of our data.

Does the air dataframe contain multiple observations (Ozone, Solar.R, Wind, Temp) of interest per row, and do we conceive of Day as the unit of observation? If so, this is a wide dataframe since we have multiple observations per row (Day).

Or, are we more interested in just one of the variables (such as Temp), and do we think of Month as the unit of observation? If so, then air is in long format since Month is repeated down its column across observations.

19.2 Making Long Data Wide

If our data is long, we can reshape (or “pivot”) it into a wide format with the aptly named pivot_wider() function.

First, select the units of observation and the column where the observed values lie, and pass these to pivot_wider().

air_wide <- 
  air |> 
  select(Temp, Month, Day) |>
  pivot_wider(names_from = Day, 
              values_from = Temp)

Take a moment to open air and air_wide in the Viewer to see what just happened.

The resulting dataframe, air_wide, is not ideal. We specified that the column names should come from our Day column. Day was numeric, so now we have column names that are numbers.

Numbers are “non-syntactic” object names, so we have to set them off with backticks (`). In this case, it is especially confusing because the column named 1 is the 2nd column.

air_wide$`1`
[1] 67 78 84 81 91
air_wide[, 2]
# A tibble: 5 × 1
    `1`
  <int>
1    67
2    78
3    84
4    81
5    91

To fix this, we can prefix the resulting column names with the word “Day” with the names_prefix argument.

air_wide <- 
  air |> 
  select(Temp, Month, Day) |>
  pivot_wider(names_from = Day, 
              values_from = Temp,
              names_prefix = "Day")

Now the column names are a bit easier to handle and call. Day 1 is now named Day1, day 2 is Day2, and so on.

19.2.1 New Missing Data

air contains months of different lengths: June and September have 30 days each. In the un-modified air dataframe, there are no rows for June 31 or September 31. Only the months with 31 days (May, July, August) have rows corresponding to day 31.

air |> filter(Day %in% 31)
  Ozone Solar.R Wind Temp Month Day
1    37     279  7.4   76     5  31
2    59     254  9.2   81     7  31
3    85     188  6.3   94     8  31

The Temp column of air does not have any missing data.

air |> select(Temp) |> is.na() |> sum()
[1] 0

However, after making the dataframe wide, Day31 was filled in with NA for June and September.

air_wide |> select(Month, Day31)
# A tibble: 5 × 2
  Month Day31
  <int> <int>
1     5    76
2     6    NA
3     7    81
4     8    94
5     9    NA

Recall that a dataframe is a series of same-length vectors. Even though day 31 only had three observed values, the length of the Day31 column had to be five to match the lengths of the other columns. NA values were supplied to fill in the gaps.

The same thing would happen if data were also collected for the first ten days of October. Days 11-31 would be filled in with NA when converting to wide format.

19.3 Making Wide Data Long

Alternatively, if a dataframe is in the wide format, it can be converted into the long format with the function pivot_longer().

Supply the cols argument with a vector (c()) or range (with :) of columns containing observations. Quotes are not needed. (The cols argument can also take other selection functions, such as starts_with() or where(). See the chapter on Subsetting for more.)

air_long <- 
  air |> 
  pivot_longer(cols = Ozone:Temp, 
               names_to = "Variable", 
               values_to = "Value")

The resulting Variable column contains the names of the different measurement variables, and Value contains the observed values. In this format, it is not appropriate to calculate mean(air_long$Value) because this column contains values for different variables.

Instead, subset by the Variable column to compute summary statistics.

air_long |> 
  filter(Variable %in% "Temp") |> 
  summarize(TempAvg = mean(Value))
# A tibble: 1 × 1
  TempAvg
    <dbl>
1    77.9

Or, group by Month and Variable to quickly calculate mean values by month for each variable, remembering to set na.rm to TRUE because of missing data.

air_long |> 
  group_by(Month, Variable) |> 
  summarize(Avg = mean(Value, na.rm = T))
`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.
# A tibble: 20 × 3
# Groups:   Month [5]
   Month Variable    Avg
   <int> <chr>     <dbl>
 1     5 Ozone     23.6 
 2     5 Solar.R  181.  
 3     5 Temp      65.5 
 4     5 Wind      11.6 
 5     6 Ozone     29.4 
 6     6 Solar.R  190.  
 7     6 Temp      79.1 
 8     6 Wind      10.3 
 9     7 Ozone     59.1 
10     7 Solar.R  216.  
11     7 Temp      83.9 
12     7 Wind       8.94
13     8 Ozone     60.0 
14     8 Solar.R  172.  
15     8 Temp      84.0 
16     8 Wind       8.79
17     9 Ozone     31.4 
18     9 Solar.R  167.  
19     9 Temp      76.9 
20     9 Wind      10.2 

When pivoting the data to the wide format above, NA was supplied to make the dataframe rectangular. In contrast, when converting to long format, rows with no observations are preserved by default. To change this, set the argument values_drop_na to TRUE. (Its default is FALSE, as can be seen in help(pivot_longer).)

nrow(air_long)
[1] 612
air_long |> filter(is.na(Value))
# A tibble: 44 × 4
   Month   Day Variable Value
   <int> <int> <chr>    <dbl>
 1     5     5 Ozone       NA
 2     5     5 Solar.R     NA
 3     5     6 Solar.R     NA
 4     5    10 Ozone       NA
 5     5    11 Solar.R     NA
 6     5    25 Ozone       NA
 7     5    26 Ozone       NA
 8     5    27 Ozone       NA
 9     5    27 Solar.R     NA
10     6     1 Ozone       NA
# ℹ 34 more rows
air_long <- 
  air |> 
  pivot_longer(cols = Ozone:Temp, 
               names_to = "Variable", 
               values_to = "Value",
               values_drop_na = TRUE)

nrow(air_long)
[1] 568
air_long |> filter(is.na(Value))
# A tibble: 0 × 4
# ℹ 4 variables: Month <int>, Day <int>, Variable <chr>, Value <dbl>

19.4 Exercises

  1. Convert the WorldPhones dataset into a dataframe, and add a column called “Year” from its row names. Then reshape it into a long format. When reshaping, be sure to name the new columns appropriately.

  2. Reshape ChickWeight into a wide format with columns created from Time.

19.5 Data Wrangling Exercises

Putting everything together now,

  1. Reshape us_rent_income (from the tidyr package) so that it has one line per state, and two new columns named estimate_income and estimate_rent that contain values from estimate.

  2. Merge this with state.x77, and keep all rows. Then, drop rows where any values are missing. You can do this in one or two steps.

  3. Add a column containing state.division.

  4. Add a column with the proportion of income spent on rent (rent / income).

  5. Drop rows where Area is not greater than ten times Frost.

  6. Replace all spaces in all column names with dashes (e.g., HS Grad to HS-Grad).

  7. Without removing any rows, add a column with the population-weighted mean rent by geographic division.

  • Which division has the highest mean rent?
  1. Save the resulting dataframe as a CSV file, a tab-delimited text file, and an RDS file.