19 Reshaping

Reshaping involves changing how data is organized by moving information between rows and columns. The reason we need to reshape our data and move flexibly between wide and long formats is that certain data wrangling operations and model specifications are easier done in either format.

The tidyr package has functions for reshaping data, so load that in addition to dplyr for data manipulation functions, as well as a fresh airquality dataset.

library(tidyr)
library(dplyr)

air <- airquality

19.1 Data “Shapes”

Dataframes can be organized in different ways for different purposes. Dataframes often come in less-than-ideal formats, especially when you are using secondary data. It is important to know how to rearrange the data to prepare it for tables or for plotting with ggplot2. Data comes in two primary shapes: wide and long.

19.1.1 Wide Data

Data is wide when a row has more than one observation, and the units of observation (e.g., individuals, countries, households) are on one row each.

You might run into this format if you work with survey or assessment data, or if you have ever downloaded data from Qualtrics. In the example below, each row corresponds to a single person, and each column is a different observation for that person.

ID	Income2000	Income2001	Income2002
001	50000	52000	56000
002	0	30000	31000
003	6800	6400	6850
…

19.1.2 Long Data

Data is long when a row has only one observation, but the units of observation are repeated down a column.

Longitudinal data is often in the long format. You might have a column where ID numbers are repeated, a column marking when each data point was observed, and another column with observed values.

Here is the same dataset as above, now arranged into a long format:

ID	Year	Income
001	2000	50000
001	2001	52000
001	2002	56000
002	2000	0
002	2001	30000
002	2002	31000
003	2000	6800
003	2001	6400
003	2002	6850
…

19.1.3 The Shape of `airquality`

Is the air dataframe in wide or long format?

head(air)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Our answer to that question depends on what variables we are interested in and how we conceive of our data.

Does the air dataframe contain multiple observations (Ozone, Solar.R, Wind, Temp) of interest per row, and do we conceive of Day as the unit of observation? If so, this is a wide dataframe since we have multiple observations per row (Day).

Or, are we more interested in just one of the variables (such as Temp), and do we think of Month as the unit of observation? If so, then air is in long format since Month is repeated down its column across observations.

19.2 Making Long Data Wide

If our data is long, we can reshape (or “pivot”) it into a wide format with the aptly named pivot_wider() function.

First, select the units of observation and the column where the observed values lie, and pass these to pivot_wider().

air_wide <- 
  air |> 
  select(Temp, Month, Day) |>
  pivot_wider(names_from = Day, 
              values_from = Temp)

Take a moment to open air and air_wide in the Viewer to see what just happened.

The resulting dataframe, air_wide, is not ideal. We specified that the column names should come from our Day column. Day was numeric, so now we have column names that are numbers.

Numbers are “non-syntactic” object names, so we have to set them off with backticks (`). In this case, it is especially confusing because the column named 1 is the 2nd column.

air_wide$`1`

[1] 67 78 84 81 91

air_wide[, 2]

# A tibble: 5 × 1
    `1`
  <int>
1    67
2    78
3    84
4    81
5    91

To fix this, we can prefix the resulting column names with the word “Day” with the names_prefix argument.

air_wide <- 
  air |> 
  select(Temp, Month, Day) |>
  pivot_wider(names_from = Day, 
              values_from = Temp,
              names_prefix = "Day")

Now the column names are a bit easier to handle and call. Day 1 is now named Day1, day 2 is Day2, and so on.

19.2.1 New Missing Data

air contains months of different lengths: June and September have 30 days each. In the un-modified air dataframe, there are no rows for June 31 or September 31. Only the months with 31 days (May, July, August) have rows corresponding to day 31.

air |> filter(Day %in% 31)

  Ozone Solar.R Wind Temp Month Day
1    37     279  7.4   76     5  31
2    59     254  9.2   81     7  31
3    85     188  6.3   94     8  31

The Temp column of air does not have any missing data.

air |> select(Temp) |> is.na() |> sum()

[1] 0

However, after making the dataframe wide, Day31 was filled in with NA for June and September.

air_wide |> select(Month, Day31)

# A tibble: 5 × 2
  Month Day31
  <int> <int>
1     5    76
2     6    NA
3     7    81
4     8    94
5     9    NA

Recall that a dataframe is a series of same-length vectors. Even though day 31 only had three observed values, the length of the Day31 column had to be five to match the lengths of the other columns. NA values were supplied to fill in the gaps.

The same thing would happen if data were also collected for the first ten days of October. Days 11-31 would be filled in with NA when converting to wide format.

19.3 Making Wide Data Long

Alternatively, if a dataframe is in the wide format, it can be converted into the long format with the function pivot_longer().

Supply the cols argument with a vector (c()) or range (with :) of columns containing observations. Quotes are not needed. (The cols argument can also take other selection functions, such as starts_with() or where(). See the chapter on Subsetting for more.)

air_long <- 
  air |> 
  pivot_longer(cols = Ozone:Temp, 
               names_to = "Variable", 
               values_to = "Value")

The resulting Variable column contains the names of the different measurement variables, and Value contains the observed values. In this format, it is not appropriate to calculate mean(air_long$Value) because this column contains values for different variables.

Instead, subset by the Variable column to compute summary statistics.

air_long |> 
  filter(Variable %in% "Temp") |> 
  summarize(TempAvg = mean(Value))

# A tibble: 1 × 1
  TempAvg
    <dbl>
1    77.9

Or, group by Month and Variable to quickly calculate mean values by month for each variable, remembering to set na.rm to TRUE because of missing data.

air_long |> 
  group_by(Month, Variable) |> 
  summarize(Avg = mean(Value, na.rm = T))

`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.

# A tibble: 20 × 3
# Groups:   Month [5]
   Month Variable    Avg
   <int> <chr>     <dbl>
 1     5 Ozone     23.6 
 2     5 Solar.R  181.  
 3     5 Temp      65.5 
 4     5 Wind      11.6 
 5     6 Ozone     29.4 
 6     6 Solar.R  190.  
 7     6 Temp      79.1 
 8     6 Wind      10.3 
 9     7 Ozone     59.1 
10     7 Solar.R  216.  
11     7 Temp      83.9 
12     7 Wind       8.94
13     8 Ozone     60.0 
14     8 Solar.R  172.  
15     8 Temp      84.0 
16     8 Wind       8.79
17     9 Ozone     31.4 
18     9 Solar.R  167.  
19     9 Temp      76.9 
20     9 Wind      10.2

When pivoting the data to the wide format above, NA was supplied to make the dataframe rectangular. In contrast, when converting to long format, rows with no observations are preserved by default. To change this, set the argument values_drop_na to TRUE. (Its default is FALSE, as can be seen in help(pivot_longer).)

nrow(air_long)

[1] 612

air_long |> filter(is.na(Value))

# A tibble: 44 × 4
   Month   Day Variable Value
   <int> <int> <chr>    <dbl>
 1     5     5 Ozone       NA
 2     5     5 Solar.R     NA
 3     5     6 Solar.R     NA
 4     5    10 Ozone       NA
 5     5    11 Solar.R     NA
 6     5    25 Ozone       NA
 7     5    26 Ozone       NA
 8     5    27 Ozone       NA
 9     5    27 Solar.R     NA
10     6     1 Ozone       NA
# ℹ 34 more rows

air_long <- 
  air |> 
  pivot_longer(cols = Ozone:Temp, 
               names_to = "Variable", 
               values_to = "Value",
               values_drop_na = TRUE)

nrow(air_long)

[1] 568

air_long |> filter(is.na(Value))

# A tibble: 0 × 4
# ℹ 4 variables: Month <int>, Day <int>, Variable <chr>, Value <dbl>

19.4 Exercises

Convert the WorldPhones dataset into a dataframe, and add a column called “Year” from its row names. Then reshape it into a long format. When reshaping, be sure to name the new columns appropriately.
Reshape ChickWeight into a wide format with columns created from Time.

19.5 Data Wrangling Exercises

Putting everything together now,

Reshape us_rent_income (from the tidyr package) so that it has one line per state, and two new columns named estimate_income and estimate_rent that contain values from estimate.
Merge this with state.x77, and keep all rows. Then, drop rows where any values are missing. You can do this in one or two steps.
Add a column containing state.division.
Add a column with the proportion of income spent on rent (rent / income).
Drop rows where Area is not greater than ten times Frost.
Replace all spaces in all column names with dashes (e.g., HS Grad to HS-Grad).
Without removing any rows, add a column with the population-weighted mean rent by geographic division.

Which division has the highest mean rent?

Save the resulting dataframe as a CSV file, a tab-delimited text file, and an RDS file.