19 Reshaping
Reshaping involves changing how data is organized by moving information between rows and columns. The reason we need to reshape our data and move flexibly between wide and long formats is that certain data wrangling operations and model specifications are easier done in either format.
The tidyr
package has functions for reshaping data, so load that in addition to dplyr
for data manipulation functions, as well as a fresh airquality
dataset.
library(tidyr)
library(dplyr)
air <- airquality
19.1 Data “Shapes”
Dataframes can be organized in different ways for different purposes. Dataframes often come in less-than-ideal formats, especially when you are using secondary data. It is important to know how to rearrange the data to prepare it for tables or for plotting with ggplot2
. Data comes in two primary shapes: wide and long.
19.1.1 Wide Data
Data is wide when a row has more than one observation, and the units of observation (e.g., individuals, countries, households) are on one row each.
You might run into this format if you work with survey or assessment data, or if you have ever downloaded data from Qualtrics. In the example below, each row corresponds to a single person, and each column is a different observation for that person.
ID | Income2000 | Income2001 | Income2002 | … |
---|---|---|---|---|
001 | 50000 | 52000 | 56000 | |
002 | 0 | 30000 | 31000 | |
003 | 6800 | 6400 | 6850 | |
… |
19.1.2 Long Data
Data is long when a row has only one observation, but the units of observation are repeated down a column.
Longitudinal data is often in the long format. You might have a column where ID numbers are repeated, a column marking when each data point was observed, and another column with observed values.
Here is the same dataset as above, now arranged into a long format:
ID | Year | Income |
---|---|---|
001 | 2000 | 50000 |
001 | 2001 | 52000 |
001 | 2002 | 56000 |
002 | 2000 | 0 |
002 | 2001 | 30000 |
002 | 2002 | 31000 |
003 | 2000 | 6800 |
003 | 2001 | 6400 |
003 | 2002 | 6850 |
… |
19.1.3 The Shape of airquality
Is the air
dataframe in wide or long format?
head(air)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
Our answer to that question depends on what variables we are interested in and how we conceive of our data.
Does the air
dataframe contain multiple observations (Ozone
, Solar.R
, Wind
, Temp
) of interest per row, and do we conceive of Day
as the unit of observation? If so, this is a wide dataframe since we have multiple observations per row (Day
).
Or, are we more interested in just one of the variables (such as Temp
), and do we think of Month
as the unit of observation? If so, then air
is in long format since Month
is repeated down its column across observations.
19.2 Making Long Data Wide
If our data is long, we can reshape (or “pivot”) it into a wide format with the aptly named pivot_wider()
function.
First, select the units of observation and the column where the observed values lie, and pass these to pivot_wider()
.
air_wide <-
air |>
select(Temp, Month, Day) |>
pivot_wider(names_from = Day,
values_from = Temp)
Take a moment to open air
and air_wide
in the Viewer to see what just happened.
The resulting dataframe, air_wide
, is not ideal. We specified that the column names should come from our Day
column. Day
was numeric, so now we have column names that are numbers.
Numbers are “non-syntactic” object names, so we have to set them off with backticks (`
). In this case, it is especially confusing because the column named 1
is the 2nd column.
air_wide$`1`
[1] 67 78 84 81 91
air_wide[, 2]
# A tibble: 5 × 1
`1`
<int>
1 67
2 78
3 84
4 81
5 91
To fix this, we can prefix the resulting column names with the word “Day” with the names_prefix
argument.
air_wide <-
air |>
select(Temp, Month, Day) |>
pivot_wider(names_from = Day,
values_from = Temp,
names_prefix = "Day")
Now the column names are a bit easier to handle and call. Day 1 is now named Day1
, day 2 is Day2
, and so on.
19.2.1 New Missing Data
air
contains months of different lengths: June and September have 30 days each. In the un-modified air
dataframe, there are no rows for June 31 or September 31. Only the months with 31 days (May, July, August) have rows corresponding to day 31.
air |> filter(Day %in% 31)
Ozone Solar.R Wind Temp Month Day
1 37 279 7.4 76 5 31
2 59 254 9.2 81 7 31
3 85 188 6.3 94 8 31
The Temp
column of air
does not have any missing data.
air |> select(Temp) |> is.na() |> sum()
[1] 0
However, after making the dataframe wide, Day31
was filled in with NA
for June and September.
air_wide |> select(Month, Day31)
# A tibble: 5 × 2
Month Day31
<int> <int>
1 5 76
2 6 NA
3 7 81
4 8 94
5 9 NA
Recall that a dataframe is a series of same-length vectors. Even though day 31 only had three observed values, the length of the Day31
column had to be five to match the lengths of the other columns. NA
values were supplied to fill in the gaps.
The same thing would happen if data were also collected for the first ten days of October. Days 11-31 would be filled in with NA
when converting to wide format.
19.3 Making Wide Data Long
Alternatively, if a dataframe is in the wide format, it can be converted into the long format with the function pivot_longer()
.
Supply the cols
argument with a vector (c()
) or range (with :
) of columns containing observations. Quotes are not needed. (The cols
argument can also take other selection functions, such as starts_with()
or where()
. See the chapter on Subsetting for more.)
air_long <-
air |>
pivot_longer(cols = Ozone:Temp,
names_to = "Variable",
values_to = "Value")
The resulting Variable
column contains the names of the different measurement variables, and Value
contains the observed values. In this format, it is not appropriate to calculate mean(air_long$Value)
because this column contains values for different variables.
Instead, subset by the Variable
column to compute summary statistics.
air_long |>
filter(Variable %in% "Temp") |>
summarize(TempAvg = mean(Value))
# A tibble: 1 × 1
TempAvg
<dbl>
1 77.9
Or, group by Month
and Variable
to quickly calculate mean values by month for each variable, remembering to set na.rm
to TRUE
because of missing data.
air_long |>
group_by(Month, Variable) |>
summarize(Avg = mean(Value, na.rm = T))
`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.
# A tibble: 20 × 3
# Groups: Month [5]
Month Variable Avg
<int> <chr> <dbl>
1 5 Ozone 23.6
2 5 Solar.R 181.
3 5 Temp 65.5
4 5 Wind 11.6
5 6 Ozone 29.4
6 6 Solar.R 190.
7 6 Temp 79.1
8 6 Wind 10.3
9 7 Ozone 59.1
10 7 Solar.R 216.
11 7 Temp 83.9
12 7 Wind 8.94
13 8 Ozone 60.0
14 8 Solar.R 172.
15 8 Temp 84.0
16 8 Wind 8.79
17 9 Ozone 31.4
18 9 Solar.R 167.
19 9 Temp 76.9
20 9 Wind 10.2
When pivoting the data to the wide format above, NA
was supplied to make the dataframe rectangular. In contrast, when converting to long format, rows with no observations are preserved by default. To change this, set the argument values_drop_na
to TRUE
. (Its default is FALSE
, as can be seen in help(pivot_longer)
.)
nrow(air_long)
[1] 612
air_long |> filter(is.na(Value))
# A tibble: 44 × 4
Month Day Variable Value
<int> <int> <chr> <dbl>
1 5 5 Ozone NA
2 5 5 Solar.R NA
3 5 6 Solar.R NA
4 5 10 Ozone NA
5 5 11 Solar.R NA
6 5 25 Ozone NA
7 5 26 Ozone NA
8 5 27 Ozone NA
9 5 27 Solar.R NA
10 6 1 Ozone NA
# ℹ 34 more rows
air_long <-
air |>
pivot_longer(cols = Ozone:Temp,
names_to = "Variable",
values_to = "Value",
values_drop_na = TRUE)
nrow(air_long)
[1] 568
air_long |> filter(is.na(Value))
# A tibble: 0 × 4
# ℹ 4 variables: Month <int>, Day <int>, Variable <chr>, Value <dbl>
19.4 Exercises
Convert the
WorldPhones
dataset into a dataframe, and add a column called “Year” from its row names. Then reshape it into a long format. When reshaping, be sure to name the new columns appropriately.Reshape
ChickWeight
into a wide format with columns created fromTime
.
19.5 Data Wrangling Exercises
Putting everything together now,
Reshape
us_rent_income
(from thetidyr
package) so that it has one line per state, and two new columns namedestimate_income
andestimate_rent
that contain values fromestimate
.Merge this with
state.x77
, and keep all rows. Then, drop rows where any values are missing. You can do this in one or two steps.Add a column containing
state.division
.Add a column with the proportion of income spent on rent (rent / income).
Drop rows where
Area
is not greater than ten timesFrost
.Replace all spaces in all column names with dashes (e.g.,
HS Grad
toHS-Grad
).Without removing any rows, add a column with the population-weighted mean rent by geographic division.
- Which division has the highest mean rent?
- Save the resulting dataframe as a CSV file, a tab-delimited text file, and an RDS file.