Reshaping involves changing how data is organized by moving information between rows and columns. The reason we need to reshape our data and move flexibly between wide and long formats is that certain data wrangling operations and model specifications are easier done in either format.
tidyr package has functions for reshaping data, so load that in addition to
dplyr for data manipulation functions, as well as a fresh
library(tidyr) library(dplyr) air <- airquality
Dataframes can be organized in different ways for different purposes. Dataframes often come in less-than-ideal formats, especially when you are using secondary data. It is important to know how to rearrange the data to prepare it for tables or for plotting with
ggplot2. Data comes in two primary shapes: wide and long.
Data is wide when a row has more than one observation, and the units of observation (e.g., individuals, countries, households) are on one row each.
You might run into this format if you work with survey or assessment data, or if you have ever downloaded data from Qualtrics. In the example below, each row corresponds to a single person, and each column is a different observation for that person.
Data is long when a row has only one observation, but the units of observation are repeated down a column.
Longitudinal data is often in the long format. You might have a column where ID numbers are repeated, a column marking when each data point was observed, and another column with observed values.
Here is the same dataset as above, now arranged into a long format:
air dataframe in wide or long format?
Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6
Our answer to that question depends on what variables we are interested in and how we conceive of our data.
air dataframe contain multiple observations (
Temp) of interest per row, and do we conceive of
Day as the unit of observation? If so, this is a wide dataframe since we have multiple observations per row (
Or, are we more interested in just one of the variables (such as
Temp), and do we think of
Month as the unit of observation? If so, then
air is in long format since
Month is repeated down its column across observations.
If our data is long, we can reshape (or “pivot”) it into a wide format with the aptly named
First, select the units of observation and the column where the observed values lie, and pass these to
air_wide <- air |> select(Temp, Month, Day) |> pivot_wider(names_from = Day, values_from = Temp)
Take a moment to open
air_wide in the Viewer to see what just happened.
The resulting dataframe,
air_wide, is not ideal. We specified that the column names should come from our
Day was numeric, so now we have column names that are numbers.
Numbers are “non-syntactic” object names, so we have to set them off with backticks (
`). In this case, it is especially confusing because the column named
1 is the 2nd column.
 67 78 84 81 91
# A tibble: 5 x 1 `1` <int> 1 67 2 78 3 84 4 81 5 91
To fix this, we can prefix the resulting column names with the word “Day” with the
air_wide <- air |> select(Temp, Month, Day) |> pivot_wider(names_from = Day, values_from = Temp, names_prefix = "Day")
Now the column names are a bit easier to handle and call. Day 1 is now named
Day1, day 2 is
Day2, and so on.
air contains months of different lengths: June and September have 30 days each. In the un-modified
air dataframe, there are no rows for June 31 or September 31. Only the months with 31 days (May, July, August) have rows corresponding to day 31.
air |> filter(Day %in% 31)
Ozone Solar.R Wind Temp Month Day 1 37 279 7.4 76 5 31 2 59 254 9.2 81 7 31 3 85 188 6.3 94 8 31
Temp column of
air does not have any missing data.
air |> select(Temp) |> is.na() |> sum()
However, after making the dataframe wide,
Day31 was filled in with
NA for June and September.
air_wide |> select(Month, Day31)
# A tibble: 5 x 2 Month Day31 <int> <int> 1 5 76 2 6 NA 3 7 81 4 8 94 5 9 NA
Recall that a dataframe is a series of same-length vectors. Even though day 31 only had three observed values, the length of the
Day31 column had to be five to match the lengths of the other columns.
NA values were supplied to fill in the gaps.
The same thing would happen if data were also collected for the first ten days of October. Days 11-31 would be filled in with
NA when converting to wide format.
Alternatively, if a dataframe is in the wide format, it can be converted into the long format with the function
cols argument with a vector (
c()) or range (with
:) of columns containing observations. Quotes are not needed. (The
cols argument can also take other selection functions, such as
where(). See the chapter on Subsetting for more.)
air_long <- air |> pivot_longer(cols = Ozone:Temp, names_to = "Variable", values_to = "Value")
Variable column contains the names of the different measurement variables, and
Value contains the observed values. In this format, it is not appropriate to calculate
mean(air_long$Value) because this column contains values for different variables.
Instead, subset by the
Variable column to compute summary statistics.
air_long |> filter(Variable %in% "Temp") |> summarize(TempAvg = mean(Value))
# A tibble: 1 x 1 TempAvg <dbl> 1 77.9
Or, group by
Variable to quickly calculate mean values by month for each variable, remembering to set
TRUE because of missing data.
air_long |> group_by(Month, Variable) |> summarize(Avg = mean(Value, na.rm = T))
`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.
# A tibble: 20 x 3 # Groups: Month  Month Variable Avg <int> <chr> <dbl> 1 5 Ozone 23.6 2 5 Solar.R 181. 3 5 Temp 65.5 4 5 Wind 11.6 5 6 Ozone 29.4 6 6 Solar.R 190. 7 6 Temp 79.1 8 6 Wind 10.3 9 7 Ozone 59.1 10 7 Solar.R 216. 11 7 Temp 83.9 12 7 Wind 8.94 13 8 Ozone 60.0 14 8 Solar.R 172. 15 8 Temp 84.0 16 8 Wind 8.79 17 9 Ozone 31.4 18 9 Solar.R 167. 19 9 Temp 76.9 20 9 Wind 10.2
When pivoting the data to the wide format above,
NA was supplied to make the dataframe rectangular. In contrast, when converting to long format, rows with no observations are preserved by default. To change this, set the argument
TRUE. (Its default is
FALSE, as can be seen in
air_long |> filter(is.na(Value))
# A tibble: 44 x 4 Month Day Variable Value <int> <int> <chr> <dbl> 1 5 5 Ozone NA 2 5 5 Solar.R NA 3 5 6 Solar.R NA 4 5 10 Ozone NA 5 5 11 Solar.R NA 6 5 25 Ozone NA 7 5 26 Ozone NA 8 5 27 Ozone NA 9 5 27 Solar.R NA 10 6 1 Ozone NA # ... with 34 more rows
air_long <- air |> pivot_longer(cols = Ozone:Temp, names_to = "Variable", values_to = "Value", values_drop_na = TRUE) nrow(air_long)
air_long |> filter(is.na(Value))
# A tibble: 0 x 4 # ... with 4 variables: Month <int>, Day <int>, Variable <chr>, Value <dbl>
WorldPhonesdataset into a dataframe, and add a column called “Year” from its row names. Then reshape it into a long format. When reshaping, be sure to name the new columns appropriately.
ChickWeightinto a wide format with columns created from
Putting everything together now,
tidyrpackage) so that it has one line per state, and two new columns named
estimate_rentthat contain values from
Merge this with
state.x77, and keep all rows. Then, drop rows where any values are missing. You can do this in one or two steps.
Add a column containing
Add a column with the proportion of income spent on rent (rent / income).
Drop rows where
Areais not greater than ten times
Replace all spaces in all column names with dashes (e.g.,
Without removing any rows, add a column with the population-weighted mean rent by geographic division.
- Which division has the highest mean rent?
- Save the resulting dataframe as a CSV file, a tab-delimited text file, and an RDS file.