5.7 Relationships between columns
5.7.1 Data concepts
Columns can contain related data. For example, a patient record may contain a set of indicator variables that indicate different chronic conditions. If you want to know if a patient has no chronic conditions, you would need to test each of the indicator variables.
Tasks that consider multiple related variables often look for a particular condition in the set of variables. In the example above, the common condition was a true value of an indicator variable. Some results of testing for common conditions are: none of the columns meet the particular condition, at least one column meets the particular condition, or all of the column meets all particular condition.
5.7.2 Examples R
These examples use the cps1.csv
data set.
We begin by loading the tidyverse, importing the csv file, and naming variables.
library(tidyverse)
cps1_path <- file.path("..", "datasets", "cps1.csv") cps1_in <- read_csv(cps1_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
cps1_in <- cps1_in %>% rename( no_deg = nodeg, real_earn_74 = re74, real_earn_75 = re75, real_earn_78 = re78 ) cps <- cps1_in %>% select(-X1) head(cps)
# A tibble: 6 x 10 trt age educ black hisp marr no_deg real_earn_74 real_earn_75 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 0 45 11 0 0 1 1 21517. 25244. 2 0 21 14 0 0 0 0 3176. 5853. 3 0 38 12 0 0 1 0 23039. 25131. 4 0 48 6 0 0 1 1 24994. 25244. 5 0 18 8 0 0 1 1 1669. 10728. 6 0 22 11 0 0 1 1 16366. 18449. # ... with 1 more variable: real_earn_78 <dbl>
Create an indicator variable to identify someone who had no earnings (unemployed) in any of the years.
The tidyverse has a set of
map()
functions that allow you to apply a function over a set of parameters. For example, if you have a function that takes two parameters, and two vectors (variables) that each contain values for one of the parameters, then map would allow you to use each row of these vectors as parameters in a call to the function. A list of the same length as the variables is returned.The
map()
function applies one parameter to the function,map2()
applies two vectors as parameters, andpmap()
maps a list of parameters to the function.Each of the mapping functions have variants of
*_if()
,*_at()
,*_dbl()
, etc. These allow controls over which variables to include in a map or what the returned object is. The returned object from*_dbl()
is a vector of numeric values.This example will use the
pmap_lgl()
function. The first parameter is the variables to be passed as parameters. This is done using theselect()
function. Thepmap_lgl()
function works the same aspmap()
except a logical (boolean) vector is returned.Note, some functions may need the parameters passed as a vector instead of a list. The
lift_vd()
function can be used to do this.In the relationships between variables considered here, the variables are being tested for a common condition. As such, the variable values are gathered as vector and the test is applied to the vector. The set of parameters into the mapped function are identified with
...
, which collects all the parameters together. Theany()
andall()
functions are useful to determine if one or more variables met a condition or all of them do.cps <- cps %>% mutate( ever_unemployed = pmap_lgl(select(., real_earn_74:real_earn_78), ~ any(c(...) == 0)) ) cps %>% select(trt, age, real_earn_74:real_earn_78, ever_unemployed) %>% group_by(ever_unemployed) %>% group_modify(~ head(.x, 3))
# A tibble: 6 x 6 # Groups: ever_unemployed [2] ever_unemployed trt age real_earn_74 real_earn_75 real_earn_78 <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 FALSE 0 45 21517. 25244. 25565. 2 FALSE 0 21 3176. 5853. 13496. 3 FALSE 0 38 23039. 25131. 25565. 4 TRUE 0 45 25862. 0 3925. 5 TRUE 0 46 19171. 1318. 0 6 TRUE 0 49 392. 0 0
Note the use of
group_by()
andgroup_modify()
to construct a dataframe that shows a few examples from each level of theevery_unemployed
variable.The following alternate implementation uses the
min()
function to identify the minimum salary earned. If the minimum is 0, then they were unemplyed at least one year.cps <- cps %>% mutate( ever_unemployed = pmap_dbl(select(., real_earn_74:real_earn_78), ~ min(c(...))) == 0 ) cps %>% select(trt, age, real_earn_74:real_earn_78, ever_unemployed) %>% group_by(ever_unemployed) %>% group_modify(~ head(.x, 3))
# A tibble: 6 x 6 # Groups: ever_unemployed [2] ever_unemployed trt age real_earn_74 real_earn_75 real_earn_78 <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 FALSE 0 45 21517. 25244. 25565. 2 FALSE 0 21 3176. 5853. 13496. 3 FALSE 0 38 23039. 25131. 25565. 4 TRUE 0 45 25862. 0 3925. 5 TRUE 0 46 19171. 1318. 0 6 TRUE 0 49 392. 0 0
It is best to use the simplest and clearest approach to achive the needed results. Although the use of mathematical functions to achieve logical tests are farily common, they may not be the clearest expression of what is being done. It is best to see if a set of logical expressions can be used and are a clearer expression of what is being done.
Create an indicator variable to identify someone who was never unemployed in any of the years.
This is a check to see if the earnings for all years is greater than 0.
cps <- cps %>% mutate( never_unemployed = pmap_lgl(select(., real_earn_74:real_earn_78), ~ all(c(...) > 0)) ) cps %>% select(trt, age, real_earn_74:real_earn_78, never_unemployed) %>% group_by(never_unemployed) %>% group_modify(~ head(.x, 3))
# A tibble: 6 x 6 # Groups: never_unemployed [2] never_unemployed trt age real_earn_74 real_earn_75 real_earn_78 <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 FALSE 0 45 25862. 0 3925. 2 FALSE 0 46 19171. 1318. 0 3 FALSE 0 49 392. 0 0 4 TRUE 0 45 21517. 25244. 25565. 5 TRUE 0 21 3176. 5853. 13496. 6 TRUE 0 38 23039. 25131. 25565.
The following implementation uses the
%in%
opperator. The%in%
operator returns true if what is on left side in the vector of values on the right side. This can be useful when there is more than one possible matching criteria.cps <- cps %>% mutate( never_unemployed = pmap_dbl(select(., real_earn_74:real_earn_78), ~ sum((c(...) %in% 0))) == 0 ) cps %>% select(trt, age, real_earn_74:real_earn_78, never_unemployed) %>% group_by(never_unemployed) %>% group_modify(~ head(.x, 3))
# A tibble: 6 x 6 # Groups: never_unemployed [2] never_unemployed trt age real_earn_74 real_earn_75 real_earn_78 <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 FALSE 0 45 25862. 0 3925. 2 FALSE 0 46 19171. 1318. 0 3 FALSE 0 49 392. 0 0 4 TRUE 0 45 21517. 25244. 25565. 5 TRUE 0 21 3176. 5853. 13496. 6 TRUE 0 38 23039. 25131. 25565.
5.7.3 Examples - Python
These examples use the cps1.csv
data set.
We begin by loading the packages, importing the csv file, and naming variables.
from pathlib import Path import pandas as pd import numpy as np
cps1_path = Path('..') / 'datasets' / 'cps1.csv' cps1_in = pd.read_csv(cps1_path) cps1_in = ( cps1_in.rename( columns={ 'nodeg': 'no_deg', 're74': 'real_earn_74', 're75': 'real_earn_75', 're78': 'real_earn_78'})) cps = cps1_in.copy(deep=True) print(cps.dtypes)
Unnamed: 0 int64 trt int64 age int64 educ int64 black int64 hisp int64 marr int64 no_deg int64 real_earn_74 float64 real_earn_75 float64 real_earn_78 float64 dtype: object
print(cps.head())
Unnamed: 0 trt age educ ... no_deg real_earn_74 real_earn_75 real_earn_78 0 1 0 45 11 ... 1 21516.670 25243.550 25564.670 1 2 0 21 14 ... 0 3175.971 5852.565 13496.080 2 3 0 38 12 ... 0 23039.020 25130.760 25564.670 3 4 0 48 6 ... 1 24994.370 25243.550 25564.670 4 5 0 18 8 ... 1 1669.295 10727.610 9860.869 [5 rows x 11 columns]
Create an indicator variable to identify someone who had no earnings in any of the years.
Pandas has a set of methods that can be applied to all elements of a data frame (or a series.) A few examples of these functions are;
abs()
,isin()
,isna()
,notna()
,mask()
,where()
,eq()
,ne()
,ge()
,gt()
,le()
, andlt()
. We have already used several of these functions such aswhere()
andisna()
. It is the comparison methods that will be useful for these examples.Pandas also has a set of methods that can be applied across rows or columns of a data frame. These functions take a set of values and return a single value. A few examples of these functions are;
min()
,max()
,mean()
,sum()
,all()
, andany()
. These methods take theaxis
parameter, that tells the method use ('index') row data as the parameters or ('columns') for columns to be used as the parameters. For example,all(axis='index')
will returnTrue
for any columun where all the values areTrue
.This example uses the
eq()
method to test for unemployed and theany()
method to determine if this occured one or more times. The theaxis
parameter is set to1
to find the minimum for each row.cps = ( cps.assign(ever_unemployed = lambda df: df .loc[:, ['real_earn_74', 'real_earn_75', 'real_earn_78']] .eq(0) .any(axis='columns'))) (cps .loc[:, ['ever_unemployed', 'age', 'real_earn_74', 'real_earn_75', 'real_earn_78']] .groupby('ever_unemployed') .head(3) .pipe(print))
ever_unemployed age real_earn_74 real_earn_75 real_earn_78 0 False 45 21516.6700 25243.550 25564.670 1 False 21 3175.9710 5852.565 13496.080 2 False 38 23039.0200 25130.760 25564.670 9 True 45 25862.3200 0.000 3924.842 17 True 46 19171.4300 1317.677 0.000 20 True 49 391.8534 0.000 0.000
The
apply()
method can be used to apply functions that take a set of values and return a single value for each row or column in a data frame. This is useful when you want to apply a function to rows or columns that is not defined by pandas.We will use
apply()
with a lamba function to provide a solution to this example. The columns (or rows) can be matched to different parameters in the function if needed. The use ofapply()
is a flexible approach to operating on all rows (or columns.)cps = ( cps.assign(ever_unemployed = lambda df: df .apply(lambda row: (pd.Series([row['real_earn_74'], row['real_earn_75'], row['real_earn_78']]) .eq(0) .any()), axis='columns') .eq(0))) (cps .loc[:, ['ever_unemployed', 'age', 'real_earn_74', 'real_earn_75', 'real_earn_78']] .groupby('ever_unemployed') .head(3) .pipe(print))
ever_unemployed age real_earn_74 real_earn_75 real_earn_78 0 True 45 21516.6700 25243.550 25564.670 1 True 21 3175.9710 5852.565 13496.080 2 True 38 23039.0200 25130.760 25564.670 9 False 45 25862.3200 0.000 3924.842 17 False 46 19171.4300 1317.677 0.000 20 False 49 391.8534 0.000 0.000
Note that while all the parameters of the function were columns of a data frame in this example, scalar values could also be passed as parameters.
The following alternate implementation uses the
min()
function to identify the minimum salary earned. If the minimum is 0, then they were unemplyed at least one year.cps = ( cps.assign(ever_unemployed = lambda df: df .loc[:, ['real_earn_74', 'real_earn_75', 'real_earn_78']] .min(axis='columns') .eq(0))) (cps .loc[:, ['ever_unemployed', 'age', 'real_earn_74', 'real_earn_75', 'real_earn_78']] .groupby('ever_unemployed') .head(3) .pipe(print))
ever_unemployed age real_earn_74 real_earn_75 real_earn_78 0 False 45 21516.6700 25243.550 25564.670 1 False 21 3175.9710 5852.565 13496.080 2 False 38 23039.0200 25130.760 25564.670 9 True 45 25862.3200 0.000 3924.842 17 True 46 19171.4300 1317.677 0.000 20 True 49 391.8534 0.000 0.000
The following also uses the
min()
function. Themin()
function is done through theapply()
function.cps = ( cps.assign(ever_unemployed = lambda df: df .loc[:, ['real_earn_74', 'real_earn_75', 'real_earn_78']] .apply(lambda row: min(row) == 0, axis='columns'))) (cps .loc[:, ['ever_unemployed', 'age', 'real_earn_74', 'real_earn_75', 'real_earn_78']] .groupby('ever_unemployed') .head(3) .pipe(print))
ever_unemployed age real_earn_74 real_earn_75 real_earn_78 0 False 45 21516.6700 25243.550 25564.670 1 False 21 3175.9710 5852.565 13496.080 2 False 38 23039.0200 25130.760 25564.670 9 True 45 25862.3200 0.000 3924.842 17 True 46 19171.4300 1317.677 0.000 20 True 49 391.8534 0.000 0.000
It is best to use the simplest and clearest approach to achive the needed results. Although the use of mathematical functions to achieve logical test are farily common they may not be the clearest expression of what is being done. It is best to see if a set of logical expressions can be used and are a clearer expression of what is being done.
The more flexible approaches using
apply()
were shown here to demonstrate this important function with a simple examples. It is not meant to imply the use ofapply()
is the preferred solution for this problem.Create an indicator variable to identify someone who was never unemployed in any of the years.
This is a check to see if the earnings for all years is greater than 0.
cps = ( cps.assign(never_unemployed = lambda df: df .loc[:, ['real_earn_74', 'real_earn_75', 'real_earn_78']] .gt(0) .all(axis='columns'))) (cps .loc[:, ['never_unemployed', 'age', 'real_earn_74', 'real_earn_75', 'real_earn_78']] .groupby('never_unemployed') .head(3) .pipe(print))
never_unemployed age real_earn_74 real_earn_75 real_earn_78 0 True 45 21516.6700 25243.550 25564.670 1 True 21 3175.9710 5852.565 13496.080 2 True 38 23039.0200 25130.760 25564.670 9 False 45 25862.3200 0.000 3924.842 17 False 46 19171.4300 1317.677 0.000 20 False 49 391.8534 0.000 0.000
The following implementation uses the
isin()
method. This can be useful when there is more than one possible matching criteria. Here to achieve a test of not in,~
is used to negate the results of the lambda function.cps = ( cps.assign(never_unemployed = lambda df: ~df .loc[:, ['real_earn_74', 'real_earn_75', 'real_earn_78']] .isin([0]) .all(axis='columns'))) (cps .loc[:, ['never_unemployed', 'age', 'real_earn_74', 'real_earn_75', 'real_earn_78']] .groupby('never_unemployed') .head(3) .pipe(print))
never_unemployed age real_earn_74 real_earn_75 real_earn_78 0 True 45 21516.670 25243.550 25564.67 1 True 21 3175.971 5852.565 13496.08 2 True 38 23039.020 25130.760 25564.67 40 False 21 0.000 0.000 0.00 130 False 53 0.000 0.000 0.00 137 False 30 0.000 0.000 0.00
5.7.4 Exercises
These exercises use the Chile.csv
data set.
Import the
Chile.csv
file.Find all rows with a missing value in any column using a related columns method.