5.4 Factors and Indicators
5.4.1 Data concepts
5.4.1.1 Indicator variables
An indicator variable is a categorical variable that has exactly two levels. Logical variables are one example of indicator variables.
These are an important class of variables because many analyses require a factor variable to be converted to a set of indicator variables. Indicator variables often, but not always, use the values 0 and 1 for the two levels.
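The conversion of a factor variable to a set of indicator variables can be sketched as follows. This is a minimal illustration in Python with pandas, not a prescribed method; the `sector` series and its values are made up for the example.

```python
import pandas as pd

# Hypothetical three-level factor; the values are invented for illustration.
sector = pd.Series(["Banking", "Insurance", "Banking", "Utilities"], name="sector")

# One 0/1 indicator per level: 1 where the observation has that level, else 0.
indicators = pd.DataFrame(
    {level: (sector == level).astype(int) for level in sorted(sector.unique())}
)
print(indicators)
```

Each row has exactly one indicator set to 1, because each observation belongs to exactly one level of the factor.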
5.4.1.2 Factor variables from numeric variables
Numeric variables can be converted to a factor variable by collapsing values that fall within a set of intervals. This is a form of data reduction: the variation within each interval is discarded, so it typically does not improve an analysis. On the other hand, converting a numeric variable to a factor can sometimes make it much easier to see patterns in data during exploration. For example, converting a numeric variable to high, medium, and low intervals allows the variable to be used in facets to see if there are visual differences in a plot.
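As a rough sketch of this idea, the Python code below collapses a numeric column into low, medium, and high intervals and then uses the result to group another variable. The data frame, column names, and break points are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: the column names and values are invented for illustration.
df = pd.DataFrame({
    "profits": [0.02, 0.05, 0.20, 0.35, 1.50, 4.00],
    "sales":   [1.0, 2.0, 5.0, 6.0, 20.0, 45.0],
})

# Collapse profits into three labelled intervals (the break points are assumptions).
df["profit_band"] = pd.cut(
    df["profits"],
    bins=[-float("inf"), 0.1, 1.0, float("inf")],
    labels=["low", "medium", "high"],
)

# The banded variable can now be used to group (or facet) other variables.
band_means = df.groupby("profit_band", observed=True)["sales"].mean()
print(band_means)
```

Six distinct profit values are reduced to three levels, which is the information loss described above, but the grouped summary is easier to read at a glance.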
5.4.2 Examples - R
These examples use the Forbes2000.csv data set.
We begin by loading the tidyverse, importing the csv file, and naming variables.
library(tidyverse)
forbes_path <- file.path("..", "datasets", "Forbes2000.csv")
forbes_in <- read_csv(forbes_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
forbes_in <- rename(forbes_in, market_value = marketvalue)
forbes <- forbes_in %>%
  select(-X1)
glimpse(forbes)
Observations: 2,000
Variables: 8
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <chr> "Banking", "Conglomerates", "Insurance", "Oil & g...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
Make the category variable a factor variable.
The factor() function can be used to convert variables to factor variables. This is a base R function and it works well with the tidyverse.
forbes <- forbes %>%
  mutate(
    category = factor(category)
  )
glimpse(forbes)
Observations: 2,000
Variables: 8
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
Factor category using parse_factor().
The tidyverse function parse_factor() will convert a variable to a factor variable. This tidyverse function does additional checks that are not done in the base R function factor(). For example, parse_factor() produces warnings for values that do not match the expected levels.
We start the example by creating the set of levels to use in creating the factor variable.
forbes <- mutate(forbes, category = as.character(category))
category_lev <- forbes %>%
  select(category) %>%
  distinct(category) %>%
  arrange(category) %>%
  pull()
head(category_lev)
[1] "Aerospace & defense"          "Banking"
[3] "Business services & supplies" "Capital goods"
[5] "Chemicals"                    "Conglomerates"
The levels are now used to create the factor variable.
forbes <- forbes %>%
  mutate(
    category = parse_factor(category, levels = category_lev)
  )
glimpse(forbes)
Observations: 2,000
Variables: 8
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
Create a factor variable from a numeric variable.
The cut() function from base R provides a means to specify a flexible set of interval ranges. The intervals are specified as a set of break points that will be used as lower and upper end points. The names of the intervals can be set using the labels parameter.
forbes <- forbes %>%
  mutate(
    profit_lev = cut(profits,
                     breaks = c(-Inf, .08, .44, 10, Inf),
                     labels = c("low", "mid", "high", "very high")
    )
  )
glimpse(forbes)
Observations: 2,000
Variables: 9
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
$ profit_lev   <fct> very high, very high, high, very high, very high,...
The tidyverse (in the ggplot2 package) has the cut_interval(), cut_number(), and cut_width() functions. These functions divide the values into equal segments, measured by either width or count of observations.
Create an indicator variable to identify NAFTA countries.
The %in% operator is used to determine whether each value on the left is in the set of values on the right.
forbes <- forbes %>%
  mutate(
    nafta = country %in% c("United States", "Canada", "Mexico")
  )
glimpse(forbes)
Observations: 2,000
Variables: 10
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
$ profit_lev   <fct> very high, very high, high, very high, very high,...
$ nafta        <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE...
5.4.3 Examples - Python
These examples use the Forbes2000.csv data set.
We begin by loading the packages, importing the csv file, and naming the variables.
from pathlib import Path
import pandas as pd
import numpy as np
forbes_path = Path('..') / 'datasets' / 'Forbes2000.csv'
forbes_in = pd.read_csv(forbes_path)
forbes_in = (
    forbes_in
        .rename(columns={'marketvalue': 'market_value'}))
forbes = forbes_in.copy(deep=True)
print(forbes.dtypes)
Unnamed: 0        int64
rank              int64
name             object
country          object
category         object
sales           float64
profits         float64
assets          float64
market_value    float64
dtype: object
Make the category variable a factor variable.
The .astype('category') method can be used to convert variables to category variables.
forbes = forbes.assign(category=lambda df: df['category'].astype('category'))
print(forbes['category'].head())
0                 Banking
1           Conglomerates
2               Insurance
3    Oil & gas operations
4    Oil & gas operations
Name: category, dtype: category
Categories (27, object): [Aerospace & defense, Banking, Business services & supplies, Capital goods, ..., Telecommunications services, Trading companies, Transportation, Utilities]
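To look inside a category variable like the one created above, the `.cat` accessor exposes the levels and the integer codes stored for each observation. The sketch below uses a small made-up series rather than the Forbes data.

```python
import pandas as pd

# Hypothetical values standing in for a text column.
s = pd.Series(["Banking", "Utilities", "Banking", "Media"]).astype("category")

# The levels; for strings they are sorted alphabetically by default.
levels = list(s.cat.categories)

# The integer code stored for each observation (its position in the levels).
codes = s.cat.codes.tolist()

print(levels)
print(codes)
```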
Factor category using pd.Categorical().
The pandas function pd.Categorical() will convert a variable to a category variable. This pandas function does additional checks that are not done in .astype('category'). For example, pd.Categorical() produces np.nan (a missing value) for values that do not match the expected levels.
We start the example by creating the set of levels to use in creating the factor variable.
The unique() method returns an np.ndarray object that needs to be converted to a pandas object to use pandas methods.
forbes = forbes_in
category_lev = pd.Series(forbes['category'].unique()).sort_values()
print(category_lev.head())
19             Aerospace & defense
0                          Banking
22    Business services & supplies
21                   Capital goods
18                       Chemicals
dtype: object
The levels are now used to create the factor variable.
forbes = (
    forbes
        .assign(
            category=lambda df: pd.Categorical(df['category'], categories=category_lev)))
print(forbes['category'].head())
0                 Banking
1           Conglomerates
2               Insurance
3    Oil & gas operations
4    Oil & gas operations
Name: category, dtype: category
Categories (27, object): [Aerospace & defense, Banking, Business services & supplies, Capital goods, ..., Telecommunications services, Trading companies, Transportation, Utilities]
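In the Forbes data every value matches a level, so the handling of unmatched values is not visible. The sketch below shows it on a small made-up series with a deliberate misspelling; the names `raw` and `levels` are hypothetical.

```python
import pandas as pd

# Hypothetical data; "Bankng" is a deliberate misspelling.
raw = pd.Series(["Banking", "Bankng", "Utilities"])
levels = ["Banking", "Utilities"]

cat = pd.Series(pd.Categorical(raw, categories=levels))

# Values outside the supplied levels become missing; pandas does not warn.
n_unmatched = int(cat.isna().sum())
unmatched_values = raw[cat.isna()].tolist()
print(n_unmatched, unmatched_values)
```

Because no warning is raised, counting the missing values after the conversion is one simple way to catch unexpected levels.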
The pd.Categorical() constructor has an ordered parameter that can be set to True for an ordered categorical variable.
Create a factor variable from a numeric variable.
The pd.cut() function from pandas provides a means to specify a flexible set of interval ranges. The intervals can be specified as a set of break points that will be used as lower and upper end points. The names of the intervals can be set using the labels parameter.
forbes = (
    forbes
        .assign(
            profit_lev=lambda df: pd.cut(
                df['profits'],
                bins=[-np.inf, .08, .44, 10, np.inf],
                labels=['low', 'mid', 'high', 'very high'])))
print(forbes['profit_lev'].head())
0    very high
1    very high
2         high
3    very high
4    very high
Name: profit_lev, dtype: category
Categories (4, object): [low < mid < high < very high]
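When explicit break points are not needed, pandas can also divide values into equal segments, measured by either width or count of observations. The sketch below contrasts the two on a small made-up series: pd.cut() with an integer bins argument gives equal-width intervals, and pd.qcut() gives intervals with roughly equal counts.

```python
import pandas as pd

# Hypothetical values, invented for illustration; 100 is an outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins: an integer `bins` splits the range into that many intervals.
by_width = pd.cut(values, bins=4)

# Equal-count bins: qcut splits at quantiles so each bin holds a similar number.
by_count = pd.qcut(values, q=4)

# Count the observations that fall in each interval.
print(by_width.value_counts().sort_index())
print(by_count.value_counts().sort_index())
```

With a skewed variable the equal-width bins put nearly all observations in the first interval, while the quantile-based bins stay balanced.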
Create an indicator variable to identify NAFTA countries.
The isin() method is used to determine if the values of the object are in the list provided as a parameter to isin().
forbes = (
    forbes
        .assign(
            nafta=lambda df: df['country']
                .isin(['United States', 'Canada', 'Mexico'])))
print(forbes['nafta'].head())
0     True
1     True
2     True
3     True
4    False
Name: nafta, dtype: bool
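A boolean indicator like this one can be recoded to the values 0 and 1 if an analysis expects numeric indicators. This is a minimal sketch on a made-up `country` series, not the Forbes data.

```python
import pandas as pd

# Hypothetical country column, invented for illustration.
country = pd.Series(["United States", "Japan", "Canada", "Germany"])

# Boolean indicator built with isin().
nafta = country.isin(["United States", "Canada", "Mexico"])

# The same indicator coded 0/1; its mean is the share of True values.
nafta01 = nafta.astype(int)
print(nafta01.tolist(), nafta01.mean())
```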
Create indicator variables from category variables.
Pandas provides the get_dummies() function to convert categorical variables to sets of indicator variables.
forbes_dum = pd.get_dummies(forbes, columns=['category'])
print(forbes_dum.dtypes)
Unnamed: 0                                 int64
rank                                       int64
name                                       object
country                                    object
sales                                      float64
profits                                    float64
assets                                     float64
market_value                               float64
profit_lev                                 category
nafta                                      bool
category_Aerospace & defense               uint8
category_Banking                           uint8
category_Business services & supplies      uint8
category_Capital goods                     uint8
category_Chemicals                         uint8
category_Conglomerates                     uint8
category_Construction                      uint8
category_Consumer durables                 uint8
category_Diversified financials            uint8
category_Drugs & biotechnology             uint8
category_Food drink & tobacco              uint8
category_Food markets                      uint8
category_Health care equipment & services  uint8
category_Hotels restaurants & leisure      uint8
category_Household & personal products     uint8
category_Insurance                         uint8
category_Materials                         uint8
category_Media                             uint8
category_Oil & gas operations              uint8
category_Retailing                         uint8
category_Semiconductors                    uint8
category_Software & services               uint8
category_Technology hardware & equipment   uint8
category_Telecommunications services       uint8
category_Trading companies                 uint8
category_Transportation                    uint8
category_Utilities                         uint8
dtype: object
5.4.4 Exercises
These exercises use the mtcars.csv data set.
Import the mtcars.csv data set.
Factor the cyl, gear, and carb variables.
Create a variable that identifies the observations that are in the top 25 percent of miles per gallon. Display a few of these vehicles.
Hint: you will need to find a function to identify the percentage points of a variable.
Create a variable that bins the values of hp using the following amounts of hp: 100, 170, 240, and 300.