5.3 Numeric variables
5.3.1 Data concepts
5.3.1.1 Vectors/sequences
The columns of data frames in the tidyverse and pandas are vectors and sequences respectively. Vectors and sequences are ordered one dimensional structures in which the elements are all the same type.
The order of vectors and sequence is important for data frames. This allows the observations of a data frame to be represented by the rows.
5.3.1.2 Lists
Lists are similar to vectors and sequences in many regards. They are one dimensional and ordered. The difference is that the types of the elements can be different.
One powerful feature of lists is that lists can be nested within lists. As you learn more you may see these structures. This book does not cover them.
List are used by both the tidyverse and pandas. They will be used more frequently going forward in this book.
5.3.1.3 Dictionary
A dictionary is a Python object that is one dimensional with no order. The only method to access data in a dictionary is with a name. There is no numeric index for a dictionary (unless the names are given as numbers.) This structure is used to associate names with objects.
R has no user class that is similar to dictionaries in Python. The environment objects can be used as dictionaries in R. This is an advanced use that will not be covered in this book.
5.3.2 Programming skills - Variables not in a data frame.
Both R and Python allow variables to be copied from a data frame. This can be convenient to do since there are many functions in both R and Python that operate on vectors and lists.
Variables can be manipulated outside of the data frame. The variable can also be put back in the data frame. It can seem convenient to manipulate variables independently of the data frame, but this must be done with care to insure that the observations line up properly when a variable is added back to the data frame. Some of the most difficult problems to debug occur when a variable is manipulated outside of its data frame and the order of its or the data frame's rows are changed before it is put back in the data frame. Manipulating the variables of a data frame without saving them external to the data frame is the best way to avoid these kinds of problems.
5.3.3 Examples - R
These examples use the Forbes2000.csv
data set.
We begin by loading the tidyverse, importing the csv file, and naming the variables.
library(tidyverse)
forbes_path <- file.path("..", "datasets", "Forbes2000.csv") forbes_in <- read_csv(forbes_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
forbes_in <- rename(forbes_in, market_value = marketvalue) forbes <- forbes_in %>% select(-X1) glimpse(forbes)
Observations: 2,000 Variables: 8 $ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15... $ name <chr> "Citigroup", "General Electric", "American Intl G... $ country <chr> "United States", "United States", "United States"... $ category <chr> "Banking", "Conglomerates", "Insurance", "Oil & g... $ sales <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3... $ profits <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7.... $ assets <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ... $ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
Converting a variable to numeric
Numeric variables are sometimes not imported as numeric due to missing value symbols. In other cases numeric values are not imported as numeric due to the inclusion of non-digit charters, such as
$
,,
,%
. When the text that prevented the variable from being parsed as a numeric variable are corrected, such as missing value symbols have been changed toNA
and non-numeric characters taken care of, the variable can then be converted to numeric usingparse_number()
.All of the forbes numeric variables were read as numeric. We will start this example by converting the
sales
variable to character and putting a$
in front using concatenation, thestr_c()
function.forbes <- mutate(forbes, sales = str_c("$", sales)) forbes %>% select(name, country, sales) %>% glimpse()
Observations: 2,000 Variables: 3 $ name <chr> "Citigroup", "General Electric", "American Intl Group"... $ country <chr> "United States", "United States", "United States", "Un... $ sales <chr> "$94.71", "$134.19", "$76.66", "$222.88", "$232.57", "...
Now we can convert the the
sales
column to a numeric variable.The
parse_number()
function converts the sales values to numbers.forbes <- forbes %>% mutate( sales = parse_number(sales) ) forbes %>% select(sales) %>% head()
# A tibble: 6 x 1 sales <dbl> 1 94.7 2 134. 3 76.7 4 223. 5 233. 6 49.0
The base R
as.numeric()
function requires that all characters are numbers. The tidyverseparse_number()
is a much more flexible approach to converting characters to numeric values.This can be seen by using
as.numeric()
to covertsales
with the$
added.sales <- as.numeric(str_c("$", forbes$sales))
Warning: NAs introduced by coercion
head(sales)
[1] NA NA NA NA NA NA
This produces all
NA
s.Variables that are not included as part of a data frame.
In this example we create a
sales
variable that is not part of the forbes data frame.Recall that the tidyverse data frame is a tibble. Subsetting a column of a tibble returns a data frame (tibble) with one column. If the column is to be used as a vector, one does a
pull()
on the column. Base R data frames will return a vector if a single column is subset from a data frame. A variable that has been removed from a data frame or tibble is a vector.sales <- forbes %>% pull(sales) class(sales)
[1] "numeric"
Pulling vectors from data frames can be useful when exploring data. For example, if you wanted a summary of the values of only the
sales
variable, one could do the following.forbes %>% pull(sales) %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max. 0.010 2.018 4.365 9.697 9.547 256.330
Note, a list can be added as a column to a data frame. Many data frame methods and functions will not operate correctly with a list-based column. There are times when a list-based variable allows for some very powerful programming. This use of lists as columns will not be covered in this book.
The
sales
vector can be converted to a list.sales <- as.list(sales) head(sales, 4)
[[1]] [1] 94.71 [[2]] [1] 134.19 [[3]] [1] 76.66 [[4]] [1] 222.88
class(sales)
[1] "list"
Creating numeric variables.
Numeric variables result from any expresion made up of functions/methods and operators that return a numeric value for each row.
Here we calculate the price-earnings (pe) ratio for the companies three different ways. The first uses only the divide operator. The second uses
if_else()
to addNA
for non-positive profits. The third uses the base Rif_else()
to returnno earnings
when profits are non-positive. The result of the third pe calculation is of type character.forbes <- forbes %>% mutate( pe = market_value / profits, pe2 = if_else(profits > 0, market_value / profits, NA_real_), pe3 = if_else(profits > 0, as.character(market_value / profits), "no earnings") ) filter(forbes, profits <= 0) %>% select(name, profits, pe, pe2, pe3) %>% head()
# A tibble: 6 x 5 name profits pe pe2 pe3 <chr> <dbl> <dbl> <dbl> <chr> 1 Allianz Worldwide -1.23 -39.1 NA no earnings 2 Vodafone -15.5 -11.3 NA no earnings 3 Deutsche Telekom -25.8 -3.26 NA no earnings 4 Credit Suisse Group -2.4 -18.4 NA no earnings 5 France Telecom -21.8 -2.96 NA no earnings 6 Generali Group -0.79 -44.4 NA no earnings
5.3.4 Examples - Python
These examples use the Forbes2000.csv
data set.
We begin by loading the packages, importing the csv file, and naming the variables.
from pathlib import Path import pandas as pd import numpy as np
forbes_path = Path('..') / 'datasets' / 'Forbes2000.csv' forbes_in = pd.read_csv(forbes_path) forbes_in = ( forbes_in .rename(columns={'marketvalue': 'market_value'})) forbes = forbes_in.copy(deep=True) print(forbes.dtypes)
Unnamed: 0 int64 rank int64 name object country object category object sales float64 profits float64 assets float64 market_value float64 dtype: object
Converting a variable to numeric
Numeric variables are sometimes not imported as numeric due to missing value symbols. In other cases numeric values are not imported as numeric due to the inclusion of non-digit charters, such as
$
,,
,%
. When the missing value symbols have been changed tonp.NaN
and the other non-numeric characters taken care of, the variable can then be converted to numeric.All of the forbes numeric variables were read as numeric. We will start this example by converting the
sales
variable to character and putting a$
in front using concatenation.forbes.sales = '$' + forbes.sales.astype(str) (forbes .loc[:, ['name', 'country', 'sales']] .head(5) .pipe(print))
name country sales 0 Citigroup United States $94.71 1 General Electric United States $134.19 2 American Intl Group United States $76.66 3 ExxonMobil United States $222.88 4 BP United Kingdom $232.57
Now we can convert the the
sales
column to a numeric variable.The
to_numeric()
pandas function parses string variables and produces a numeric variable.forbes = ( forbes .assign( sales=lambda df: df.sales.str.extract('\$(.+)', expand=False)) .assign( sales=lambda df: pd.to_numeric(df.sales, errors='coerce'))) (forbes .loc[:, ['name', 'country', 'sales']] .head(5) .pipe(print))
name country sales 0 Citigroup United States 94.71 1 General Electric United States 134.19 2 American Intl Group United States 76.66 3 ExxonMobil United States 222.88 4 BP United Kingdom 232.57
print(forbes.sales.dtype)
float64
The pandas methods
astype('float64')
andastype('int64')
are not as flexible asto_numeric()
in converting from strings to numeric, but you might see them in exmaples on the web.Variables that are not included as part of a data frame.
In this example we create a
sales
variable that is not part of the forbes data frame. The type of the stand-alone variable is a series.sales = forbes['sales'] print(type(sales))
<class 'pandas.core.series.Series'>
The
sales
series can be converted to a list. The list is created using the[]
operator. Thelist()
function could also be used.sales = [forbes['sales']] print(type(sales))
<class 'list'>
The
sales
series can be converted to annp
array object using thevalues
attribute.sales = forbes['sales'].values print(type(sales))
<class 'numpy.ndarray'>
Creating numeric variables.
Numeric variables result from any expresion made up of functions/methods and operators that return a numeric value for each row.
Here we calculate the price-earnings (pe) ratio for the companies three different ways. The first uses only the divide operator. The second uses
np.where()
to addnp.NaN
for non-positive profits. The third usesnp.where()
to add'no earnings'
for non-positive profits. The result of the third pe calculation is of type string.forbes = ( forbes .assign( pe=lambda df: df['market_value'] / df['profits']) .assign( pe2=lambda df: np.where(df['pe'] < 0, np.NAN, df['pe']), pe3=lambda df: np.where( df['pe'] < 0, 'no earnings', df['pe']))) (forbes .loc[:, ['name', 'profits', 'pe', 'pe2', 'pe3']] .query('profits <= 0') .head(5) .pipe(print))
name profits pe pe2 pe3 349 Allianz Worldwide -1.23 -39.081301 NaN no earnings 353 Vodafone -15.51 -11.257898 NaN no earnings 363 Deutsche Telekom -25.83 -3.259001 NaN no earnings 371 Credit Suisse Group -2.40 -18.387500 NaN no earnings 373 France Telecom -21.78 -2.955005 NaN no earnings
The calculation for the
pe
variable,df['market_value'] / df['profits']
, is the result of a mathematical operation on series that have been subset from the data frame. These kinds of mathematical operations preserve the order of the series.This example could also have been done using a series outside of the data frame and subsetting.
pe = forbes['market_value'] / forbes['profits'] forbes['pe'] = pe forbes['pe2'] = np.where(pe < 0, np.NAN, pe) forbes['pe3'] = np.where(pe < 0, 'no earnings', pe) (forbes .loc[:, ['name', 'profits', 'pe', 'pe2', 'pe3']] .query('profits <= 0') .head(5) .pipe(print))
name profits pe pe2 pe3 349 Allianz Worldwide -1.23 -39.081301 NaN no earnings 353 Vodafone -15.51 -11.257898 NaN no earnings 363 Deutsche Telekom -25.83 -3.259001 NaN no earnings 371 Credit Suisse Group -2.40 -18.387500 NaN no earnings 373 France Telecom -21.78 -2.955005 NaN no earnings
The results are the same as with method chaining. This approach is a common coding stlye and results in code that executes quickly and is still readable. Here the
pe
series, which is outside of the data frame, is created and then inserted into the data frame almost immediately. There is no opportunity for the order of either the series or the data frame to change. If the creation of the series and the insertion into the data frame do not occur right next to each other as in this example, the data frame that is being worked on may be manipulated in some way that changes the order. Then the series, that is outside the data frame, is not aligned with the observations of the data frame. These kinds of error can go undetected for a long time and are difficult to find. Theassign()
andlambda
approach protects the relationship of the observations, at the cost of a little extra typing.
5.3.5 Exercises
These exercises use the mtcars.csv
data set.
Import the
mtcars.csv
data set.The
wt
variable is measured in thousands of pounds. Change this variable to a character variable that has a comma separating the thousands digit from the hundreds digit, e.g. 2.14 becomes 2,140.Hint, for one of the possible solution you may find it useful to look for a string function/method that will pad. Padding adds characters to a fixed width.
Convert the character variable you created in the prior exercise to a new numeric variable. Make the units of measure for this new variable 1,000 of pounds.