{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine learning II\n",
"\n",
"files needed = ('auto.csv')\n",
"\n",
"This book continues our foray into machine learning. Our goals here are modest. We would like to\n",
"1. learn a bit about how machine learning is similar to, and different from, econometrics. \n",
"2. introduce the scikit-learn package which is chock full of 'machine learning' tools. \n",
"3. work on some *validation* methods, which are an important part of the machine learning toolkit. \n",
"\n",
"In this notebook, we continue to discuss validation methods.\n",
"\n",
"This notebook is loosely based on Chapters 2 and 5 from *An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani. This is an easy to follow introduction that is light on the mathematics behind the methods."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cross validation\n",
"\n",
"In the last workbook, we studied *set validation*. \n",
"\n",
"1. We break our dataset up into two subsets. Call one the *training set* and the other the *testing set*. \n",
"2. We estimate our model on the training data. We measure how well our model fits the data by looking at the mean squared error of the regression or, alternatively, the r-squared, which normalizes the mse by the total squared error. We call these the **training mse or the training $r^2$**. This is usually what we care about when we are doing inference. \n",
"3. Using the X data in the testing data, we use the estimated $f(\\;)$ from step 2 to predict value for y\n",
"$$\\hat{y} = \\hat{f}(X_{test}).$$\n",
"4. We then compare the predicted ($\\hat{y}$) values with the actual ($y$) values in the training data. Again, we can use mse or r-squared to judge how well our predicted data matched the actual data. Call this the **test mse or test $r^2$**. This is what we usually care about when we are interested in prediction. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the last notebook we repeatedly performed set validation on a linear model. We found that the reported error measures varied considerably. This is because we were using a different 75% of the observations to estimate the model each time. As we changed the training data set, the estimates changed. \n",
"\n",
"A related drawback is that we are only using 75% of the observations to estimate the model each time. Typically, the more observations we use, the better our estimates will be, which will lead to better prediction. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Leave one out cross-validation (loocv)\n",
"\n",
"Leave one out cross-validation (loocv) addresses the concerns with set validation. The loocv algorithm is \n",
"\n",
"For each observation in the data $i = 0,...n$\n",
"1. The test data is observation $i$\n",
"2. The training data is all the data **except** observation $i$\n",
"3. Estimate the model on the training data\n",
"4. Evaluate the model on the testing data, this generates a mean squared error associated with 'round' $i$\n",
"\n",
"The test mse is the average over the $n$ mses computed:\n",
"$$ CV_{(n)} = \\frac{1}{n}\\sum_{i=1}^nmse_{i}$$\n",
"\n",
"Notice that this is similar in spirit to what we were doing last class. The difference is that we are doing are repeating the estimate exactly $n$ times and that the split into test and training sets is not random. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try it out. Rather than write code to implement the algorithm above, scikit will do it for us. \n",
"\n",
"First, load up the auto data again. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# There are a few missing values coded as '?'. \n",
"auto = pd.read_csv('auto.csv', na_values='?')\n",
"auto = auto.dropna()\n",
"print(auto.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will need the `linear_model` package to run the regression and the `cross_val_score` function from scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# From scikit-learn import the cross_val_score function\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# From scikit-learn, peel off the linear_model package\n",
"from sklearn import linear_model "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `cross_val_score()` function [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) takes as arguments an `estimator`, in our case ols, the `x data`, the `y data`, a `scoring` criterion, and the number of splits to make to the data, `cv`. \n",
"\n",
"Two comments:\n",
"\n",
"1. The r-squared is not a good metric of fit (score) in this case, because we are testing on a single observation. The r-squared will be zero always. Instead, we will use the mean squared error. In scikit learn, the **negative** of the mean squared error is returned. This is simply a convention: The package creators want all the scoring criterion to be increasing in fit. (e.g., a larger r-squared is better, a larger (closer to zero) negative mse is better).\n",
"2. We set the `cv` parameter to the number of observations in the data. We want to split the data into $n$ chunks (each chunk with one observation) and leave one chunk out each iteration. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# estimator = LinearRegression, # X_data = auto[['horsepower', 'weight']], y_data = auto['mpg']\n",
"\n",
"scores = cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'weight']], auto['mpg'],\n",
" scoring='neg_mean_squared_error', cv=len(auto['mpg']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That was easy! The `cross_val_score` function returns an array with the negative mse for each iteration. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Scores is of type:', type(scores))\n",
"print('Length of scores:', len(scores), '\\n')\n",
"print(scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compute the test mean squared error, \n",
"$$ CV_{(n)} = \\frac{1}{n}\\sum_{i=1}^nmse_{i}$$\n",
"we just compute the mean of scores. I will multiply by -1, so that we have the mse."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_mse = scores.mean()\n",
"print('The test mse is {0:5.3f}'.format(-test_mse))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### k-fold cross-validation\n",
"\n",
"k-fold cross-validation is a generalization of loocv. An issue with loocv is that we need to estimate the model $n$ times, once for each observation. If there is a lot of data, or if the model is hard to estimate, this approach is may be computationally intensive. \n",
"\n",
"The solution to this is to break the data up into k sets (or folds) of length n/k, each set containing n/k randomly assigned observations. We then proceed as in loocv, but estimating the model only $i=1,...,k$ times and leaving out set $i$ as the test set. This [figure](http://www-bcf.usc.edu/~gareth/ISL/Chapter5/5.5.pdf) from James, et al. summarizes it nicely. \n",
"\n",
"In this procedure, the test mse is \n",
"$$ CV_{(k)} = \\frac{1}{k}\\sum_{i=1}^k mse_{i}$$\n",
"\n",
"Clearly, loocv is just k-fold cv with $k=n$. \n",
"\n",
"How big is $k$? In practice, people tend to use 5 or 10. \n",
"\n",
"\\[There are additional reasons dealing with bias and variance to use $k