{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine learning III\n", "\n", "files needed = ('Hitters.csv')\n", "\n", "This book continues our foray into machine learning. Our goals here are modest. We would like to\n", "1. learn a bit about how machine learning is similar to, and different from, econometrics. \n", "2. introduce the scikit-learn package which is chock full of 'machine learning' tools. \n", "3. work on some *validation* methods, which are an important part of the machine learning toolkit. \n", "4. explore the ridge and lasso regression models\n", "\n", "In this notebook, we study the ridge and lasso regressions. These are methods that put discipline on the importance of the independent variables of the regression.\n", "\n", "This notebook is loosely based on Chapter 6 from *An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani. This is an easy to follow introduction that is light on the mathematics behind the methods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Econometrics v. Statistical learning\n", "This is overly broad and general, but hopefully helpful. Consider the model\n", "\n", "$$y = f(X) + \\epsilon.$$\n", "\n", "### Econometrics\n", "Econometrics is mainly concerned with *inference*. By this, we mean that the goal is to understand the structure of $f(\\;)$. Econometricians are concerned about the 'true' value of the components of $f(\\;)$. We worry a lot about endogeneity, omitted variables, and whether the properties of $f(\\;)$ are consistent with the theory. Practically, \n", "\n", "1. The $X$ variables included in the model are guided by theory.\n", "2. The focus is on in-sample fit. How well does the model fit the data?\n", "\n", "### Statistical (machine) learning\n", "Statistical learning is mainly concerned with *prediction*. 
By this, we mean the ability of the model to predict the values of $y$, given $X$, for data that are not used to estimate (or, in machine learning-ese, train) the model of $f(\\;)$. The guiding principle here is the *bias-variance tradeoff* (more on this in a minute). Practically, \n", "\n", "1. The $X$ variables included in the model are guided by the bias-variance tradeoff. \n", "2. The focus is on out-of-sample fit. How well does the model predict data not used to estimate the model?\n", "\n", "The bias-variance tradeoff exists in econometrics, too. There, theory typically disciplines our definition of $X$.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluating model predictions\n", "\n", "The fundamental constraint in machine learning is the *bias-variance tradeoff*. Roughly speaking, there is no free lunch. \n", "\n", "Let's start with an example. Suppose we wanted to predict the number of people on the union terrace on a Friday. For 5 Fridays, we send out a team to count the number of people on the terrace (the $y$ variable). We also record the temperature, the price of beer, whether it is a home football weekend, the value of the stock market, the number of sailboats on Mendota, and the euro-dollar exchange rate (the $X$ variables). \n", "\n", "We use the data to estimate our model $y=f(X)+\\epsilon$ and use the $X$ data for a 6th Friday to predict the number of people on the terrace, $\\hat{y}$. We then evaluate our estimate by comparing our prediction to the actual number of people on the terrace on the 6th Friday, $(y - \\hat{y})^2$. This is a measure of how well our model works at predicting **out of sample**. \n", "\n", "Now, suppose we repeat this experiment $M$ times. We collect $M$ data sets, estimate the model $M$ times, and predict the 6th Friday $M$ times. 
We can form the expected test mse as\n", "\n", "$$\\frac{1}{M}\\sum_{m=1}^M(y_m-\\hat{y}_m)^2$$\n", "\n", "It is straightforward to show that this expression can be decomposed into \n", "\n", "\\begin{align*}\n", "\\frac{1}{M}\\sum_{m=1}^M(y_m-\\hat{y}_m)^2 & = \\left(E\\left[\\,\\hat{f}(X)\\right]-f(X) \\right)^2 + E\\,\\left(\\,\\hat{f}(X)-E\\left[\\,\\hat{f}(X)\\right]\\right)^2 + \\text{var}(\\epsilon)\\\\\n", " &= \\text{bias}^2 + \\text{variance} + \\text{var}(\\epsilon).\\\\\n", "\\end{align*}\n", "\n", "Note that all three terms are nonnegative. We can't do anything about the third term; it is the *irreducible* error. The other two terms we would like to make as small as possible. Unfortunately, shrinking one of the two terms tends to increase the other term. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bias and variance\n", "The **bias term** tells us, on average, how close we get to the true model. Are we systematically over- or under-predicting $y$?\n", "\n", "The **variance term** tells us how much our estimates vary as we use different training datasets to estimate the model.\n", "\n", "Quick check: Think about shooting arrows at a target. What does a low-variance, high-bias attempt look like? What does a low-bias, low-variance attempt look like?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The bias-variance tradeoff\n", "\n", "We would like to minimize the bias and the variance of the test mse. How can we do so? \n", "\n", "In the linear models we are considering, complexity increases as we add more variables to $X$. This includes adding polynomials of our independent variables, interactions, etc. How does complexity influence the testing error? \n", "\n", "* As we **increase the complexity** of our model $f(\\;)$ the **squared bias tends to decrease**. \n", "\n", "* As we **increase the complexity** of our model $f(\\;)$ the **variance tends to increase**.\n", "\n", "This is the tradeoff. 
As we add features to the model, the bias decreases, but the variance increases. This gives rise to a u-shaped mse. This [figure](http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.12.pdf) from James et al. is a good illustration. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overfitting\n", "\n", "Behind the bias-variance tradeoff is the idea of overfitting. The more complex the model, the more it will capture variation in $y$ due to the random error ($\\epsilon$). \n", "\n", "* This makes the model fit the data better (lower bias). We are capturing $y$ behavior from both $f(\\;)$ and $\\epsilon$.\n", "* This makes the model more variable. A new set of training data will have different $\\epsilon$'s. The estimate will change to match these new values of $\\epsilon$. Since $\\epsilon$ is not related to $f(\\;)$, we are making the estimate noisier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shrinkage methods\n", "The bias-variance tradeoff says that we want to constrain our model's complexity. There are many, many, many ways to go about this. For linear models, two common and easy to grasp methods are the ridge and the lasso regression. \n", "\n", "Let's see how they work. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # data handling\n", "import numpy as np # numerical methods\n", "import matplotlib.pyplot as plt # plots" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load data on baseball players. Each row is a player. The variable we would like to predict is salary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base = pd.read_csv('Hitters.csv')\n", "print(base.info())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data look okay, but we only have salary for 263 observations. 
Let's drop the rows with missing values. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base = base.dropna()\n", "base.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OLS\n", "\n", "Let's start with ols to get a feel for things. Start by loading some packages." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import linear_model # ols, ridge, lasso, \n", "from sklearn.preprocessing import StandardScaler # normalize variables to have mu=0, std=1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's choose some variables that are potentially useful for predicting salary. We are purposely making this set large. The goal is to determine how to constrain our choices. \n", "\n", "The ridge regression works best if the X variables are on the same scale. The StandardScaler() normalizes the variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var_list = ['Hits', 'RBI', 'HmRun', 'Walks', 'Errors', 'Years', 'Assists', 'AtBat', 'Runs', 'CAtBat', 'CHits', 'CRuns', 'CWalks']\n", "\n", "# Standardize the X vars so they have mean = 0 and std = 1\n", "X = StandardScaler().fit_transform(base[var_list])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Estimate the OLS regression. Don't worry about the l2 norm stuff yet. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_ols = linear_model.LinearRegression().fit(X, base['Salary'])\n", "coef_norm_ols = np.linalg.norm(res_ols.coef_, ord=2)\n", "\n", "print(res_ols.coef_)\n", "print('The l2 norm of the ols coefficients is {0:5.1f}.'.format(coef_norm_ols))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The ridge regression\n", "\n", "The ridge regression chooses $\\beta$ to minimize the residual sum of squares plus a penalty function \n", "\n", "\n", "\\begin{align*}\n", "=& \\text{RSS}+ \\alpha \\sum_{j=1}^p \\beta_j^2\\\\\n", "=&\\sum_{i=1}^n(y_i-\\hat{y}_i)^2 + \\alpha \\sum_{j=1}^p \\beta_j^2\n", "\\end{align*}\n", "\n", "\n", "OLS minimizes RSS, so if $\\alpha=0$ ridge collapses to OLS. We call $\\alpha$ the tuning parameter. When $\\alpha>0$, models are penalized for how large their coefficients are. The term multiplying $\\alpha$ is the squared l2 norm of the coefficient vector. \n", "\n", "The Ridge() function is part of linear_model in scikit-learn [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's estimate the model with $\\alpha=0$. This should return the ols estimate. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_ridge_0 = linear_model.Ridge(alpha = 0.0).fit(X, base['Salary'])\n", "coef_norm_r0 = np.linalg.norm(res_ridge_0.coef_, ord=2)\n", "\n", "print(res_ridge_0.coef_)\n", "print('The l2 norm of the ridge(a=0) coefficients is {0:5.1f}.'.format(coef_norm_r0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now estimate the ridge model with $\\alpha=100$. This adds a penalty to each coefficient that is not zero. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_ridge_100 = linear_model.Ridge(alpha = 100.0).fit(X, base['Salary'])\n", "coef_norm_r100 = np.linalg.norm(res_ridge_100.coef_, ord=2)\n", "\n", "print(res_ridge_100.coef_)\n", "print('The l2 norm of the ridge(a=100) coefficients is {0:5.1f}.'.format(coef_norm_r100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That decreased the norm of the coefficients quite a bit. We can keep going... " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_ridge_800 = linear_model.Ridge(alpha = 800.0).fit(X, base['Salary'])\n", "coef_norm_r800 = np.linalg.norm(res_ridge_800.coef_, ord=2)\n", "\n", "print(res_ridge_800.coef_)\n", "print('The l2 norm of the ridge(a=800) coefficients is {0:5.1f}.'.format(coef_norm_r800))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({'var': var_list, 'ols': res_ols.coef_ , 'ridge 100':res_ridge_100.coef_, 'ridge 800':res_ridge_800.coef_ })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What size penalty? \n", "\n", "We see that increasing $\\alpha$ decreases the norm of the coefficients. How big should $\\alpha$ be? We want the $\\alpha$ that gives us the best test mse. As we discussed last week, cross-validation methods allow us to evaluate test mses.\n", "\n", "The scikit-learn package has a method that combines the ridge estimation with cross validation. It is called RidgeCV() [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We pass to RidgeCV() a list of the alpha values to try. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alpha_grid = [1e-3, 1e-2, 1e-1, 1, 1e2, 1e3, 1e4]\n", "\n", "# Setting 'store_cv_values' to true will hang on to all the test mses from the CV. Otherwise, it only keeps the best one.\n", "model = linear_model.RidgeCV(alphas=alpha_grid, store_cv_values = True).fit(X,base['Salary'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function returns an object with useful attributes and methods. Let's look at a few." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The .alpha_ holds the best alpha\n", "print('The best alpha from the candidate alphas is {0}.'.format(model.alpha_))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_coef_ridge = pd.DataFrame({'var':var_list, 'coef': model.coef_}) \n", "print(best_coef_ridge)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Since I set 'store_cv_values' to true, I have the matrix of all the test mse. Columns correspond to alpha values, \n", "# and there is one row for each observation, since the function uses loocv\n", "\n", "print(model.cv_values_)\n", "print(model.cv_values_.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The mean test mse for each value of alpha\n", "model.cv_values_.mean(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The lasso regression\n", "The lasso regression works like the ridge, but with a different penalty function. \n", "\n", "\n", "\\begin{align*}\n", "=& \\text{RSS}+ \\alpha \\sum_{j=1}^p | \\beta_j|\\\\\n", "=&\\sum_{i=1}^n(y_i-\\hat{y}_i)^2 + \\alpha \\sum_{j=1}^p | \\beta_j|\n", "\\end{align*}\n", "\n", "\n", "The penalty function here is the l1 norm, or the sum of the absolute values of the coefficients. 
The major difference between the ridge and the lasso is that the lasso can generate coefficients that are zero, dropping them from the model and making it simpler.\n", "\n", "The Lasso regression is part of linear_model in scikit-learn [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_lasso_0 = linear_model.Lasso(alpha = 0.0).fit(X, base['Salary'])\n", "coef_norm_l0 = np.linalg.norm(res_lasso_0.coef_, ord=1)\n", "\n", "print(res_lasso_0.coef_)\n", "print('The l1 norm of the lasso(a=0) coefficients is {0:5.1f}.'.format(coef_norm_l0))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_lasso_2 = linear_model.Lasso(alpha = 2.0).fit(X, base['Salary'])\n", "coef_norm_2 = np.linalg.norm(res_lasso_2.coef_, ord=1)\n", "\n", "lassos = pd.DataFrame({'var':var_list, 'lasso 2': res_lasso_2.coef_}) \n", "print(lassos)\n", "print('\\nThe l1 norm of the lasso(a=2) coefficients is {0:5.1f}.'.format(coef_norm_2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice\n", "Take a few minutes and try the following. Feel free to chat with those around if you get stuck. The TA and I are here, too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Try estimating a lasso regression when $\\alpha = 5$. Compare the coefficient vector to the lasso with $\\alpha=2$ from above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Estimate lasso regressions with $\\alpha$ equal to 10 and 200. What is happening to the coefficients? What is happening to the norm of the coefficients? Does this make sense? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 
Use the LassoCV() function [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV) to estimate lasso regressions for the following grid of alphas:\n", "```python\n", "alpha_grid = [1, 1.5, 2, 3, 4, 5, 6, 8, 10]\n", "```\n", "Set the cv parameter to 10. This means the method will use k-fold cross validation with k=10." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Which $\\alpha$ worked best? \n", "5. Use the .mse_path_ attribute of the object returned by LassoCV() to return a (9,10) sized array. The 9 corresponds to the number of alphas and the 10 corresponds to the test mses from the 10-fold cross validation the function is using. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Create a plot with $\\alpha$ on the x-axis and the average test mse on the y-axis. To make sure the mse correspond to the correct alpha, use the .alphas_ attribute of the results object for the x values. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Now let the computer sort out the best tuning parameter. Call LassoCV() as before, but do not use the 'alphas' argument. The algorithm will search for the best $\\alpha$. \n", "\n", "What is the optimal $\\alpha$? \n", "\n", "What $\\alpha$'s did the algorithm try? (Use the .alphas_ attribute.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. Using the mse_path_ and alphas_ attributes, create a plot with $\\alpha$ on the x-axis and the average test mse on the y-axis."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "9. Display the optimal coefficient vector. Did the lasso eliminate any variables? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Attachments", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }