{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine learning I\n", "\n", "files needed = ('auto.csv')\n", "\n", "In the next few classes, we will introduce some ideas from 'machine learning.' Machine learning sounds really complicated, but the good news is that, as trained econometricians, you already know a lot of machine learning. \n", "\n", "Our goals here are modest. We would like to\n", "1. learn a bit about how machine learning is similar to, and different from, econometrics. \n", "2. introduce the scikit-learn package which is chock full of 'machine learning' tools. \n", "3. work on some *validation* methods, which are an important part of the machine learning toolkit. \n", "\n", "This notebook is loosely based on Chapters 2 and 5 from *An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani. This is an easy to follow introduction that is light on the mathematics behind the methods. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inference v. prediction\n", "\n", "Suppose we have data on a variable of interest $y$ and variables we think are important for determining $y$, $x_1, x_2,...x_p$. We assume that $y$ is related to the $x$ variables according to \n", "\n", "$$y = f(X)+\\epsilon$$\n", "\n", "where $X$ is the matrix of $x$ variables. The $f()$ function represents the *systematic* or *structural* relationship between $y$ and $X$ and $\\epsilon$ is the *noise* or *error term*.\n", "\n", "An important part of both machine learning and econometrics is estimating $f()$. Why we care about $f()$ is where machine learning and econometrics often (but not always) diverges. \n", "\n", "* Machine learning often prioritizes **prediction**: The estimate of $f()$ is important in that it is useful for out-of-sample prediction. Can I predict the value of the S&P 500 index tomorrow? \n", "* Econometrics often prioritizes **inference**: The estimate of $f()$ is important in that it tells us about how $y$ and $X$ are related. How is the S&P 500 value related to interest rates? Inference is often associated with some kind of theoretical model in mind that lays out relationships between the $y$ and $X$ variables. The estimate of $f()$ helps us quantify and better understand these models. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specifying f( )\n", "\n", "How do we know which variables to include in $X$, or what is $f(\\; )$? In econometrics, we use a model to guide our choice of $f()$. This reflects econometrics' goal of inference. \n", "\n", "What if we are interested only in prediction? Shouldn't we just throw everything into $f(\\; )$ and see what happens? Shouldn't more data be better than less data? Shouldn't more flexibility in specifying $f(\\;)$ be better than less flexibility? \n", "\n", "Perhaps surprisingly, the answer is 'no.' An important problem when inference is not a concern in the that of *overfitting* the data. Overfitting occurs when our estimate of $f(\\;)$ captures too much of the noise, $\\epsilon$. The noise has nothing to do with the structural relationship $f(\\;)$, which will make our estimate $f(\\;)$ a bad predictor. \n", "\n", "An important part of the machine learning (prediction) approach is validating the model in respect to its ability to predict. This is a bit different than the approach we learned in econometrics, and we will dig into it in a minute... \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predicting with OLS\n", "\n", "Let's start with predicting a variable using ols as our method. This means that $f(\\;)$ is a linear function of $X$. As econometricians, we know a lot about ols. \n", "\n", "We first need to figure out how to perform ols in scikit-learn." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load data on automobile fuel efficiency. The data are available [here](http://www-bcf.usc.edu/~gareth/ISL/Auto.data). I have already converted the data into a csv file. Load it into a DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# There are a few missing values coded as '?'. \n", "auto = pd.read_csv('auto.csv', na_values='?')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What do we have?\n", "print(auto.head())\n", "print('\\n\\n')\n", "print(auto.info())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like we are missing a few observations on horsepower. Drop them. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto = auto.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear regression in scikit learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For machine learning in python, scikit is the standard. It contains both the methods to estimate $f(\\;)$ and the methods to validate the models. Linear regression is the starting point for estimating $f(\\; )$. \n", "\n", "Here is how it works in scikit learn [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If you are running anaconda, this should already be installed. If not\n", "# pip install --user scikit-learn\n", "\n", "# From scikit-learn, peel off the linear_model package\n", "from sklearn import linear_model " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn is a bit clunkier than statsmodels. In statsmodels we had a nice way to specify the regression model with a string. In scikit-learn, we directly pass the data to the .fit() method. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = linear_model.LinearRegression().fit(auto[['horsepower', 'weight']], auto['mpg'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's break this down. \n", "\n", "1. The code linear_model.LinearRegression() creates a model object.\n", "2. The code .fit(auto[['horsepower', 'weight']], auto['mpg']) estimates a linear object where y=mpg and X = [horsepower, weight]. Notice that we pass X, then y: .fit(X_vars, y_var). As in statsmodels, we get the **constant for free**. \n", "3. Like in statsmodels, the fit method of the linear regression model returns a results object. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Try a tab complete. What looks good?\n", "res" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('The coefficients except the intercept:', res.coef_, '\\n')\n", "print('The intercept:', res.intercept_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set validation approach\n", "\n", "Suppose we would like to predict the mpg of a car given its characteristics. Which characteristics should we include? \n", "\n", "We would like to include characteristics that make the model good for prediction. Here is one way to go.\n", "\n", "1. We break our dataset up into two subsets. Call one the *training set* and the other the *testing set*. \n", "2. We estimate our model on the training data. We measure how well our model fits the data by looking at the mean squared error of the regression or, alternatively, the r-squared, which normalizes the mse by the total squared error. We call these the **training mse or the training $r^2$**. This is usually what we care about when we are doing inference. \n", "3. Using the X data in the testing data, we use the estimated $f(\\;)$ from step 2 to predict value for y\n", "$$\\hat{y} = \\hat{f}(X_{test}).$$\n", "4. We then compare the predicted ($\\hat{y}$) values with the actual ($y$) values in the training data. Again, we can use mse or r-squared to judge how well our predicted data matched the actual data. Call this the **test mse or test $r^2$**. This is what we usually care about when we are interested in prediction. \n", "\n", "An important conceptual point is that **improving the training mse does not always improve the test mse**. This is the overfitting problem. By making the model more flexible and adding more variables to $X$ we can improve the training mse (a lot). The problem is that we will start to capture more and more of the $\\epsilon$ values in our estimate of $f( )$, which will make it a worse estimate of the true $f(\\;)$. \n", "\n", "Note that using the test mse/r-squared to validate the model is a general idea. We are currently thinking about ols models of $f(\\;)$ but this approach (and modifications of it) are applicable to many different approaches to estimating $f(\\;)$.\n", "\n", "More on this later. For now, the point is that we need to pay attention to the test mse. How do we implement the algorithm above? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 1:\n", "First, we need to break the data up into two sets. We could do it manually (slicing!) but it is better if we let python do it for us. From scikit-learn we import some tools to do it. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pass the function the data you want in your regression and it passes back two datasets [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).\n", "\n", "If we do nothing with the random_state parameter, the function will split the data into two randomly assigned groups. If we assign a random state, then we will always get the same division of the dataset. This way we all get the same answer. \n", "\n", "We can specify how large to make the test and train samples, too. If we do not, then 25 percent of the observations are assigned to the test and 75 percent to the train. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rs = 33 # Set the random state. This ensures that we have the same answers.\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(auto[['horsepower', 'weight']], auto['mpg'], random_state = rs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(type(X_train))\n", "print('The training dataset has {0} entries and the testing dataset has {1} entries.'.format(len(X_train), len(X_test)) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 2:\n", "Now we estimate (fit) the model using the training data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_train_1 = linear_model.LinearRegression().fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('The coefficients except the intercept:', res_train_1.coef_, '\\n')\n", "print('The intercept:', res_train_1.intercept_)\n", "print('The training r-square is:', res_train_1.score(X_train, y_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Steps 3 & 4: \n", "We next predict using the testing data, and find the training $r^2$.\n", "\n", "scikit learn wraps this up in one command. From the results object, we find the .score() method. Pass it the testing data and it passes back the training $r^2$. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r2_1 = res_train_1.score(X_test, y_test)\n", "print(r2_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this **specific split of the data**, the training r-squared is less than the testing r-squared. \n", "\n", "Let's do that again, but using a different randomly determined split. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rs = 50 # A different random sample\n", "\n", "# step 1: break up the data\n", "X_train, X_test, y_train, y_test = train_test_split(auto[['horsepower', 'weight']], auto['mpg'], random_state = rs)\n", "\n", "# step 2: train the model\n", "res_train_2 = ols.fit(X_train, y_train)\n", "print('The estimated coefficients except the intercept:', res_train_2.coef_)\n", "print('The intercept:', res_train_2.intercept_)\n", "print('The training r-square is {0:.3}'.format( res_train_2.score(X_train, y_train)))\n", "\n", "# steps 3 & 4: predict and score \n", "r2_2 = res_train_2.score(X_test, y_test)\n", "print('The testing r-squared is {0:.3}'.format(r2_2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the testing r-squared is greater than the training r-squared. \n", "\n", "As the sample changes, so do the metrics of fit. This suggests that repeatedly estimating and validating the sample on randomly assigned testing and training sets will help average out this variation. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice\n", "\n", "Let's use set validation to decide if adding a quadratic horsepower term increases the fit on the test data. We will perform 20 estimations & validations of each model, and compare the average test r2s. \n", "\n", "Of course, we will not write the estimation/validation out 40 times! We can wrap our computation up in a user-defined function, and then put the whole thing in a loop. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Write a function named 'set_val' that takes the X data, the y data and a value for the random state as parameters. Inside the function, split the data into test and training sets using the random state it was passed. Estimate the model on the training data, and return the .score( ) using the testing data. \n", "\n", "2. Test your function with the call\n", "python\n", "set_val(auto[['horsepower', 'weight']], auto['mpg'], 33)\n", "\n", "\n", "The answer should be 0.656570...as it is above. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Call set_val(auto[['horsepower', 'weight']], auto['mpg'], rs) 20 times, passing a new value for rs each time (I would use a for loop). The random state values are in the list below. Save the $r^2$ score for each evaluation in a list named lin_res. \n", "\n", "python\n", "rss = [10, 2, 5, 3, 1000, 9, 38, 61, 999, 1, 25, 13, 48, 97, 11, 65, 88, 46, 55, 17]\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check: Is lin_res filled with 10 numbers? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's do the same thing again, but add a quadratic horsepower term. \n", "\n", "4. Create a new column in the auto DataFrame that holds the squared value of 'horsepower.' Do not loop over the rows of the DataFrame to do so. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. As you did in part 3., estimate and validate the model 10 times, using the same rss vector for the random states. The call to set_val now looks like: set_val(auto[['horsepower', 'weight', 'hp2']], auto['mpg'], rs). Save the $r^2$ score for each evaluation in a list named quad_res. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6A. Compute the means of lin_res and quad_res. Does the model with the quadratic horsepower term perform better or worse than the model with only a linear horsepower term? \n", "\n", "6B. Compute the standard deviation of lin_res and quad_res." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Plot two histograms in the same figure (so use two axes objects). In one plot the histogram of the r-squared for the model with only a linear horsepower term. In the second plot the histogram of the r-squared for the model with both the linear and quadratic terms. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our approach looks like it is converging to a central tendency, but there is certainly some variation. We will move on to more sophisticated methods next. " ] } ], "metadata": { "celltoolbar": "Attachments", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }