{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Instrumental variables regression\n", "\n", "files needed = ('card.dta', )\n", "\n", "We introduce the **linearmodels** package in this notebook [(docs)](https://bashtage.github.io/linearmodels/doc/). The linearmodels package is built on top of statsmodels and provides us with extended functionality, including instrumental variables regressions and panel regressions. \n", "\n", "We will use linearmodels to perform instrumental variables regressions (two stage least squares). Along the way, we will practice using ols from statsmodels a bit, too. \n", "\n", "We need to install linearmodels. Open a command window and type \n", "python\n", "pip install --user linearmodels\n", "\n", "then hit enter. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # for data handling\n", "import numpy as np # for numerical methods and data structures\n", "import matplotlib.pyplot as plt # for plotting\n", "import seaborn as sea # advanced plotting\n", "\n", "import statsmodels.formula.api as smf # provides a way to directly spec models from formulas\n", "import linearmodels.iv as iv # built on statsmodels, this package provides iv in an easy way " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Medical expenditure panel survey\n", "\n", "Let's use data from the [medical expenditure panel survey](https://meps.ahrq.gov/mepsweb/). It surveys households, medical providers, and employers. There is a lot of interesting data here. \n", "\n", "The linearmodels package includes a dataset based on this data. Let's load it. You can see the variable list definitions [here](https://bashtage.github.io/linearmodels/doc/iv/examples/using-formulas.html). The example below is based on the example they provide. The data cover people over the age of 65.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from linearmodels.datasets import meps\n", "med_exp = meps.load()\n", "med_exp.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# It's clean! \n", "med_exp.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# But there are a few nans. Drop them.\n", "med_exp = med_exp.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OLS\n", "We would like to understand the impact of employer (or union) provided healthcare on drug expenditure. \n", "\n", "$$\\log(\\text{drugexp}) = \\beta_0 + \\beta_1 \\text{linc} + \\beta_2 \\text{hi_empunion} + \\epsilon .$$\n", "\n", "Start with ols." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_ols = smf.ols('np.log(drugexp) ~ linc + hi_empunion', data=med_exp).fit(cov_type = 'HC3')\n", "print(res_ols.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The coefficient on employer health insurance is positive. People who had insurance through work spend more on drugs in their old age...or do they?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Instrumental variables \n", "\n", "The concern is that *hi_empunion* is not exogenous. We will use *ssiratio* as an instrument for *hi_emp_union*. The purpose of the instrument is to remove the endogeneity of *hi_empunion.* A good instrument is\n", "\n", "1. Correlated with the endogenous variable (in this case, *hi_empunion*) \n", "2. Uncorrelated with the error term\n", "\n", "We can test the first condition. Regress \n", "\n", "$$\\text{hi_empunion} = \\beta_1 \\text{ssiratio} + \\nu .$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_s1 = smf.ols('hi_empunion ~ ssiratio + linc', data=med_exp).fit(cov_type = 'HC3')\n", "print(res_s1.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's significant, so it passes the the first test. The second test cannot generally be checked. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Two stage least squares\n", "\n", "Note that our test of condition 1 generates fitted variables that are washed of their endogeneity. We can use the fitted variables in our original regression as an independent variable,\n", "\n", "$$\\log(\\text{drugexp}) = \\beta_0 + \\beta_1 \\text{linc} + \\beta_2 \\widehat{\\text{hi_empunion}} + \\epsilon .$$\n", "\n", "\n", "We refer to the regression of *hi_empunion* on *ssiratio* as the **first stage regression** and the regression above as the **second stage regression**. In general, the two step procedure is\n", "\n", "1. Regress the endogenous variable on the instrument(s)\n", "2. Estimate the original model, after replacing the endogenous variable with the fitted values from step 1.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "med_exp['fitted'] = res_s1.fittedvalues\n", "med_exp['fitted'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_s2 = smf.ols('np.log(drugexp) ~ linc + fitted', data=med_exp).fit(cov_type = 'HC3')\n", "print(res_s2.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two problems with this procedure. 1. It's tedious. 2. The standard errors are not correct. \n", "\n", "Both of these problems are solved by using a two state least squares estimator. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2SLS with linearmodels\n", "\n", "We modify the patsy notation by specifying the endogenous variable and the instrument in square brackets. A couple minor differences in syntax compared to statsmodels. \n", "\n", "1. We need the .from_formula method if we are not specifying the design matrices directly. \n", "2. We need to include 1 to get the constant\n", "3. There is not a named parameter data. We have to pass from_formula( formual, data)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_iv = iv.IV2SLS.from_formula('np.log(drugexp) ~ 1 + linc + [hi_empunion ~ ssiratio]', med_exp).fit()\n", "print(res_iv.summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The point estimates are the same, but the standard errors differ. We can see the first-stage regression. Compare them to our first stage above. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(res_iv.first_stage)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice\n", "\n", "Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.\n", "\n", "The file 'card.dta' contains data about individuals' wages and characteristics. The complete variable list is [here](http://fmwww.bc.edu/ec-p/data/wooldridge/card.des). \n", "\n", "1. Load the data. Make sure everything looks okay." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Scatter plot log(wage) (y-axis) and educ (x-axis). \$The log is the natural log, np.log()\$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Regress wages on education. Use OLS. \n", "\n", "$$\\log(\\text{wage}) = \\beta_0 + \\beta_1 \\text{educ} + \\epsilon .$$\n", "\n", "Make life interesting. Do not use the *lwage* variable. Roll your own. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the variable *nearc4* as an instrument. The variable is equal to 1 if the person grew up near a bachelor degree granting institution. \n", "\n", "4. Scatter plot *educ* (y-axis) and *nearc4* (x-axis). Is this plot very useful? \$If you finish early, come back to this plot and see if you can improve on it.\$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Does *nearc4* satisfy the first condition for *nearc4* to be a good instrument for education?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Estimate \n", "\n", "$$\\log(\\text{wage}) = \\beta_0 + \\beta_1 \\text{educ} + \\epsilon .$$\n", "\n", "using *nearc4* as an instrument for *educ*. Print out the results summary.\n", "\n", "Make life interesting. Do not use the *lwage* variable. Roll your own. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Go nuts. Estimate \n", "\n", "$$\\log(\\text{wage}) = \\beta_0 + \\beta_1 \\text{exper} + \\beta_2 \\text{exper}^2 + \\beta_3 \\text{black} + \\beta_4 \\text{smsa}+ \\beta_5 \\text{south} + \\beta_6 \\text{educ} + \\epsilon .$$\n", "\n", "Instrument *educ* with both *nearc4* and *nearc2*. \n", "\n", "Make life interesting. Do not use the *expersq* or *lwage* variables. Roll your own. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Attachments", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }