{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regression with discrete dependent variables\n", "\n", "files needed = ('pntsprd.dta', 'apple.dta')\n", "\n", "We continue to learn about the statsmodels package [(docs)](https://devdocs.io/statsmodels/), which provides functions for formulating and estimating statistical models. Econometrics is a prerequisite for this course, so this notebook will not address the models, per se, but will focus on how to take what you learned in econometrics class and use it in python. \n", "\n", "In this notebook we take on models in which the dependent variable is discrete. In the examples below, the dependent variable is binary (which makes it easier to visualize), but many of the techniques we demo here can be extended to dependent variables with a discrete number of values. \n", "\n", "[Here](http://www.statsmodels.org/0.6.1/examples/notebooks/generated/discrete_choice_overview.html) is a nice overview of the discrete choice models in statsmodels. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # for data handling\n", "import numpy as np # for numerical methods and data structures\n", "import matplotlib.pyplot as plt # for plotting\n", "import seaborn as sea # advanced plotting\n", "\n", "import statsmodels.formula.api as smf # provides a way to directly spec models from formulas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading Stata data files\n", "\n", "Let's continue to work on problems from Wooldridge's textbook in econometrics. We read the (mercifully) cleaned data files using the pandas method .read_stata( ) that [reads stata files](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_stata.html). \n", "\n", "The file 'pntsprd.dta' contains data about vegas betting. The complete variable list is [here](http://fmwww.bc.edu/ec-p/data/wooldridge/pntsprd.des). We will use favwin which is equal to 1 if the favored team won and zero otherwise and spread which holds the betting spread. In this context, a spread is the number of points that the favored team must beat the unfavored team by in order to be counted as a win by the favored team. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use pandas read_stata method to get the stata formatted data file into a DataFrame.\n", "vegas = pd.read_stata('pntsprd.dta')\n", "\n", "# Take a look...so clean!\n", "vegas.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vegas.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(15,6))\n", "\n", "ax.scatter( vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='red')\n", "\n", "ax.set_ylabel('favored team outcome (win = 1, loss = 0)')\n", "ax.set_xlabel('point spread')\n", "ax.set_title('The data from the point spread dataset')\n", "\n", "sea.despine(ax=ax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OLS\n", "\n", "We begin with the linear probability model. The model is \n", "\n", "$$\\text{Pr}(favwin=1 \\mid spread) = \\beta_0 + \\beta_1 spread + \\epsilon .$$\n", "\n", "There is nothing new here technique-wise. We are estimating this with ols. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# statsmodels adds a constant for us...\n", "res_ols = smf.ols('favwin ~ spread', data=vegas).fit()\n", "\n", "print(res_ols.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hypothesis testing with t-test\n", "If bookies were all-knowing, the spread would **exactly** account for the predictable winning probability and all we would be left with is the noise --- the intercept should be one-half. Is it true in the data? We can use the t_test( ) method of the results object to perform t-tests. \n", "\n", "The null hypothesis is $H_0: \\beta_0 = 0.5$ and the alternative hypothesis is $H_1: \\beta_0 \\neq 0.5$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t_test = res_ols.t_test('Intercept = 0.5')\n", "print(t_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Linear probability models have some problems. Perhaps the biggest one is that there is no guarantee that the predicted probability lies between zero and one! \n", "\n", "We can use the predictedvalues attribute of the results object to recover the fitted values of the y variables. Let's plot them and take a look. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(15,6))\n", "\n", "ax.scatter(vegas['spread'], res_ols.fittedvalues, facecolors='none', edgecolors='red')\n", "ax.axhline(y=1.0, color='grey', linestyle='--')\n", "\n", "ax.set_ylabel('pedict probability of winning')\n", "ax.set_xlabel('point spread')\n", "ax.set_title('Predicted winning probabilities from an OLS model')\n", "\n", "sea.despine(ax=ax, trim=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic regression (logit)\n", "The logistic regression passes the linear model through a non-linear function that constrains the output to lie between zero and one. (These functions are cumulative distribution functions.) In the logistic case, the function looks like\n", "\n", "$$\\text{prob} = \\frac{\\exp \\left({\\beta_0+\\beta_1 spread}\\right)}{1+\\exp \\left({\\beta_0+\\beta_1 spread}\\right)},$$\n", "\n", "and we predict a team wins when ever $\\text{prob} \\ge 0.5$.\n", "\n", "We estimate the logit model with logit( ) method from smf in a way similar to ols. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_log = smf.logit('favwin ~ spread', data=vegas).fit()\n", "print(res_log.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interpreting logit coefficients is bit more complicated. The probability that a team wins is given by the expression\n", "\n", "$$\\text{prob} = \\frac{\\exp \\left({\\beta_0+\\beta_1 spread}\\right)}{1+\\exp \\left({\\beta_0+\\beta_1 spread}\\right)}$$\n", "\n", "The res_log.fittedvalues seems to hold the fitted value of $\\beta_0+\\beta_1 spread$ and not the estimated probability. Let's compute it using the exp method of numpy. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pred_probs = np.exp(res_log.fittedvalues) /( 1+np.exp(res_log.fittedvalues) )\n", "pred_probs.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the estimated probabilty of the favored team winning and the actual data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(15,6))\n", "\n", "ax.scatter(vegas['spread'], pred_probs, facecolors='none', edgecolors='red', label='predicted')\n", "ax.scatter(vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='blue', label = 'data')\n", "ax.axhline(y=1.0, color='grey', linestyle='--')\n", "\n", "ax.set_ylabel('pedict probability of winning')\n", "ax.set_xlabel('point spread')\n", "ax.set_title('Predicted winning probabilities from a logit model')\n", "\n", "ax.legend(frameon=False)\n", "sea.despine(ax=ax, trim=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice\n", "\n", "Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.\n", "\n", "1. Load the data 'apple.dta'. The data dictionary can be found [here](http://fmwww.bc.edu/ec-p/data/wooldridge/apple.des). The variable ecolbs is purchases of eco-friendly apples (whatever that means). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Create a variable named ecobuy that is equal to 1 if the observation has a positive purchase of eco-apples (i.e., ecolbs>0)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Estimate a linear probability model relating the probability of purchasing eco-apples to household characteristics. \n", "\n", "$$\\text{ecobuy} = \\beta_0 + \\beta_1 \\text{ecoprc} + \\beta_2 \\text{regprc} + \\beta_3 \\text{faminc} + \\beta_4 \\text{hhsize} + \\beta_5 \\text{educ} + \\beta_6 \\text{age} + \\epsilon$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. How many estimated probabilities are negative? Are greater than zero?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Now estimate the model as a probit. A probit is similar to a logit in that we are passing the linear model through a nonlinear function. In this case, the nonlinear function is the cumulative density function of the normal distribution. \n", "\n", "$$\\text{Pr}(\\text{ecobuy}=1 \\mid X) = \\Phi \\left(\\beta_0 + \\beta_1 \\text{ecoprc} + \\beta_2 \\text{regprc} + \\beta_3 \\text{faminc} + \\beta_4 \\text{hhsize} + \\beta_5 \\text{educ} + \\beta_6 \\text{age} \\right),$$\n", "\n", "where $\\Phi( )$ is the CDF of the normal distribution. Try sfm.probit( )." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Compute the **marginal effects** of the coefficients at **the means** and print them out using summary(). You can get the marginal effects from the results object using .get_margeff() [(docs)](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.ProbitResults.get_margeff.html#statsmodels.discrete.discrete_model.ProbitResults.get_margeff). \n", "\n", "The marginal effect tells us the marginal change in predicted probability as the independent variables change." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Re-estimate the model as a logit model. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. Compute the marginal effects of the logit coefficients. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We haven't done much data wrangling lately. I'm feeling a bit sad; I miss shaping data. \n", "\n", "9. Create a pandas DataFrame with the row index 'ecoprc', 'regprc', 'faminc', 'hhsize', 'educ', and 'age'. The columns should be labeled 'logit', 'probit', and 'ols'. The columns should contain the marginal effects for the logit and probit models and the coefficients from the ols model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Attachments", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }