{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Plotting 1\n", "\n", "We have a handle on Python now: we understand the data structures and enough about working with them to move on to material more directly relevant to data analysis. We know how to get data into pandas from files, how to manipulate DataFrames, and how to do basic statistics. \n", "\n", "Let's work through a few more of matplotlib's basic figures. We will come back to figures later to work on some of the more complex visualizations. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # load the pandas package and call it pd\n", "import matplotlib.pyplot as plt # load the pyplot set of tools from the package matplotlib. Name it plt for short.\n", "from pandas_datareader import data, wb # we are grabbing the data and wb modules from the package\n", "import datetime as dt # for time and date\n", "\n", "# The following is a Jupyter magic command. It tells Jupyter to insert the plots into the notebook\n", "# rather than into a new window.\n", "%matplotlib inline " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bar charts\n", "Bar charts are useful for describing relatively few observations of categorical data, meaning that one of the axes is not quantitative. [Tufte](https://www.edwardtufte.com/tufte) would complain that they have a lot of redundant ink, but they are quite popular...and Tufte is not our dictator. Still, it's always good to think about what our figures are doing for us. \n", "\n", "Bar charts are much better than pie charts for displaying the relative size of data. There are discussions of this all over the net (here is [one](http://www.storytellingwithdata.com/blog/2011/07/death-to-pie-charts) I like), but the anti-pie-chart argument boils down to: pie charts are hard to read. \n", "1. 
Humans are bad at judging the relative sizes of 2D spaces. They cannot tell if one slice is 10% larger than another slice.\n", "2. The MS Excel style of coloring the slices different colors creates problems. Humans judge darker colors to have larger areas. \n", "3. To get quantitative traction, people label the slices with the data values. In this case, a table of numbers is probably a better way to share the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# PPP GDP per capita data from the Penn World Table\n", "\n", "code = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']\n", "country = ['United States', 'France', 'Japan', 'China', 'India',\n", " 'Brazil', 'Mexico']\n", "gdppc = [53.1, 36.9, 36.3, 11.9, 5.4, 15.0, 16.5]\n", "\n", "gdp = pd.DataFrame({'gdppc': gdppc, 'country': country}, index=code)\n", "gdp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(10,5))\n", "\n", "ax.bar(gdp.index, gdp['gdppc'], color='blue', alpha=0.25) # bar(x labels, bar heights)\n", "\n", "ax.spines['top'].set_visible(False)\n", "ax.spines['right'].set_visible(False)\n", "\n", "ax.set_ylabel('PPP GDP per capita')\n", "ax.set_title('Income per person (at purchasing power parity)')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ordering of the bars is pretty random. We could sort them from poor to rich." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(10,5))\n", "\n", "gdp_sort = gdp.sort_values('gdppc') # sort ascending: poorest country first\n", "\n", "ax.bar(gdp_sort.index, gdp_sort['gdppc'], color='blue', alpha=0.25) # bar(x labels, bar heights)\n", "ax.grid(axis='y', color='white')\n", "\n", "ax.spines['top'].set_visible(False)\n", "ax.spines['right'].set_visible(False)\n", "\n", "ax.set_title('Income per person (at purchasing power parity)')\n", "ax.set_ylabel('PPP GDP per capita')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the use of `grid()` to specify grid lines on the y axis. I made them white, so they only show up in the bars. It's something I'm experimenting with. I'm not sure I like it. \n", "\n", "Maybe you prefer a horizontal bar chart. Same data, same approach. We need to swap all the y labels for x labels. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Practice: Bar charts\n", "Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.\n", "\n", "1. Create a horizontal bar chart. Check the documentation for `barh()`.\n", "2. Fix up your figure labels, etc. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Create a new horizontal bar chart where each bar is GDP per capita relative to the United States. So USA = 1, MEX = 0.31, etc. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scatter plots\n", "\n", "Scatter plots are used to compare two variables. They are a very common way to visualize the correlation between two variables. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "codes = ['GDPC1', 'UNRATE'] # real gdp, unemployment rate\n", "start = dt.datetime(1970, 1, 1)\n", "fred = data.DataReader(codes, 'fred', start)\n", "\n", "fred.head()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gremlins! The gdp data is quarterly, but the unemployment rate is monthly. Let's fix this by downsampling to quarterly frequency. The FRED datareader is really good --- the index is already a datetime object. (How would you check?)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fred_q=fred.resample('q').mean() # Create an average quarterly unemployment rate\n", "fred_q.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fred_q['gdp_gr'] = fred_q['GDPC1'].pct_change()*100 # growth rate of gdp. we've seen this a few times...\n", "fred_q['unemp_dif'] = fred_q['UNRATE'].diff() # difference takes the first difference: u(t)-u(t-1) \n", "fred_q.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(10,5))\n", " \n", "ax.scatter(fred_q.gdp_gr, fred_q.unemp_dif)\n", "\n", "ax.set_title('Okun\\'s Law in the United States' )\n", "ax.set_ylabel('change in unemployment rate')\n", "ax.set_xlabel('gdp growth rate')\n", "\n", "ax.spines['top'].set_visible(False)\n", "ax.spines['right'].set_visible(False)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Practice: Scatters\n", "Take a few minutes and try the following. Feel free to chat with those around if you get stuck. The TA and I are here, too.\n", "\n", "Let's explore some of scatter plot's options. \n", "\n", "1. 
Change the color of the dots to red and lighten them up using `alpha`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check out the documentation for [marker styles](https://matplotlib.org/api/markers_api.html). \n", "\n", "2. Change the marker to a triangle. \n", "3. Use `text()` or `annotate()` to label the point corresponding to the third quarter of 2009: '2009Q3'." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scatter plots are very useful and we can do a lot more with them. Some places to go from here:\n", "1. Add a line of best fit. This is a bit clunky in matplotlib (use NumPy's `polyfit` function), but not too bad. Seaborn has a `regplot` function that makes this dead simple. \n", "2. Make data markers different colors or sizes depending on the value of a third variable. For example, you could get some more data and color the markers for years with a Republican president red and the markers for years with a Democratic president blue. \n", "3. Other ideas? Let me know!\n" ] } ], "metadata": { "celltoolbar": "Attachments", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }