5  Statistics

In this section we’ll discuss two of the most basic and useful statistical commands. You can do a great deal of valuable work with these commands, but our primary goal will be to help you understand how the syntax elements you learned earlier can be combined with statistical commands to do analysis.

Start by creating a do file that loads the auto data set and save it as stats.do:

capture log close
log using stats.log, replace

clear all
sysuse auto
-------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /home/r/rdimond/kb/stata_intro/stats.log
  log type:  text
 opened on:  26 Dec 2024, 15:02:22
(1978 automobile data)

6 Summary Statistics for Continuous Variables

summarize (or just sum) gives you summary statistics, which will help you understand the distribution of continuous (quantitative) variables. Start by adding sum all by itself to your do file and running it by pressing Ctrl-d or clicking the “play” button in the top right of your Stata window, then take a look at the output:

sum

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        make |          0
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5
-------------+---------------------------------------------------------
       trunk |         74    13.75676    4.277404          5         23
      weight |         74    3019.459    777.1936       1760       4840
      length |         74    187.9324    22.26634        142        233
        turn |         74    39.64865    4.399354         31         51
displacement |         74    197.2973    91.83722         79        425
-------------+---------------------------------------------------------
  gear_ratio |         74    3.014865    .4562871       2.19       3.89
     foreign |         74    .2972973    .4601885          0          1

This gives basic summary statistics for all the variables in your data set. Note that there is nothing for make: it is a string variable, so summary statistics don’t make sense. Also note that for rep78 the number of observations is 69 rather than 74. That’s because five missing values were ignored and the summary statistics were calculated over the remaining 69 values of rep78. Most statistical commands take a similar approach to missing values and that’s usually what you want, so you rarely have to include special handling for missing values in statistical commands.

On the other hand, rep78 is a categorical variable, so these summary statistics don’t make a lot of sense for it.
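To check the missing values of rep78 directly, one option is count combined with the missing() function. This is a minimal sketch; the sysuse line is repeated only so the block can run by itself, and it should report 5 and 69, matching the numbers above:

```stata
* Reload the example data so this block can run by itself
sysuse auto, clear

* Count the observations where rep78 is missing
count if missing(rep78)

* Count the observations where rep78 is not missing
count if !missing(rep78)
```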

All the syntax elements you learned earlier also work with statistical commands. To get summary statistics for just mpg, give sum a variable list:

sum mpg

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41

If you want summary statistics for just the foreign cars, add an if condition:

sum mpg if foreign

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         22    24.77273    6.611187         14         41
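The condition if foreign works because Stata treats any nonzero value as true. A missing value would also count as true in that bare form; foreign has no missing values in this data set, so it makes no difference here. A sketch of the more explicit form, which also makes it easy to pick out the domestic cars:

```stata
* Reload the example data so this block can run by itself
sysuse auto, clear

* Same subsample as "if foreign", written as an explicit comparison
sum mpg if foreign == 1

* The domestic cars are the ones where foreign equals 0
sum mpg if foreign == 0
```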

If you want summary statistics of mpg for both foreign and domestic cars, calculated separately, use by:

by foreign: sum mpg

-------------------------------------------------------------------------------
-> foreign = Domestic

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         52    19.82692    4.743297         12         34

-------------------------------------------------------------------------------
-> foreign = Foreign

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         22    24.77273    6.611187         14         41

This is one way to compare the two groups; we’ll learn another soon.
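The by prefix requires the data to be sorted by the grouping variable. The auto data set happens to be sorted by foreign already, which is why the command above runs without complaint. With other data you may need to sort first; a sketch using bysort, which does both steps at once:

```stata
* Reload the example data so this block can run by itself
sysuse auto, clear

* bysort sorts the data by foreign, then runs sum once per group
bysort foreign: sum mpg
```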

The detail (d) option will give more information. Try:

sum mpg, detail

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  74
25%           18             14       Sum of wgt.          74

50%           20                      Mean            21.2973
                        Largest       Std. dev.      5.785503
75%           25             34
90%           29             35       Variance       33.47205
95%           34             35       Skewness       .9487176
99%           41             41       Kurtosis       3.975005
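summarize also stores its results so that later commands can use them. A brief sketch using return list and display; r(mean) and r(p50) are the stored mean and median, and r(p50) is only available after the detail option:

```stata
* Reload the example data so this block can run by itself
sysuse auto, clear

* quietly suppresses the printed table; the results are still stored in r()
quietly sum mpg, detail

* List everything summarize stored
return list

* Use individual stored results
display "Mean: " r(mean)
display "Median: " r(p50)
```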
Exercise

Find the mean price of cars that get more than 25 miles per gallon. Now compare that with the mean price of cars that get 25 miles per gallon or less. Does this mean American consumers in 1978 considered high gas mileage a negative characteristic and were willing to pay more to avoid it?

sum price if mpg>25

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         14    4400.286    786.7804       3299       6486

sum price if mpg<=25

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         60    6577.083    3117.013       3291      15906

Cars with low gas mileage cost a lot more, on average. But maybe there’s a confounding variable involved? Consider weight:

sum weight if mpg>25

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      weight |         14    2122.857    386.0536       1760       3260

sum weight if mpg<=25

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      weight |         60    3228.667     692.283       1930       4840

Cars with low gas mileage were also a lot heavier. Perhaps American consumers in 1978 were just willing to pay more for big cars, including the cost of burning more gas.

6.1 Frequencies for Categorical Variables

tabulate (tab) will create tables of frequencies, which will help you understand the distribution of categorical variables. It can also be useful for string variables that describe categories or groups.

If you give tab a variable list with one variable it will give you a one-way table, while if you give it two variables it will give you a two-way table (i.e. crosstabs). To get an idea of what tab does, add the following to your do file and run it:

tab rep78

     Repair |
record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

tab rep78 foreign

    Repair |
    record |      Car origin
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
         2 |         8          0 |         8 
         3 |        27          3 |        30 
         4 |         9          9 |        18 
         5 |         2          9 |        11 
-----------+----------------------+----------
     Total |        48         21 |        69 

Tables are usually easier to read if the variable with the most unique values comes first, so that its values are listed vertically.

If you’re interested in frequencies across more than two variables, check out the table command.

The tab command has a rich set of useful options. The missing values of rep78 were not included in the table, which makes it easy to forget they’re there. Add them with the missing option:

tab rep78, missing

     Repair |
record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00

By default tab will show value labels, but you can override this with the nolabel option. A quick and easy way to find the values underneath the value labels is to run two tab commands, one without nolabel and one with it:

tab foreign

 Car origin |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

tab foreign, nolabel

 Car origin |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         52       70.27       70.27
          1 |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00
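Another way to see the mapping is to list the value label itself. This sketch assumes the label attached to foreign is named origin, which is what describe reports for the auto data set; with other data, check the “value label” column of describe first:

```stata
* Reload the example data so this block can run by itself
sysuse auto, clear

* The "value label" column shows which label is attached to foreign
describe foreign

* Show the numeric values and the text attached to each
label list origin
```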

To get percentages in a two-way table add the row, column, or cell options:

tab rep78 foreign, row column cell

+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
|  cell percentage  |
+-------------------+

    Repair |
    record |      Car origin
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
           |    100.00       0.00 |    100.00 
           |      4.17       0.00 |      2.90 
           |      2.90       0.00 |      2.90 
-----------+----------------------+----------
         2 |         8          0 |         8 
           |    100.00       0.00 |    100.00 
           |     16.67       0.00 |     11.59 
           |     11.59       0.00 |     11.59 
-----------+----------------------+----------
         3 |        27          3 |        30 
           |     90.00      10.00 |    100.00 
           |     56.25      14.29 |     43.48 
           |     39.13       4.35 |     43.48 
-----------+----------------------+----------
         4 |         9          9 |        18 
           |     50.00      50.00 |    100.00 
           |     18.75      42.86 |     26.09 
           |     13.04      13.04 |     26.09 
-----------+----------------------+----------
         5 |         2          9 |        11 
           |     18.18      81.82 |    100.00 
           |      4.17      42.86 |     15.94 
           |      2.90      13.04 |     15.94 
-----------+----------------------+----------
     Total |        48         21 |        69 
           |     69.57      30.43 |    100.00 
           |    100.00     100.00 |    100.00 
           |     69.57      30.43 |    100.00 

Note the key at the top, which tells you which number is which.

For this table, the row option answers the question “What percentage of cars with a rep78 of one are domestic?” while the column option answers “What percentage of domestic cars have a rep78 of one?” The cell option answers “What percentage of all the cars are both domestic and have a rep78 of one?” Usually you are only interested in one of those questions and will only need one of the corresponding options.
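The three percentages in the first cell can be reproduced by hand from the frequencies, which may help keep the options straight. A small sketch using display, with the counts copied from the table above (2 domestic cars with a rep78 of one, 2 cars with a rep78 of one, 48 domestic cars, 69 cars in the table):

```stata
* Row percentage: domestic cars with rep78 of 1 / all cars with rep78 of 1
display %6.2f 100*2/2

* Column percentage: domestic cars with rep78 of 1 / all domestic cars
display %6.2f 100*2/48

* Cell percentage: domestic cars with rep78 of 1 / all cars in the table
display %6.2f 100*2/69
```

These should print 100.00, 4.17, and 2.90, matching the first cell of the table.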

tab has an option called sum which gives summary statistics for a given variable, calculated over the observations in each cell of the table. Try:

tab foreign, sum(mpg)

            |      Summary of Mileage (mpg)
 Car origin |        Mean   Std. dev.       Freq.
------------+------------------------------------
   Domestic |   19.826923   4.7432972          52
    Foreign |   24.772727   6.6111869          22
------------+------------------------------------
      Total |   21.297297   5.7855032          74

This is another easy way to compare groups.

There’s also a chi2 option that runs a chi-squared test on a two-way table:

tab rep78 foreign, chi2

    Repair |
    record |      Car origin
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
         2 |         8          0 |         8 
         3 |        27          3 |        30 
         4 |         9          9 |        18 
         5 |         2          9 |        11 
-----------+----------------------+----------
     Total |        48         21 |        69 

          Pearson chi2(4) =  27.2640   Pr = 0.000

Here we’d reject the null hypothesis that rep78 and foreign are independent of each other, except that the expected number of observations in some of the cells is too low for a chi-squared test to be reliable.
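To see how small the expected counts are, tab can print them with the expected option. The exact option requests Fisher’s exact test, which does not rely on the large-sample approximation behind the chi-squared test. This is a sketch of the syntax, not a recommendation for any particular analysis:

```stata
* Reload the example data so this block can run by itself
sysuse auto, clear

* Print the expected frequency under independence below each observed count
tab rep78 foreign, expected

* Report Fisher's exact test alongside the chi-squared test
tab rep78 foreign, chi2 exact
```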

Exercise

Use tab to compare the mean value of price associated with each rep78 category. What relationship do you see? How does that relationship change if you examine foreign and domestic cars separately? (Ignore for the moment the small sample sizes for many of the cells.)

tab rep78, sum(price)

     Repair |          Summary of Price
record 1978 |        Mean   Std. dev.       Freq.
------------+------------------------------------
          1 |     4,564.5   522.55191           2
          2 |   5,967.625   3,579.357           8
          3 |   6,429.233    3,525.14          30
          4 |     6,071.5   1,709.608          18
          5 |       5,913   2,615.763          11
------------+------------------------------------
      Total |   6,146.043    2,912.44          69

Well, the least reliable cars are the least expensive (ignoring the fact that there are only two of them) but beyond that there’s not much of a pattern. Now add foreign:

tab rep78 foreign, sum(price)

            Means, Standard Deviations and Frequencies of Price

    Repair |
    record |     Car origin
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |   4,564.5          . |   4,564.5
           | 522.55191          . | 522.55191
           |         2          0 |         2
-----------+----------------------+----------
         2 | 5,967.625          . | 5,967.625
           | 3,579.357          . | 3,579.357
           |         8          0 |         8
-----------+----------------------+----------
         3 | 6,607.074  4,828.667 | 6,429.233
           | 3,661.267  1,285.613 |  3,525.14
           |        27          3 |        30
-----------+----------------------+----------
         4 | 5,881.556  6,261.444 |   6,071.5
           | 1,592.019  1,896.092 | 1,709.608
           |         9          9 |        18
-----------+----------------------+----------
         5 |   4,204.5  6,292.667 |     5,913
           | 311.83409  2,765.629 | 2,615.763
           |         2          9 |        11
-----------+----------------------+----------
     Total |  6,179.25  6,070.143 | 6,146.043
           | 3,188.969  2,220.984 |  2,912.44
           |        48         21 |        69

Now we see two distinct patterns. For the domestic cars, a rep78 of 3 has the highest mean price, and both 1 and 5 are cheaper. But for foreign cars, 3 is associated with the lowest mean price, and price consistently increases with rep78.

You shouldn’t take any of this very seriously (the data set is too small to support breaking it into this many groups, so we can’t say anything reliable about them), but it does illustrate how things can change when you look at different subpopulations.

This do file does not make any changes to the data set it uses, so there’s no need to save a new version of it. But to finish your do file properly you should have it close its log:

log close
      name:  <unnamed>
       log:  /home/r/rdimond/kb/stata_intro/stats.log
  log type:  text
 closed on:  26 Dec 2024, 15:02:22
-------------------------------------------------------------------------------

Now that you understand the basics of how statistical commands work in Stata, learning more of them will be easy.