In this section we’ll discuss two of the most basic and useful statistical commands. You can do a great deal of valuable work with these commands, but our primary goal will be to help you understand how the syntax elements you learned earlier can be combined with statistical commands to do analysis.
Start by creating a do file that loads the auto data set and save it as stats.do:
capture log close
log using stats.log, replace
clear all
sysuse auto
-------------------------------------------------------------------------------
name: <unnamed>
log: /home/r/rdimond/kb/stata_intro/stats.log
log type: text
opened on: 26 Dec 2024, 15:02:22
(1978 automobile data)
6 Summary Statistics for Continuous Variables
summarize (or just sum) gives you summary statistics which will help you understand the distribution of continuous (quantitative) variables. Start by adding sum all by itself to your do file and running it by pressing Ctrl-d or clicking the “play” button in the top right of your Stata window, then take a look at the output:
This gives basic summary statistics for all the variables in your data set. Note that there is nothing for make: it is a string variable so summary statistics don’t make sense. Also note that for rep78 the number of observations is 69 rather than 74. That’s because five missing values were ignored and the summary statistics were calculated over the remaining 69 values of rep78. Most statistical commands take a similar approach to missing values and that’s usually what you want, so you rarely have to include special handling for missing values in statistical commands.
On the other hand, rep78 is a categorical variable, so these summary statistics don’t make a lot of sense for it.
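If you want to confirm the missing values of rep78 for yourself, one quick check (a sketch, not the only way to do it) is to count the observations where it is missing:

```stata
// count observations where rep78 is missing;
// missing() is true for Stata's numeric missing values
count if missing(rep78)
```

The result should agree with the difference between 74 total observations and the 69 that sum used for rep78.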
All the syntax elements you learned earlier also work with statistical commands. To get summary statistics for just mpg, give sum a variable list:
sum mpg
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
mpg | 74 21.2973 5.785503 12 41
If you want summary statistics for just the foreign cars, add an if condition:
sum mpg if foreign
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
mpg | 22 24.77273 6.611187 14 41
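The condition if foreign works because foreign is an indicator variable and Stata treats any nonzero value as true. A more explicit version, shown here as a sketch, spells out the comparison and also covers the domestic cars:

```stata
// same subset as "if foreign", written explicitly
sum mpg if foreign==1

// the domestic cars: foreign is 0 for them
sum mpg if foreign==0
```

The explicit form is safer in general: a missing value is nonzero, so a bare if foreign would also count observations where foreign is missing. That does not matter here because foreign has no missing values in this data set.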
If you want summary statistics of mpg for both foreign and domestic cars, calculated separately, use by:
by foreign: sum mpg
-------------------------------------------------------------------------------
-> foreign = Domestic
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
mpg | 52 19.82692 4.743297 12 34
-------------------------------------------------------------------------------
-> foreign = Foreign
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
mpg | 22 24.77273 6.611187 14 41
This is one way to compare the two groups–we’ll learn another soon.
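Keep in mind that by requires the data to be sorted by the grouping variable. The auto data set is already sorted by foreign, which is why the command above runs as is. For data that might not be sorted, a common alternative is bysort, sketched here:

```stata
// bysort sorts by foreign first, then runs sum mpg within each group
bysort foreign: sum mpg
```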
The detail (d) option will give more information. Try:
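For example, assuming mpg is again the variable of interest (any continuous variable would do), the command could look like this:

```stata
// detail adds percentiles, the smallest and largest values,
// variance, skewness, and kurtosis to the basic statistics
sum mpg, detail
```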
Exercise
Find the mean price of cars that get more than 25 miles per gallon. Now compare that with the mean price of cars that get 25 miles per gallon or less. Does this mean American consumers in 1978 considered high gas mileage a negative characteristic and were willing to pay more to avoid it?
Solution
sum price if mpg>25
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 14 4400.286 786.7804 3299 6486
sum price if mpg<=25
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 60 6577.083 3117.013 3291 15906
Cars with low gas mileage cost a lot more, on average. But maybe there’s a confounding variable involved? Consider weight:
sum weight if mpg>25
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
weight | 14 2122.857 386.0536 1760 3260
sum weight if mpg<=25
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
weight | 60 3228.667 692.283 1930 4840
Cars with low gas mileage were also a lot heavier. Perhaps American consumers in 1978 were just willing to pay more for big cars, including the cost of burning more gas.
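Because sum accepts a variable list, the price and weight comparisons can also be run with one command per group. This is only a compact restatement of the solution above:

```stata
// price and weight for the high-mileage cars
sum price weight if mpg>25

// price and weight for everything else
// (mpg has no missing values in this data set; if it did, note that
//  Stata treats missing as larger than any number, so they would
//  satisfy mpg>25 unless you added & !missing(mpg))
sum price weight if mpg<=25
```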
6.1 Frequencies for Categorical Variables
tabulate (tab) will create tables of frequencies, which will help you understand the distribution of categorical variables. It can also be useful for string variables that describe categories or groups.
If you give tab a variable list with one variable it will give you a one-way table, while if you give it two variables it will give you a two-way table (i.e. crosstabs). To get an idea of what tab does, add the following to your do file and run it:
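A minimal sketch, assuming rep78 and foreign are the variables to tabulate (they are the ones discussed in the rest of this section):

```stata
// one-way table: frequencies of each rep78 value
tab rep78

// two-way table: rep78 in the rows, foreign in the columns
tab rep78 foreign
```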
The tab command has a rich set of useful options. The missing values of rep78 were not included in the table, which makes it easy to forget they’re there. Add them with the missing option:
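As a sketch, again assuming the rep78 tables are the ones of interest, the option goes after the comma:

```stata
// one-way table with a row for the missing values of rep78
tab rep78, missing

// the option works the same way for a two-way table
tab rep78 foreign, missing
```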
By default tab will show value labels, but you can override this with the nolabel option. A quick and easy way to find the values underneath the value labels is to run two tab commands, one without nolabel and one with it:
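For instance, assuming foreign is the labeled variable you want to inspect:

```stata
// shows the value labels (Domestic, Foreign)
tab foreign

// shows the underlying numeric values instead
tab foreign, nolabel
```

Comparing the two tables row by row tells you which number goes with which label.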
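Two-way tables can also include percentages, which you request with the row, column, and cell options. When you use them, note the key at the top of the output, which tells you which number is which.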
For this table, row answers the question “What percentage of cars with a rep78 of one are domestic?” while column answers “What percentage of domestic cars have a rep78 of one?”. cell answers “What percentage of all the cars are both domestic and have a rep78 of one?” Usually you are only interested in one of those questions and will only need one of the corresponding options.
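As a sketch, assuming the rep78-by-foreign table described above, all three kinds of percentages can be requested in a single command:

```stata
// row:    percentages within each rep78 row
// column: percentages within each foreign column
// cell:   percentages of the whole table
tab rep78 foreign, row column cell
```

In practice you would normally keep just the one option that answers your question, which also makes the table easier to read.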
tab has an option called sum which gives summary statistics for a given variable, calculated over the observations in each cell of the table. Try:
tab foreign, sum(mpg)
| Summary of Mileage (mpg)
Car origin | Mean Std. dev. Freq.
------------+------------------------------------
Domestic | 19.826923 4.7432972 52
Foreign | 24.772727 6.6111869 22
------------+------------------------------------
Total | 21.297297 5.7855032 74
This is another easy way to compare groups.
There’s also a chi2 option that runs a chi-squared test on a two-way table:
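A command along these lines would do it, assuming the same rep78-by-foreign table used in the discussion that follows:

```stata
// two-way table plus Pearson's chi-squared test of independence
tab rep78 foreign, chi2
```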
Here we’d reject the null hypothesis that rep78 and foreign are independent of each other, except that the expected number of observations in some of the cells is too low for a chi-squared test to be reliable.
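One way to see the problem with small expected counts, offered here as a sketch rather than a required step, is to ask tab to print the expected frequency in each cell alongside the observed one:

```stata
// expected adds the expected frequency under independence to each cell
tab rep78 foreign, chi2 expected
```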
Exercise
Use tab to compare the mean value of price associated with each rep78 category. What relationship do you see? How does that relationship change if you examine foreign and domestic cars separately? (Ignore for the moment the small sample sizes for many of the cells.)
Solution
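One way to start, using the sum option of tab introduced earlier:

```stata
// mean, standard deviation, and frequency of price for each rep78 value
tab rep78, sum(price)
```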
Well, the least reliable cars are the least expensive (ignoring the fact that there are only two of them) but beyond that there’s not much of a pattern. Now add foreign:
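A sketch of the two-way version, which summarizes price within each combination of rep78 and foreign:

```stata
// one cell per rep78-by-foreign combination, each with its own
// summary of price
tab rep78 foreign, sum(price)
```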
Now we see two distinct patterns. For the domestic cars, a rep78 of 3 has the highest mean price, and both 1 and 5 are cheaper. But for foreign cars, 3 is associated with the lowest mean price, and price consistently increases with rep78.
You shouldn’t take any of this very seriously–the data set is too small to support breaking it into this many groups so we can’t say anything reliable about them–but it does illustrate how things can change when you look at different subpopulations.
This do file does not make any changes to the data set it uses, so there’s no need to save a new version of it. But to finish your do file properly you should have it close its log:
log close
name: <unnamed>
log: /home/r/rdimond/kb/stata_intro/stats.log
log type: text
closed on: 26 Dec 2024, 15:02:22
-------------------------------------------------------------------------------
Now that you understand the basics of how statistical commands work in Stata, learning more of them will be easy.