clear all
use https://sscc.wisc.edu/sscc/pubs/stata_tables/acs
1 Tables
The table command allows you to create simple tables very easily, and fairly complex ones with more effort. Learning to use table is a great introduction to the concepts used by collect, but even if you never go on to collect you’ll find learning table worthwhile.
Start up Stata, open a new do file, and have it open an extract from the 2000 ACS data set that’s hosted on the SSCC web site. It contains people of working age with positive incomes, since we’ll run models of income later.
1.1 Table Structure
The starting point for any table is its structure. Almost every table will contain rows and/or columns. The simplest tables associate variables with the rows or columns, so you get a row or column for each level of that variable. (These will generally be categorical variables, or your table is likely to be unwieldy.)
To create a table with rows defined by a variable, use table followed by the variable name in parentheses:
table (marital_status)
----------------------------
| Frequency
----------------+-----------
Marital Status |
Now married | 8,730
Widowed | 356
Divorced | 1,852
Separated | 377
Never married | 3,843
Total | 15,158
----------------------------
By default the table cells will contain frequencies, but table can do much more.
To add columns, add the name of the variable that defines the columns, again in parentheses:
table (marital_status) (female)
------------------------------------------
| Female
| Male Female Total
----------------+-------------------------
Marital Status |
Now married | 4,670 4,060 8,730
Widowed | 79 277 356
Divorced | 766 1,086 1,852
Separated | 144 233 377
Never married | 2,112 1,731 3,843
Total | 7,771 7,387 15,158
------------------------------------------
To create a table with multiple columns but just one row, use empty parentheses for the row:
table () (female)
------------------------------------
| Female
| Male Female Total
----------+-------------------------
Frequency | 7,771 7,387 15,158
------------------------------------
Exercise 1
Make a table containing frequencies (crosstabs) for hispanic and race. Which variable works best as the row variable? (Solution)
1.2 Dimensions and Levels
You can generalize this concept by thinking of the rows and columns of the table as dimensions. So far we have specified tables with one or two dimensions; what happens if we add more?
table (marital_status) (female) (hispanic)
Person is Hispanic = Not Hispanic
------------------------------------------
| Female
| Male Female Total
----------------+-------------------------
Marital Status |
Now married | 4,102 3,660 7,762
Widowed | 71 255 326
Divorced | 704 997 1,701
Separated | 117 190 307
Never married | 1,794 1,511 3,305
Total | 6,788 6,613 13,401
------------------------------------------
Person is Hispanic = Hispanic
----------------------------------------
| Female
| Male Female Total
----------------+-----------------------
Marital Status |
Now married | 568 400 968
Widowed | 8 22 30
Divorced | 62 89 151
Separated | 27 43 70
Never married | 318 220 538
Total | 983 774 1,757
----------------------------------------
Person is Hispanic = Total
------------------------------------------
| Female
| Male Female Total
----------------+-------------------------
Marital Status |
Now married | 4,670 4,060 8,730
Widowed | 79 277 356
Divorced | 766 1,086 1,852
Separated | 144 233 377
Never married | 2,112 1,731 3,843
Total | 7,771 7,387 15,158
------------------------------------------
The third dimension is used to create separate tables. (Sorry, no four dimensional tables!)
Each dimension has a set of levels associated with it. Take the table:
table (marital_status) (female)
------------------------------------------
| Female
| Male Female Total
----------------+-------------------------
Marital Status |
Now married | 4,670 4,060 8,730
Widowed | 79 277 356
Divorced | 766 1,086 1,852
Separated | 144 233 377
Never married | 2,112 1,731 3,843
Total | 7,771 7,387 15,158
------------------------------------------
This has the dimensions marital_status and female. (It also has other dimensions that you’ll learn about later.) The layout of the table specifies that marital_status defines the rows and female the columns.
The marital_status dimension has one level for each level of the variable marital_status, plus one for the totals. The level names come from the values of marital_status rather than the value labels, so they are 1, 2, 3, 4, and 5, plus .m for the totals. (You’ll learn later how to find this out for yourself.)
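As a preview, one way to check these level names is to ask Stata to list them right after running table. This is a minimal sketch, assuming Stata 17 or later; collect levelsof and collect label list belong to the collect system covered in later chapters and are not otherwise used in this section.

```stata
clear all
use https://sscc.wisc.edu/sscc/pubs/stata_tables/acs

* Build the table so that its collection exists
table (marital_status) (female)

* List the level names of the marital_status dimension
collect levelsof marital_status

* Show the label attached to each of those levels
collect label list marital_status, all
```

The level names are what go inside the brackets of the dimension[level] syntax, while the labels are what appear in the printed table.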
You can refer to a specific level of a specific dimension with the syntax dimension[level], for example marital_status[1]. You can also list multiple levels in the brackets. One use for this syntax is to specify that you only want to include certain levels in the table. For example, if you only want to include “Now married” and “Never married” in the table, you can do so with:
table (marital_status[1 5]) (female)
-----------------------------------------
| Female
| Male Female Total
----------------+------------------------
Marital Status |
Now married | 4,670 4,060 8,730
Never married | 2,112 1,731 3,843
-----------------------------------------
The totals for marital_status are omitted because they’re just another level and not in the list of levels you asked for.
If you just want to omit the totals, you can do so with the nototal option:
table (marital_status) (female), nototal
---------------------------------
| Female
| Male Female
----------------+----------------
Marital Status |
Now married | 4,670 4,060
Widowed | 79 277
Divorced | 766 1,086
Separated | 144 233
Never married | 2,112 1,731
---------------------------------
Alternatively, you can use the total() option to specify you want totals for one dimension and then they will be omitted for the other:
table (marital_status) (female), total(female)
---------------------------------
| Female
| Male Female
----------------+----------------
Marital Status |
Now married | 4,670 4,060
Widowed | 79 277
Divorced | 766 1,086
Separated | 144 233
Never married | 2,112 1,731
Total | 7,771 7,387
---------------------------------
Note how totals calculated across female add a level to the marital_status dimension.
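The mirror-image case may help make this concrete. In the sketch below, naming marital_status in total() instead should keep the total column and drop the total row, by symmetry with the total(female) example; the output is not reproduced here.

```stata
clear all
use https://sscc.wisc.edu/sscc/pubs/stata_tables/acs

* Request totals for marital_status only: this adds a Total level
* to the female (column) dimension and omits the total row
table (marital_status) (female), total(marital_status)
```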
Exercise 2
Create a crosstab of race and hispanic again, but this time only include White, Black, Asian, and Other. Include a total row but not a total column. (Solution)
1.3 Super Rows and Super Columns
Super rows and super columns are rows and columns that contain other rows and columns. You specify them by listing two variables for a dimension. For example:
table (marital_status) (hispanic female), nototal
-------------------------------------------------
| Person is Hispanic
| Not Hispanic Hispanic
| Female Female
| Male Female Male Female
----------------+--------------------------------
Marital Status |
Now married | 4,102 3,660 568 400
Widowed | 71 255 8 22
Divorced | 704 997 62 89
Separated | 117 190 27 43
Never married | 1,794 1,511 318 220
-------------------------------------------------
The first variable listed becomes “super.” Changing the order gives the same results, but organized differently:
table (marital_status) (female hispanic), nototal
--------------------------------------------------------------------
| Female
| Male Female
| Person is Hispanic Person is Hispanic
| Not Hispanic Hispanic Not Hispanic Hispanic
----------------+---------------------------------------------------
Marital Status |
Now married | 4,102 568 3,660 400
Widowed | 71 8 255 22
Divorced | 704 62 997 89
Separated | 117 27 190 43
Never married | 1,794 318 1,511 220
--------------------------------------------------------------------
Choose the organization that puts the numbers your readers are most likely to compare close to each other.
Tables with super rows can get long quickly:
table (marital_status hispanic) (female), nototal
----------------------------------------
| Female
| Male Female
-----------------------+----------------
Marital Status |
Now married |
Person is Hispanic |
Not Hispanic | 4,102 3,660
Hispanic | 568 400
Widowed |
Person is Hispanic |
Not Hispanic | 71 255
Hispanic | 8 22
Divorced |
Person is Hispanic |
Not Hispanic | 704 997
Hispanic | 62 89
Separated |
Person is Hispanic |
Not Hispanic | 117 190
Hispanic | 27 43
Never married |
Person is Hispanic |
Not Hispanic | 1,794 1,511
Hispanic | 318 220
----------------------------------------
1.4 Adding Statistics
By default the table command puts frequencies in the table cells, i.e. the number of observations that have that particular combination of values. You can change that with the stat() option, including calculating multiple statistics. Start with a table that has female in the column dimension (we’ll add rows soon):
table () (female)
------------------------------------
| Female
| Male Female Total
----------+-------------------------
Frequency | 7,771 7,387 15,158
------------------------------------
This is equivalent to using the stat(frequency) option:
table () (female), stat(frequency)
------------------------------------
| Female
| Male Female Total
----------+-------------------------
Frequency | 7,771 7,387 15,158
------------------------------------
Now add percentages by adding stat(percent):
table () (female), stat(frequency) stat(percent)
------------------------------------
| Female
| Male Female Total
----------+-------------------------
Frequency | 7,771 7,387 15,158
Percent | 51.27 48.73 100.00
------------------------------------
While the frequency and percent statistics are based on female itself, you can also include statistics for other variables. To do so, use the stat() option and specify a statistic like before, but then include the variable or variables the statistic is to be calculated for in the parentheses as well. For example, you can add the means of age and income with stat(mean age income) and their standard deviations with stat(sd age income). This command is getting long, so put one stat() option per line for readability:
table () (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income)
----------------------------------------------------
| Female
| Male Female Total
-------------------+--------------------------------
Frequency | 7,771 7,387 15,158
Percent | 51.27 48.73 100.00
Mean |
Age | 40.10398 40.20441 40.15292
Income | 42115.94 24718.83 33637.75
Standard deviation |
Age | 12.5318 12.64044 12.58455
Income | 49102 27538.59 41001.73
----------------------------------------------------
(Yes, we’d rather have the mean and standard deviation of each variable together. We’ll learn how soon.)
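The long decimals on the means and standard deviations can also be trimmed at this stage. The following is a minimal sketch, assuming the nformat() option of table in Stata 17 or later; the format %9.2f is only an illustrative choice, and the Appearance chapter covers formatting properly.

```stata
clear all
use https://sscc.wisc.edu/sscc/pubs/stata_tables/acs

* Same statistics as above, but display the mean and sd results
* with two decimal places; frequency and percent are left alone
table () (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    nformat(%9.2f mean sd)
```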
You can add frequencies for another variable with the fvfreq statistic, or “factor variable frequency.” The name distinguishes it from frequency (or just freq), which refers to the variable defining the dimension.
table () (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status)
----------------------------------------------------------------
| Female
| Male Female Total
-------------------------------+--------------------------------
Frequency | 7,771 7,387 15,158
Percent | 51.27 48.73 100.00
Mean |
Age | 40.10398 40.20441 40.15292
Income | 42115.94 24718.83 33637.75
Standard deviation |
Age | 12.5318 12.64044 12.58455
Income | 49102 27538.59 41001.73
Factor-variable frequency |
Marital Status=Now married | 4,670 4,060 8,730
Marital Status=Widowed | 79 277 356
Marital Status=Divorced | 766 1,086 1,852
Marital Status=Separated | 144 233 377
Marital Status=Never married | 2,112 1,731 3,843
----------------------------------------------------------------
Now this is starting to look like “Table 1” from a paper!
Note that while we left the row dimension blank, our table not only has multiple rows, it has rows and super rows. What are they, and where did they come from? When you use the stat() option, table automatically creates a new dimension called result, with one level for each result that’s calculated. If your stat() option includes multiple variables, an additional dimension is created called var that contains the variables. For fvfreq this includes indicators for each level of the variable (1.marital_status, 2.marital_status, etc.).
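You can confirm that these dimensions exist by inspecting the collection that table leaves behind. This is a hedged sketch using collect dims and collect levelsof, both of which belong to the collect system discussed in the next chapter; a shorter table command is used here to keep the listing small.

```stata
clear all
use https://sscc.wisc.edu/sscc/pubs/stata_tables/acs

table () (female), ///
    stat(mean age income) ///
    stat(sd age income)

* List every dimension in the collection that table just created
collect dims

* List the levels of the two automatically created dimensions
collect levelsof result
collect levelsof var
```

The listing from collect dims will likely include more dimensions than result, var, and female; only those three matter for the layout discussed here.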
The table command then assumes that you want the result dimension and the var dimension in your table and finds a place for them: in this case the empty row dimension. It also made result the super row. But you can take control by specifying where to put them. What we did before is equivalent to:
table (result var) (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status)
----------------------------------------------------------------
| Female
| Male Female Total
-------------------------------+--------------------------------
Frequency | 7,771 7,387 15,158
Percent | 51.27 48.73 100.00
Mean |
Age | 40.10398 40.20441 40.15292
Income | 42115.94 24718.83 33637.75
Standard deviation |
Age | 12.5318 12.64044 12.58455
Income | 49102 27538.59 41001.73
Factor-variable frequency |
Marital Status=Now married | 4,670 4,060 8,730
Marital Status=Widowed | 79 277 356
Marital Status=Divorced | 766 1,086 1,852
Marital Status=Separated | 144 233 377
Marital Status=Never married | 2,112 1,731 3,843
----------------------------------------------------------------
Reversing the order of result and var will make a super row for each variable and put their mean and standard deviation together, which is almost certainly what you want:
table (var result) (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status)
--------------------------------------------------------------
| Female
| Male Female Total
-----------------------------+--------------------------------
Frequency | 7,771 7,387 15,158
Percent | 51.27 48.73 100.00
Age |
Mean | 40.10398 40.20441 40.15292
Standard deviation | 12.5318 12.64044 12.58455
Income |
Mean | 42115.94 24718.83 33637.75
Standard deviation | 49102 27538.59 41001.73
Marital Status=Now married |
Factor-variable frequency | 4,670 4,060 8,730
Marital Status=Widowed |
Factor-variable frequency | 79 277 356
Marital Status=Divorced |
Factor-variable frequency | 766 1,086 1,852
Marital Status=Separated |
Factor-variable frequency | 144 233 377
Marital Status=Never married |
Factor-variable frequency | 2,112 1,731 3,843
--------------------------------------------------------------
On the other hand it makes a real mess out of fvfreq. There are settings you can apply that will fix it, but many such settings have been collected into a style called table-1. It’s designed for the tables of summary statistics that make up the “Table 1” of so many papers. Apply it with the style() option:
table (var result) (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status) ///
    style(table-1)
-------------------------------------------------
| Female
| Male Female Total
----------------+--------------------------------
| 7,771 7,387 15,158
| 51.27 48.73 100.00
|
Age | 40.10398 40.20441 40.15292
| 12.5318 12.64044 12.58455
|
Income | 42115.94 24718.83 33637.75
| 49102 27538.59 41001.73
|
Marital Status |
Now married | 4,670 4,060 8,730
Widowed | 79 277 356
Divorced | 766 1,086 1,852
Separated | 144 233 377
Never married | 2,112 1,731 3,843
-------------------------------------------------
Better, but it still needs some polishing. You’ll learn how to change the appearance of a table in the Appearance chapter.
If you want to use the above table in a Word document, you can save it in Word format by exporting the collection that the table command created automatically. We’ll have lots more to say about collections in the next chapter.
collect export table1.docx, replace
(collection Table exported to file table1.docx)
You can export tables to many other formats as well, including Excel, HTML, PDF, and LaTeX.
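As a rough illustration, collect export chooses the output format from the file extension. The sketch below assumes the same table1 file stem as above (the names are otherwise arbitrary) and uses a shorter table so that it runs on its own.

```stata
clear all
use https://sscc.wisc.edu/sscc/pubs/stata_tables/acs

table () (female), stat(mean age income) stat(sd age income)

* The file extension determines the output format
collect export table1.xlsx, replace
collect export table1.html, replace
collect export table1.tex, replace
```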
Exercise 3
Create a similar table with hispanic for the columns, and summary statistics for age, income, and edu. For income and age, report the 10th, 25th, 50th, 75th, and 90th percentiles rather than the mean and standard deviation, but don’t worry about labeling them. (Hint: in the stat() option, p# requests the #th percentile.) Report overall frequencies but not percentages. (Solution)