1 Tables

The table command allows you to create simple tables very easily, and fairly complex ones with more effort. Learning to use table is a great introduction to the concepts used by collect, but even if you never go on to collect you’ll find learning table worthwhile.

Start up Stata, open a new do file, and have it open an extract from the 2000 ACS data set that’s hosted on the SSCC web site. It contains people of working age with positie incomes, since we’ll run models of income later.

clear all
use https://sscc.wisc.edu/sscc/pubs/stata_tables/acs

1.1 Table Structure

The starting point for any table is its structure. Almost every table will contain rows and/or columns. The simplest tables associate variables with the rows or columns, so you get a row or column for each level of that variable. (These will generally be categorical variables, or your table is likely to be unwieldy.)

To create a table with rows defined by a variable, use table followed by the variable name in parentheses:

table (marital_status)


----------------------------
                |  Frequency
----------------+-----------
Marital Status  |           
  Now married   |      8,730
  Widowed       |        356
  Divorced      |      1,852
  Separated     |        377
  Never married |      3,843
  Total         |     15,158
----------------------------

By default the table cells will contain frequencies, but table can do much more.

To add columns, add the name of the variable that defines the columns, again in parentheses:

table (marital_status) (female)


------------------------------------------
                |           Female        
                |   Male   Female    Total
----------------+-------------------------
Marital Status  |                         
  Now married   |  4,670    4,060    8,730
  Widowed       |     79      277      356
  Divorced      |    766    1,086    1,852
  Separated     |    144      233      377
  Never married |  2,112    1,731    3,843
  Total         |  7,771    7,387   15,158
------------------------------------------

To create a table with multiple columns but just one row, use empty parentheses for the row:

table () (female)


------------------------------------
          |           Female        
          |   Male   Female    Total
----------+-------------------------
Frequency |  7,771    7,387   15,158
------------------------------------

Exercise 1

Make a table containing frequencies (crosstabs) for hispanic and race. Which variable works best as the row variable? (Solution)

1.2 Dimensions and Levels

You can generalize this concept by thinking of the rows and columns of the table as dimensions. So far we have specified tables with one or two dimensions; what happens if we add more?

table (marital_status) (female) (hispanic)


Person is Hispanic = Not Hispanic
------------------------------------------
                |           Female        
                |   Male   Female    Total
----------------+-------------------------
Marital Status  |                         
  Now married   |  4,102    3,660    7,762
  Widowed       |     71      255      326
  Divorced      |    704      997    1,701
  Separated     |    117      190      307
  Never married |  1,794    1,511    3,305
  Total         |  6,788    6,613   13,401
------------------------------------------

Person is Hispanic = Hispanic
----------------------------------------
                |          Female       
                |  Male   Female   Total
----------------+-----------------------
Marital Status  |                       
  Now married   |   568      400     968
  Widowed       |     8       22      30
  Divorced      |    62       89     151
  Separated     |    27       43      70
  Never married |   318      220     538
  Total         |   983      774   1,757
----------------------------------------

Person is Hispanic = Total
------------------------------------------
                |           Female        
                |   Male   Female    Total
----------------+-------------------------
Marital Status  |                         
  Now married   |  4,670    4,060    8,730
  Widowed       |     79      277      356
  Divorced      |    766    1,086    1,852
  Separated     |    144      233      377
  Never married |  2,112    1,731    3,843
  Total         |  7,771    7,387   15,158
------------------------------------------

The third dimension is used to create separate tables. (Sorry, no four dimensional tables!)

Each dimension has a set of levels associated with it. Take the table:

table (marital_status) (female)


------------------------------------------
                |           Female        
                |   Male   Female    Total
----------------+-------------------------
Marital Status  |                         
  Now married   |  4,670    4,060    8,730
  Widowed       |     79      277      356
  Divorced      |    766    1,086    1,852
  Separated     |    144      233      377
  Never married |  2,112    1,731    3,843
  Total         |  7,771    7,387   15,158
------------------------------------------

This has the dimensions marital_status and female. (It also has other dimensions that you’ll learn about later.) The layout of the table specifies that marital_status is the rows and female the columns.

The marital_status dimension has one level for each level of the variable marital_status, plus one for the totals. The level names come from the values of marital_status rather than the value labels, so they are 1, 2, 3, 4, and 5, plus .m for the totals. (You’ll learn later how to find this out for yourself.)

You can refer to a specific level of a specific dimension with the syntax dimension[level], for example marital_status[1]. You can also list multiple levels in the brackets. One use for this syntax is to specify that you only want to include certain levels in the table. For example, if you only want to include “Now married” and “Never married” in the table, you can do so with:

table (marital_status[1 5]) (female)


-----------------------------------------
                |          Female        
                |   Male   Female   Total
----------------+------------------------
Marital Status  |                        
  Now married   |  4,670    4,060   8,730
  Never married |  2,112    1,731   3,843
-----------------------------------------

The totals for marital_status are omitted because they’re just another level and not in the list of levels you asked for.

If you just want to omit the totals, you can do so with the nototal option:

table (marital_status) (female), nototal


---------------------------------
                |      Female    
                |   Male   Female
----------------+----------------
Marital Status  |                
  Now married   |  4,670    4,060
  Widowed       |     79      277
  Divorced      |    766    1,086
  Separated     |    144      233
  Never married |  2,112    1,731
---------------------------------

Alternatively, you can use the total() option to specify you want totals for one dimension and then they will be omitted for the other:

table (marital_status) (female), total(female)


---------------------------------
                |      Female    
                |   Male   Female
----------------+----------------
Marital Status  |                
  Now married   |  4,670    4,060
  Widowed       |     79      277
  Divorced      |    766    1,086
  Separated     |    144      233
  Never married |  2,112    1,731
  Total         |  7,771    7,387
---------------------------------

Note how totals calculated across female add a level to the marital_status dimension.

Exercise 2

Create a crosstab of race and hispanic again, but this time only include White, Black, Asian, and Other. Include a total row but not a total column. (Solution)

1.3 Super Rows and Super Columns

Super rows and super columns are rows and columns that contain other rows and columns. You specify them by listing two variables for a dimension. For example:

table (marital_status) (hispanic female), nototal


-------------------------------------------------
                |        Person is Hispanic      
                |   Not Hispanic       Hispanic  
                |      Female           Female   
                |   Male   Female   Male   Female
----------------+--------------------------------
Marital Status  |                                
  Now married   |  4,102    3,660    568      400
  Widowed       |     71      255      8       22
  Divorced      |    704      997     62       89
  Separated     |    117      190     27       43
  Never married |  1,794    1,511    318      220
-------------------------------------------------

The first variable listed becomes “super.” Changing the order gives the same results, but organized differently:

table (marital_status) (female hispanic), nototal


--------------------------------------------------------------------
                |                        Female                     
                |            Male                     Female        
                |     Person is Hispanic        Person is Hispanic  
                |  Not Hispanic   Hispanic   Not Hispanic   Hispanic
----------------+---------------------------------------------------
Marital Status  |                                                   
  Now married   |         4,102        568          3,660        400
  Widowed       |            71          8            255         22
  Divorced      |           704         62            997         89
  Separated     |           117         27            190         43
  Never married |         1,794        318          1,511        220
--------------------------------------------------------------------

Choose the organization that puts the numbers your readers are most likely to compare close to each other.

Tables with super rows can get long quickly:

table (marital_status hispanic) (female), nototal


----------------------------------------
                       |      Female    
                       |   Male   Female
-----------------------+----------------
Marital Status         |                
  Now married          |                
    Person is Hispanic |                
      Not Hispanic     |  4,102    3,660
      Hispanic         |    568      400
  Widowed              |                
    Person is Hispanic |                
      Not Hispanic     |     71      255
      Hispanic         |      8       22
  Divorced             |                
    Person is Hispanic |                
      Not Hispanic     |    704      997
      Hispanic         |     62       89
  Separated            |                
    Person is Hispanic |                
      Not Hispanic     |    117      190
      Hispanic         |     27       43
  Never married        |                
    Person is Hispanic |                
      Not Hispanic     |  1,794    1,511
      Hispanic         |    318      220
----------------------------------------

1.4 Adding Statistics

By default the table command puts frequencies in the table cells, i.e. the number of observations that have that particular combination of values. You can change that with the stat() option, including calculating multiple statistics. Start with a table that has female in the column dimension (we’ll add rows soon):

table () (female)


------------------------------------
          |           Female        
          |   Male   Female    Total
----------+-------------------------
Frequency |  7,771    7,387   15,158
------------------------------------

This is equivalent to using the stat(frequency) option:

table () (female), stat(frequency)


------------------------------------
          |           Female        
          |   Male   Female    Total
----------+-------------------------
Frequency |  7,771    7,387   15,158
------------------------------------

Now add percentages by adding stat(percent):

table () (female), stat(frequency) stat(percent)


------------------------------------
          |           Female        
          |   Male   Female    Total
----------+-------------------------
Frequency |  7,771    7,387   15,158
Percent   |  51.27    48.73   100.00
------------------------------------

While the frequency and percent statistics are based on female itself, you can also include statistics for other variables. To do so, use the stat() option and specify a statistic like before, but then include the variable or variables the statistic is to be calculated for in the parentheses as well. For example, you can add the means of age and income with stat(mean age income) and their standard deviations with stat(sd mean income). This command is getting long, so put one stat() option per line for readability:

table () (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income)


----------------------------------------------------
                   |              Female            
                   |      Male     Female      Total
-------------------+--------------------------------
Frequency          |     7,771      7,387     15,158
Percent            |     51.27      48.73     100.00
Mean               |                                
  Age              |  40.10398   40.20441   40.15292
  Income           |  42115.94   24718.83   33637.75
Standard deviation |                                
  Age              |   12.5318   12.64044   12.58455
  Income           |     49102   27538.59   41001.73
----------------------------------------------------

(Yes, we’d rather have the mean and standard deviation of each variable together. We’ll learn how soon.)

You can add frequencies for another variable with the fvfreq statistic, or “factor variable frequency.” This is distinguishes it from frequency (or just freq) which refers to the variable defining the dimension.

table () (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status)


----------------------------------------------------------------
                               |              Female            
                               |      Male     Female      Total
-------------------------------+--------------------------------
Frequency                      |     7,771      7,387     15,158
Percent                        |     51.27      48.73     100.00
Mean                           |                                
  Age                          |  40.10398   40.20441   40.15292
  Income                       |  42115.94   24718.83   33637.75
Standard deviation             |                                
  Age                          |   12.5318   12.64044   12.58455
  Income                       |     49102   27538.59   41001.73
Factor-variable frequency      |                                
  Marital Status=Now married   |     4,670      4,060      8,730
  Marital Status=Widowed       |        79        277        356
  Marital Status=Divorced      |       766      1,086      1,852
  Marital Status=Separated     |       144        233        377
  Marital Status=Never married |     2,112      1,731      3,843
----------------------------------------------------------------

Now this is starting to look like “Table 1” from a paper!

Note that while we left the row dimension blank, our table not only has multiple rows, it has rows and super rows. What are they, and where did they come from? When you use the stat() option, table automatically creates a new dimension called result, with one level for each result that’s calculated. If your stat() option includes multiple variables, an additional dimension is created called var that contains the variables. For fvfreq this includes indicators for each level of the variable (1.marital_status, 2.marital_status, etc.).

table then assumes that you want the result dimension and the var dimension in your table and finds a place for them: in this case the empty row dimension. It also made result the super row. But you can take control by specifying where to put them. What we did before is equivalent to:

table (result var) (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status)


----------------------------------------------------------------
                               |              Female            
                               |      Male     Female      Total
-------------------------------+--------------------------------
Frequency                      |     7,771      7,387     15,158
Percent                        |     51.27      48.73     100.00
Mean                           |                                
  Age                          |  40.10398   40.20441   40.15292
  Income                       |  42115.94   24718.83   33637.75
Standard deviation             |                                
  Age                          |   12.5318   12.64044   12.58455
  Income                       |     49102   27538.59   41001.73
Factor-variable frequency      |                                
  Marital Status=Now married   |     4,670      4,060      8,730
  Marital Status=Widowed       |        79        277        356
  Marital Status=Divorced      |       766      1,086      1,852
  Marital Status=Separated     |       144        233        377
  Marital Status=Never married |     2,112      1,731      3,843
----------------------------------------------------------------

Reversing the order of result and var will make a super row for each variable and put their mean and standard deviation together–almost certainly what you want:

table (var result) (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status)


--------------------------------------------------------------
                             |              Female            
                             |      Male     Female      Total
-----------------------------+--------------------------------
Frequency                    |     7,771      7,387     15,158
Percent                      |     51.27      48.73     100.00
Age                          |                                
  Mean                       |  40.10398   40.20441   40.15292
  Standard deviation         |   12.5318   12.64044   12.58455
Income                       |                                
  Mean                       |  42115.94   24718.83   33637.75
  Standard deviation         |     49102   27538.59   41001.73
Marital Status=Now married   |                                
  Factor-variable frequency  |     4,670      4,060      8,730
Marital Status=Widowed       |                                
  Factor-variable frequency  |        79        277        356
Marital Status=Divorced      |                                
  Factor-variable frequency  |       766      1,086      1,852
Marital Status=Separated     |                                
  Factor-variable frequency  |       144        233        377
Marital Status=Never married |                                
  Factor-variable frequency  |     2,112      1,731      3,843
--------------------------------------------------------------

On the other hand it makes a real mess out of fvfreq. There are settings you can apply that will fix it, but many such settings have been collected into a style called table-1. It’s designed for the tables of summary statistics that make up the “Table 1” of so many papers. Apply it with the style() option:

table (var result) (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status) ///
    style(table-1)


-------------------------------------------------
                |              Female            
                |      Male     Female      Total
----------------+--------------------------------
                |     7,771      7,387     15,158
                |     51.27      48.73     100.00
                |                                
            Age |  40.10398   40.20441   40.15292
                |   12.5318   12.64044   12.58455
                |                                
         Income |  42115.94   24718.83   33637.75
                |     49102   27538.59   41001.73
                |                                
 Marital Status |                                
  Now married   |     4,670      4,060      8,730
      Widowed   |        79        277        356
     Divorced   |       766      1,086      1,852
    Separated   |       144        233        377
Never married   |     2,112      1,731      3,843
-------------------------------------------------

Better, but it still needs some polishing. You’ll learn how to change the appearance of a table in the Appearance chapter.

If you want to use the above table in a Word document, you can save it in Word format by exporting the collection that the table command created automatically. We’ll have lots more to say about collections in the next chapter.

collect export table1.docx, replace

(collection Table exported to file table1.docx)

You can export tables to many other formats as well, including Excel, HTML, PDF, and LaTeX.

Exercise 3

Create a similar table with hispanic for the columns, and summary statistics for age, income, and edu. For income and age, report the 10th, 25th, 50th, 75th, and 90th percentiles rather than the mean and standard deviation, but don’t worry about labeling them. (Hint: in the stat() option, p# requests the #th percentile.) Report overall frequencies but not percentages. (Solution)