2  Collections

When you create a table using the table command, Stata creates a corresponding collection. A collection is a structured set of results that can be used to make a table. The collect command with its many subcommands acts directly on the collection. You can also use collect to create a new collection, collect the results of Stata commands, and print out the resulting table into it without ever running the table command.

2.1 Collection Properties

A collection is defined by its dimensions. Consider a basic frequency table:

clear all
use https://sscc.wisc.edu/sscc/pubs/stata_tables/acs
table (edu)




------------------------------------
                        |  Frequency
------------------------+-----------
Educational Attainment  |           
  Less Than High School |      2,488
  High School Diploma   |      4,205
  Some College          |      4,667
  Bachelor's            |      2,492
  Advanced Degree       |      1,306
  Total                 |     15,158
------------------------------------

You can see the dimensions it contains by running collect dims:

collect dims

Collection dimensions
Collection: Table
-----------------------------------------
                   Dimension   No. levels
-----------------------------------------
Layout, style, header, label
                      cmdset   1         
                     command   1         
                         edu   6         
                      result   1         
                     statcmd   1         

Style only
                border_block   4         
                   cell_type   4         
-----------------------------------------

Two of the dimensions will be familiar from the last chapter: edu and result. You can see the contents of a dimension with collect levelsof:

collect levelsof edu

Collection: Table
 Dimension: edu
    Levels: .m 1 2 3 4 5

The edu dimension contains the levels of the edu variable plus .m for the total. The result dimension contains just frequency because that’s the only statistic this table calculates.

collect levelsof result

Collection: Table
 Dimension: result
    Levels: frequency

Stata calls a combination of dimension and level a tag. To refer to a specific tag, you give the dimension name followed by the level in brackets, like edu[1] or result[frequency].

The edu variable in the data set has value labels applied to it, so the edu dimension also has value labels. You can see them with:

collect label list edu

  Collection: Table
   Dimension: edu
       Label: Educational Attainment
Level labels:
          .m  Total
           1  Less Than High School
           2  High School Diploma
           3  Some College
           4  Bachelor's
           5  Advanced Degree

The result dimension also has a label that was applied automatically, just to make the resulting table look better:

collect label list result

  Collection: Table
   Dimension: result
       Label: Result
Level labels:
   frequency  Frequency

You can see the structure of the table with collect layout. This will also print the table:

collect layout

Collection: Table
      Rows: edu
   Columns: result
   Table 1: 7 x 1

------------------------------------
                        |  Frequency
------------------------+-----------
Educational Attainment  |           
  Less Than High School |      2,488
  High School Diploma   |      4,205
  Some College          |      4,667
  Bachelor's            |      2,492
  Advanced Degree       |      1,306
  Total                 |     15,158
------------------------------------

Exploring the other dimensions of the collection and their levels is a great way to learn more about how collections work.

Exercise 1

Identify the levels and labels of the dimension command. Can you guess what it means? (Solution)

2.2 Creating Collections

When you run table it creates a collection called Table automatically. You can create a new collection from scratch with collect create. Create one called table2 with:

collect create table2
(current collection is table2)

This also makes table2 the current table, meaning collect commands now act on it. For example, collect dims now returns nothing because table2 does not have any dimensions:

collect dims

You can use collect set to change the current table:

collect set Table
collect dims

(current collection is Table)


Collection dimensions
Collection: Table
-----------------------------------------
                   Dimension   No. levels
-----------------------------------------
Layout, style, header, label
                      cmdset   1         
                     command   1         
                         edu   6         
                      result   1         
                     statcmd   1         

Style only
                border_block   4         
                   cell_type   4         
-----------------------------------------

Now you’re back to the original Table.

On the other hand, if you clear the current collection from memory, Stata will automatically create a new one called default as soon as you store some results. So unless you plan to work with more than one table at a time, the easy way to create a new table is to run:

collect clear

2.3 Putting Results in a Collection

If you put the collect prefix in front of a command, the results of the command will be added to the current collection. Similarly, the collect get command will put the results of the previous command in the collection.

Put the results of a basic regression in the default collection with:

collect: reg income age i.female

      Source |       SS           df       MS      Number of obs   =    15,158
-------------+----------------------------------   F(2, 15155)     =    667.49
       Model |  2.0629e+12         2  1.0314e+12   Prob > F        =    0.0000
    Residual |  2.3418e+13    15,155  1.5452e+09   R-squared       =    0.0810
-------------+----------------------------------   Adj R-squared   =    0.0808
       Total |  2.5481e+13    15,157  1.6811e+09   Root MSE        =     39310

------------------------------------------------------------------------------
      income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   617.9673   25.37219    24.36   0.000     568.2348    667.6999
             |
      female |
     Female  |  -17459.19   638.7791   -27.33   0.000    -18711.27    -16207.1
       _cons |      17333   1110.948    15.60   0.000     15155.41    19510.59
------------------------------------------------------------------------------

Now take a look at the dimensions of the new collection:

collect dims

Collection dimensions
Collection: default
-----------------------------------------
                   Dimension   No. levels
-----------------------------------------
Layout, style, header, label
                      cmdset   1         
                       coleq   1         
                     colname   4         
           colname_remainder   1         
                      female   2         
               program_class   1         
                      result   32        
                 result_type   3         
                     rowname   1         

Style only
                border_block   4         
                   cell_type   4         
-----------------------------------------

The two most important dimensions are colname and result. To see what they contain, start by listing the levels of colname:

collect levelsof colname

Collection: default
 Dimension: colname
    Levels: age 0.female 1.female _cons

So colname lists the predictors in the regression model. Of course you’ll usually put them in rows. The dimension is called colname because it contains the column names from matrices of results produced by the regression.

The names of the levels in result are kind of cryptic, so list their labels:

collect label list result

  Collection: default
   Dimension: result
       Label: Result
Level labels:
           F  F statistic
           N  Number of observations
        _r_b  Coefficient
       _r_ci  __LEVEL__% CI
       _r_df  df
       _r_lb  __LEVEL__% lower bound
        _r_p  p-value
       _r_se  Std. error
       _r_ub  __LEVEL__% upper bound
        _r_z  t
    _r_z_abs  |t|
        beta  Standardized coefficient
         cmd  Command
     cmdline  Command line as typed
      depvar  Dependent variable
        df_m  Model DF
        df_r  Residual DF
   estat_cmd  Program used to implement estat
          ll  Log likelihood
        ll_0  Log likelihood, constant-only model
   marginsok  Predictions allowed by margins
       model  Model
         mss  Model sum of squares
     predict  Program used to implement predict
  properties  Command properties
          r2  R-squared
        r2_a  Adjusted R-squared
        rank  Rank of VCE
        rmse  RMSE
         rss  Residual sum of squares
       title  Title of output
         vce  SE method

The result dimension contains everything there is to know about the model. Some of the results you probably care about a great deal, like the coefficients, and some are only of interest to Stata itself.

If you’re familiar with how Stata commands store results in the e() and r() vectors, some of these levels will be familiar. But a lot of them have an additional _r_ in front of them, like _r_b. Think of the _r_ as standing for “reported.” If you run the logit command with the or option, it will report odds ratios in in its results but store the coefficients in e(b). However, _r_b will contain the odds ratios because that is what was reported. The collect command always collects whatever was reported.

Before you can print out this collection you need to specify a layout. This is done with the collect layout command followed by specifying the rows and columns just like with the table command. Begin by making colname (the predictors) rows and result columns:

collect layout (colname) (result)

Collection: default
      Rows: colname
   Columns: result
   Table 1: 4 x 10

------------------------------------------------------------------------------------------------------------------------------------------
          | Coefficient        95% CI          df 95% lower bound p-value Std. error 95% upper bound      t   |t| Standardized coefficient
----------+-------------------------------------------------------------------------------------------------------------------------------
Age       |    617.9673  568.2348  667.6999 15155        568.2348   0.000   25.37219        667.6999  24.36 24.36                  .189671
Male      |           0                                                            0                                                     0
Female    |   -17459.19 -18711.27  -16207.1 15155       -18711.27   0.000   638.7791        -16207.1 -27.33 27.33                -.2128466
Intercept |       17333  15155.41  19510.59 15155        15155.41   0.000   1110.948        19510.59  15.60 15.60                         
------------------------------------------------------------------------------------------------------------------------------------------

This has more columns than you really want, but still doesn’t include most of the levels of result. What happened to the rest? To see them, make result the rows:

collect layout (result)

Collection: default
      Rows: result
   Table 1: 23 x 1

-----------------------------------------------------------------
F statistic                         |                    667.4868
Number of observations              |                       15158
Standardized coefficient            |                     .189671
Command                             |                     regress
Command line as typed               | regress income age i.female
Dependent variable                  |                      income
Model DF                            |                           2
Residual DF                         |                       15155
Program used to implement estat     |               regress_estat
Log likelihood                      |                   -181866.7
Log likelihood, constant-only model |                   -182506.5
Predictions allowed by margins      |                  XB default
Model                               |                         ols
Model sum of squares                |                    2.06e+12
Program used to implement predict   |                    regres_p
Command properties                  |                         b V
R-squared                           |                    .0809567
Adjusted R-squared                  |                    .0808354
Rank of VCE                         |                           3
RMSE                                |                    39309.62
Residual sum of squares             |                    2.34e+13
Title of output                     |           Linear regression
SE method                           |                         ols
-----------------------------------------------------------------

This gets us all the missing levels, but basics like coefficients and standard errors are not included. What’s going on?

The problem will be familiar to anyone who uses hierarchical data. Results like “Number of observations” have just one value, while results like coefficients have one value per predictor–or in collection terms, one value per colname. The regression coefficient for age, for example, has both the tag colname[age] and the tag result[_r_b]. A table with the layout (colname) (result) will only include elements that have both a colname tag and a result tag, while a table with the layout (result) will only include elements that only have a result tag.

So what do you do if you want to have predictors as rows plus a row for number of observations? Give number of observations a colnames tag with collect addtags:

collect addtags colname[N], fortags(result[N])
(1 items changed in collection default)

This adds the tag colnames[N] to the value already tagged with result[N]. Alternatively you could call it Obs or whatever you want to appear in the table. (You’ll learn how to apply a label to it in the next chapter.)

To better understand what this does, use the layout (colname) (result[_r_b _r_se N]. Recall that square brackets allow us to choose which levels of a dimension are included in the table.

collect layout (colname) (result[_r_b _r_se N])

Collection: default
      Rows: colname
   Columns: result[_r_b _r_se N]
   Table 1: 5 x 3

---------------------------------------------------------
          | Coefficient Std. error Number of observations
----------+----------------------------------------------
Age       |    617.9673   25.37219                       
Male      |           0          0                       
Female    |   -17459.19   638.7791                       
N         |                                         15158
Intercept |       17333   1110.948                       
---------------------------------------------------------

Note how colname[age], colname[female], and colname[_cons] (the constant) have values tagged with result[_r_b] (“Coefficient”) and result[_r_se] (“Std. error”) but not result[N]. Likewise, colname[N] only has a value tagged with result[N]. This obviously doesn’t make for a very good table.

Also note that by default the intercept is always last. To make colname[N] the last row you need to specify the order you want the colname levels to appear in, again by putting them in brackets.

You can make a very nice table out of this collection by making result into rows and colname into super rows. This done almost like with the table command, but with collect layout you put a # between the super rows and rows, like specifying an interaction in a regression model. Don’t forget to specify the levels to include and the order to put them in:

collect layout (colname[age female _cons N]#result[_r_b _r_se N])

Collection: default
      Rows: colname[age female _cons N]#result[_r_b _r_se N]
   Table 1: 14 x 1

------------------------------------
Age                      |          
  Coefficient            |  617.9673
  Std. error             |  25.37219
Male                     |          
  Coefficient            |         0
  Std. error             |         0
Female                   |          
  Coefficient            | -17459.19
  Std. error             |  638.7791
Intercept                |          
  Coefficient            |     17333
  Std. error             |  1110.948
N                        |          
  Number of observations |     15158
------------------------------------

Now the missing missing tags don’t matter. You’ll learn how to polish this table in the next chapter.

Exercise 2

Add log likelihood to to the table. Also, the colname level female consists of 0.female and 1.female but 0.female is not of interest, so only ask for 1.female. (Solution)

2.3.1 Collect and By

Now consider running separate regressions for males and females. You can easily do that, and create a table with both results, by combining the collect: and by: prefixes.

collect clear
collect: bysort female: reg income age



--------------------------------------------------------------------------------
-> female = Male

      Source |       SS           df       MS      Number of obs   =     7,771
-------------+----------------------------------   F(1, 7769)      =    431.80
       Model |  9.8638e+11         1  9.8638e+11   Prob > F        =    0.0000
    Residual |  1.7747e+13     7,769  2.2844e+09   R-squared       =    0.0527
-------------+----------------------------------   Adj R-squared   =    0.0525
       Total |  1.8734e+13     7,770  2.4110e+09   Root MSE        =     47795

------------------------------------------------------------------------------
      income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   899.0809   43.26707    20.78   0.000     814.2658     983.896
       _cons |   6059.225   1817.915     3.33   0.001     2495.623    9622.828
------------------------------------------------------------------------------

--------------------------------------------------------------------------------
-> female = Female

      Source |       SS           df       MS      Number of obs   =     7,387
-------------+----------------------------------   F(1, 7385)      =    170.53
       Model |  1.2642e+11         1  1.2642e+11   Prob > F        =    0.0000
    Residual |  5.4749e+12     7,385   741357969   R-squared       =    0.0226
-------------+----------------------------------   Adj R-squared   =    0.0224
       Total |  5.6014e+12     7,386   758374186   Root MSE        =     27228

------------------------------------------------------------------------------
      income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   327.3002    25.0638    13.06   0.000      278.168    376.4324
       _cons |   11559.91     1056.3    10.94   0.000     9489.265    13630.56
------------------------------------------------------------------------------

Now examine the dimensions of the resulting collection:

collect dims

Collection dimensions
Collection: default
-----------------------------------------
                   Dimension   No. levels
-----------------------------------------
Layout, style, header, label
                      cmdset   2         
                       coleq   1         
                     colname   2         
                      female   2         
               program_class   1         
                      result   32        
                 result_type   3         
                     rowname   1         

Style only
                border_block   4         
                   cell_type   4         
-----------------------------------------

colnames has fewer levels just because we have fewer predictors, but note that cmdset now has two levels instead of one. Also note that female is still a dimension even though it’s no longer a predictor.

Now try making a table similar to the last one (without including the number of observations):

collect layout (colname#result[_r_b _r_se])

Collection: default
      Rows: colname#result[_r_b _r_se]

Your layout specification does not uniquely match any items. One or both of the
dimensions cmdset and female might help uniquely match items.

This fails because there are now two sets of results: one for each by group. It would make sense to put them in two columns, but we need a column dimension.

The error message helpfully suggests looking at cmdset and female. Start with the levels of cmdset:

collect levelsof cmdset

Collection: default
 Dimension: cmdset
    Levels: 1 2

This does look like an identifier, but not a very enlightening one. Check the labels:

collect label list cmdset

Collection: default
 Dimension: cmdset
     Label: Command results index

This confirms the purpose of cmdset: it does indeed identify (number) the commands run. You could apply labels to the values of cmdset to make useful column titles. But instead consider female:

collect levelsof female

Collection: default
 Dimension: female
    Levels: 0 1
collect label list female

  Collection: default
   Dimension: female
       Label: Female
Level labels:
           0  Male
           1  Female

That will make much better column titles:

collect layout (colname#result[_r_b _r_se]) (female)

Collection: default
      Rows: colname#result[_r_b _r_se]
   Columns: female
   Table 1: 6 x 2

---------------------------------
              |     Male   Female
--------------+------------------
Age           |                  
  Coefficient | 899.0809 327.3002
  Std. error  | 43.26707  25.0638
Intercept     |                  
  Coefficient | 6059.225 11559.91
  Std. error  | 1817.915   1056.3
---------------------------------

So how did this work? The by: prefix runs the command after it multiple times, once for each level of the by variable. In this case, the regression was run twice since female has two levels. collect collected the results of each command, and in addition to all the usual tags tagged them with cmdset[1] for the first command run and cmdset[2] for the second command run. However, it also understood that these were the result of a by female: prefix, so it also tagged the two sets of results with female[0] and female[1] respectively. Thus we can use either cmdset or female for columns, but female already has nice value labels applied.

Exercise 3

Regress income on i.edu, with separate models for Hispanic and Not Hispanic. Just for variety, make separate tables for each model rather than two columns (recall that the third dimension is separate tables). These models will have very different sample sizes so include sample size in the table, but don’t worry about making it last. (Solution)

2.4 Collecting Results From Multiple Commands

Up to this point we’ve only used the collect: prefix to collect results once per table. But if we run it more than once, the new results will simply be added to the table. Thus we can collect as many results as we need from as many commands as we need.

Suppose you wanted to run a set of nested OLS regression models for income and then display the results in a single table, one model per column. You’ll need a dimension to keep track of the models and label their columns. You could use the cmdset dimension that will be created automatically, but it’s clearer if you create your own dimension. You can also set the levels to be what you want to appear in the column headers.

To do so, create a new collection, then run the first model with the collect: prefix. Start with female and education as predictors (both factor variables).

The collect: prefix takes a tag() option that lets you specify a tag that will be associated with all the results of the following command. Call the dimension model. This will be model one, but in the table we want that to show up as (1) so make that the name of the level.

collect clear
collect, tag(model[(1)]): reg income i.female i.edu



      Source |       SS           df       MS      Number of obs   =    15,158
-------------+----------------------------------   F(5, 15152)     =    656.63
       Model |  4.5380e+12         5  9.0759e+11   Prob > F        =    0.0000
    Residual |  2.0943e+13    15,152  1.3822e+09   R-squared       =    0.1781
-------------+----------------------------------   Adj R-squared   =    0.1778
       Total |  2.5481e+13    15,157  1.6811e+09   Root MSE        =     37178

-------------------------------------------------------------------------------
       income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
       female |
      Female  |  -17536.46   605.1548   -28.98   0.000    -18722.64   -16350.28
              |
          edu |
High Schoo..  |   7976.849   940.9349     8.48   0.000     6132.503    9821.194
Some College  |   14402.38   924.3338    15.58   0.000     12590.58    16214.19
  Bachelor's  |   32385.91   1054.114    30.72   0.000     30319.72     34452.1
Advanced D..  |   53206.38   1270.507    41.88   0.000     50716.03    55696.73
              |
        _cons |   25628.12   790.4475    32.42   0.000     24078.75    27177.49
-------------------------------------------------------------------------------

For model two add age as a predictor, and collect the results with the tag model[(2)]:

collect, tag(model[(2)]): reg income i.female i.edu age

      Source |       SS           df       MS      Number of obs   =    15,158
-------------+----------------------------------   F(6, 15151)     =    643.63
       Model |  5.1756e+12         6  8.6260e+11   Prob > F        =    0.0000
    Residual |  2.0305e+13    15,151  1.3402e+09   R-squared       =    0.2031
-------------+----------------------------------   Adj R-squared   =    0.2028
       Total |  2.5481e+13    15,157  1.6811e+09   Root MSE        =     36609

-------------------------------------------------------------------------------
       income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
       female |
      Female  |  -17635.83   595.9085   -29.59   0.000    -18803.88   -16467.77
              |
          edu |
High Schoo..  |   7789.569   926.5708     8.41   0.000     5973.378    9605.759
Some College  |   14988.99   910.5812    16.46   0.000     13204.14    16773.84
  Bachelor's  |   32429.34    1037.98    31.24   0.000     30394.78    34463.91
Advanced D..  |   50349.74   1257.894    40.03   0.000     47884.12    52815.36
              |
          age |   520.3155   23.85438    21.81   0.000      473.558    567.0729
        _cons |   4894.683    1228.56     3.98   0.000     2486.557    7302.809
-------------------------------------------------------------------------------

For model three add age squared by replacing age with c.age##c.age:

collect, tag(model[(3)]): reg income i.female i.edu c.age##c.age

      Source |       SS           df       MS      Number of obs   =    15,158
-------------+----------------------------------   F(7, 15150)     =    582.65
       Model |  5.4048e+12         7  7.7211e+11   Prob > F        =    0.0000
    Residual |  2.0076e+13    15,150  1.3252e+09   R-squared       =    0.2121
-------------+----------------------------------   Adj R-squared   =    0.2117
       Total |  2.5481e+13    15,157  1.6811e+09   Root MSE        =     36403

-------------------------------------------------------------------------------
       income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
       female |
      Female  |  -17533.85   592.6066   -29.59   0.000    -18695.43   -16372.27
              |
          edu |
High Schoo..  |    6753.61   924.7195     7.30   0.000     4941.048    8566.171
Some College  |   13655.69    911.117    14.99   0.000     11869.79    15441.59
  Bachelor's  |   30414.44    1043.45    29.15   0.000     28369.15    32459.73
Advanced D..  |   48301.99   1260.472    38.32   0.000     45831.31    50772.67
              |
          age |   2519.698   153.8792    16.37   0.000     2218.076     2821.32
              |
  c.age#c.age |  -24.46875   1.860689   -13.15   0.000    -28.11593   -20.82158
              |
        _cons |  -30905.68   2983.924   -10.36   0.000    -36754.54   -25056.83
-------------------------------------------------------------------------------

Before creating the table, look at the dimensions of of the collection and the levels of model:

collect dims

Collection dimensions
Collection: default
-----------------------------------------
                   Dimension   No. levels
-----------------------------------------
Layout, style, header, label
                      cmdset   3         
                       coleq   1         
                     colname   10        
           colname_remainder   1         
                         edu   5         
                      female   2         
                       model   3         
               program_class   1         
                      result   32        
                 result_type   3         
                     rowname   1         

Style only
                border_block   4         
                   cell_type   4         
-----------------------------------------
collect levelsof model

Collection: default
 Dimension: model
    Levels: (1) (2) (3)

Finally set the table layout, with same rows as we’ve used previously and model for columns:

collect layout (colname#result[_r_b _r_se]) (model)

Collection: default
      Rows: colname#result[_r_b _r_se]
   Columns: model
   Table 1: 30 x 3

-----------------------------------------------------
                      |       (1)       (2)       (3)
----------------------+------------------------------
Male                  |                              
  Coefficient         |         0         0         0
  Std. error          |         0         0         0
Female                |                              
  Coefficient         | -17536.46 -17635.83 -17533.85
  Std. error          |  605.1548  595.9085  592.6066
Less Than High School |                              
  Coefficient         |         0         0         0
  Std. error          |         0         0         0
High School Diploma   |                              
  Coefficient         |  7976.849  7789.569   6753.61
  Std. error          |  940.9349  926.5708  924.7195
Some College          |                              
  Coefficient         |  14402.38  14988.99  13655.69
  Std. error          |  924.3338  910.5812   911.117
Bachelor's            |                              
  Coefficient         |  32385.91  32429.34  30414.44
  Std. error          |  1054.114   1037.98   1043.45
Advanced Degree       |                              
  Coefficient         |  53206.38  50349.74  48301.99
  Std. error          |  1270.507  1257.894  1260.472
Age                   |                              
  Coefficient         |            520.3155  2519.698
  Std. error          |            23.85438  153.8792
Age # Age             |                              
  Coefficient         |                     -24.46875
  Std. error          |                      1.860689
Intercept             |                              
  Coefficient         |  25628.12  4894.683 -30905.68
  Std. error          |  790.4475   1228.56  2983.924
-----------------------------------------------------

Exercise 4

Add another model (model four) that includes age cubed. But to help you decide if adding yet another term to the model is a good idea or not, add R-squared to the table so you can tell if adding age-cubed increased R-squared significantly. Don’t worry about putting R-squared last. (Solution)