When you create a table using the `table` command, Stata creates a corresponding collection. A collection is a structured set of results that can be used to make a table. The `collect` command with its many subcommands acts directly on the collection. You can also use `collect` to create a new collection, collect the results of Stata commands, and print out the resulting table into it without ever running the `table` command.

## 2.1 Collection Properties

A collection is defined by its dimensions. Consider a basic frequency table:

``````clear all
use https://sscc.wisc.edu/sscc/pubs/stata_tables/acs
table (edu)``````
``````

------------------------------------
|  Frequency
------------------------+-----------
Educational Attainment  |
Less Than High School |      2,488
High School Diploma   |      4,205
Some College          |      4,667
Bachelor's            |      2,492
Total                 |     15,158
------------------------------------``````

You can see the dimensions it contains by running `collect dims`:

``collect dims``
``````
Collection dimensions
Collection: Table
-----------------------------------------
Dimension   No. levels
-----------------------------------------
cmdset   1
command   1
edu   6
result   1
statcmd   1

Style only
border_block   4
cell_type   4
-----------------------------------------``````

Two of the dimensions will be familiar from the last chapter: `edu` and `result`. You can see the contents of a dimension with `collect levelsof`:

``collect levelsof edu``
``````
Collection: Table
Dimension: edu
Levels: .m 1 2 3 4 5``````

The `edu` dimension contains the levels of the `edu` variable plus `.m` for the total. The `result` dimension contains just `frequency` because that’s the only statistic this table calculates.

``collect levelsof result``
``````
Collection: Table
Dimension: result
Levels: frequency``````

Stata calls a combination of dimension and level a tag. To refer to a specific tag, you give the dimension name followed by the level in brackets, like `edu[1]` or `result[frequency]`.

The `edu` variable in the data set has value labels applied to it, so the `edu` dimension also has value labels. You can see them with:

``collect label list edu``
``````
Collection: Table
Dimension: edu
Label: Educational Attainment
Level labels:
.m  Total
1  Less Than High School
2  High School Diploma
3  Some College
4  Bachelor's

The `result` dimension also has a label that was applied automatically, just to make the resulting table look better:

``collect label list result``
``````
Collection: Table
Dimension: result
Label: Result
Level labels:
frequency  Frequency``````

You can see the structure of the table with `collect layout`. This will also print the table:

``collect layout``
``````
Collection: Table
Rows: edu
Columns: result
Table 1: 7 x 1

------------------------------------
|  Frequency
------------------------+-----------
Educational Attainment  |
Less Than High School |      2,488
High School Diploma   |      4,205
Some College          |      4,667
Bachelor's            |      2,492
Total                 |     15,158
------------------------------------``````

Exploring the other dimensions of the collection and their levels is a great way to learn more about how collections work.

### Exercise 1

Identify the levels and labels of the dimension `command`. Can you guess what it means? (Solution)

## 2.2 Creating Collections

When you run `table` it creates a collection called `Table` automatically. You can create a new collection from scratch with `collect create`. Create one called `table2` with:

``collect create table2``
``(current collection is table2)``

This also makes `table2` the current table, meaning `collect` commands now act on it. For example, `collect dims` now returns nothing because `table2` does not have any dimensions:

``collect dims``

You can use `collect set` to change the current table:

``````collect set Table
collect dims``````
``````
(current collection is Table)

Collection dimensions
Collection: Table
-----------------------------------------
Dimension   No. levels
-----------------------------------------
cmdset   1
command   1
edu   6
result   1
statcmd   1

Style only
border_block   4
cell_type   4
-----------------------------------------``````

Now you’re back to the original `Table`.

On the other hand, if you clear the current collection from memory, Stata will automatically create a new one called `default` as soon as you store some results. So unless you plan to work with more than one table at a time, the easy way to create a new table is to run:

``collect clear``

## 2.3 Putting Results in a Collection

If you put the `collect` prefix in front of a command, the results of the command will be added to the current collection. Similarly, the `collect get` command will put the results of the previous command in the collection.

Put the results of a basic regression in the `default` collection with:

``collect: reg income age i.female``
``````
Source |       SS           df       MS      Number of obs   =    15,158
-------------+----------------------------------   F(2, 15155)     =    667.49
Model |  2.0629e+12         2  1.0314e+12   Prob > F        =    0.0000
Residual |  2.3418e+13    15,155  1.5452e+09   R-squared       =    0.0810
Total |  2.5481e+13    15,157  1.6811e+09   Root MSE        =     39310

------------------------------------------------------------------------------
income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
age |   617.9673   25.37219    24.36   0.000     568.2348    667.6999
|
female |
Female  |  -17459.19   638.7791   -27.33   0.000    -18711.27    -16207.1
_cons |      17333   1110.948    15.60   0.000     15155.41    19510.59
------------------------------------------------------------------------------``````

Now take a look at the dimensions of the new collection:

``collect dims``
``````
Collection dimensions
Collection: default
-----------------------------------------
Dimension   No. levels
-----------------------------------------
cmdset   1
coleq   1
colname   4
colname_remainder   1
female   2
program_class   1
result   32
result_type   3
rowname   1

Style only
border_block   4
cell_type   4
-----------------------------------------``````

The two most important dimensions are `colname` and `result`. To see what they contain, start by listing the levels of `colname`:

``collect levelsof colname``
``````
Collection: default
Dimension: colname
Levels: age 0.female 1.female _cons``````

So `colname` lists the predictors in the regression model. Of course you’ll usually put them in rows. The dimension is called `colname` because it contains the column names from matrices of results produced by the regression.

The names of the levels in `result` are kind of cryptic, so list their labels:

``collect label list result``
``````
Collection: default
Dimension: result
Label: Result
Level labels:
F  F statistic
N  Number of observations
_r_b  Coefficient
_r_ci  __LEVEL__% CI
_r_df  df
_r_lb  __LEVEL__% lower bound
_r_p  p-value
_r_se  Std. error
_r_ub  __LEVEL__% upper bound
_r_z  t
_r_z_abs  |t|
beta  Standardized coefficient
cmd  Command
cmdline  Command line as typed
depvar  Dependent variable
df_m  Model DF
df_r  Residual DF
estat_cmd  Program used to implement estat
ll  Log likelihood
ll_0  Log likelihood, constant-only model
marginsok  Predictions allowed by margins
model  Model
mss  Model sum of squares
predict  Program used to implement predict
properties  Command properties
r2  R-squared
rank  Rank of VCE
rmse  RMSE
title  Title of output
vce  SE method``````

The `result` dimension contains everything there is to know about the model. Some of the results you probably care about a great deal, like the coefficients, and some are only of interest to Stata itself.

If you’re familiar with how Stata commands store results in the `e()` and `r()` vectors, some of these levels will be familiar. But a lot of them have an additional `_r_` in front of them, like `_r_b`. Think of the `_r_` as standing for “reported.” If you run the `logit` command with the `or` option, it will report odds ratios in in its results but store the coefficients in `e(b)`. However, `_r_b` will contain the odds ratios because that is what was reported. The `collect` command always collects whatever was reported.

Before you can print out this collection you need to specify a layout. This is done with the `collect layout` command followed by specifying the rows and columns just like with the `table` command. Begin by making `colname` (the predictors) rows and `result` columns:

``collect layout (colname) (result)``
``````
Collection: default
Rows: colname
Columns: result
Table 1: 4 x 10

------------------------------------------------------------------------------------------------------------------------------------------
| Coefficient        95% CI          df 95% lower bound p-value Std. error 95% upper bound      t   |t| Standardized coefficient
----------+-------------------------------------------------------------------------------------------------------------------------------
Age       |    617.9673  568.2348  667.6999 15155        568.2348   0.000   25.37219        667.6999  24.36 24.36                  .189671
Male      |           0                                                            0                                                     0
Female    |   -17459.19 -18711.27  -16207.1 15155       -18711.27   0.000   638.7791        -16207.1 -27.33 27.33                -.2128466
Intercept |       17333  15155.41  19510.59 15155        15155.41   0.000   1110.948        19510.59  15.60 15.60
------------------------------------------------------------------------------------------------------------------------------------------``````

This has more columns than you really want, but still doesn’t include most of the levels of `result`. What happened to the rest? To see them, make `result` the rows:

``collect layout (result)``
``````
Collection: default
Rows: result
Table 1: 23 x 1

-----------------------------------------------------------------
F statistic                         |                    667.4868
Number of observations              |                       15158
Standardized coefficient            |                     .189671
Command                             |                     regress
Command line as typed               | regress income age i.female
Dependent variable                  |                      income
Model DF                            |                           2
Residual DF                         |                       15155
Program used to implement estat     |               regress_estat
Log likelihood                      |                   -181866.7
Log likelihood, constant-only model |                   -182506.5
Predictions allowed by margins      |                  XB default
Model                               |                         ols
Model sum of squares                |                    2.06e+12
Program used to implement predict   |                    regres_p
Command properties                  |                         b V
R-squared                           |                    .0809567
Rank of VCE                         |                           3
RMSE                                |                    39309.62
Residual sum of squares             |                    2.34e+13
Title of output                     |           Linear regression
SE method                           |                         ols
-----------------------------------------------------------------``````

This gets us all the missing levels, but basics like coefficients and standard errors are not included. What’s going on?

The problem will be familiar to anyone who uses hierarchical data. Results like “Number of observations” have just one value, while results like coefficients have one value per predictor–or in collection terms, one value per `colname`. The regression coefficient for `age`, for example, has both the tag `colname[age]` and the tag `result[_r_b]`. A table with the layout `(colname) (result)` will only include elements that have both a `colname` tag and a `result` tag, while a table with the layout `(result)` will only include elements that only have a `result` tag.

So what do you do if you want to have predictors as rows plus a row for number of observations? Give number of observations a `colnames` tag with `collect addtags`:

``collect addtags colname[N], fortags(result[N])``
``(1 items changed in collection default)``

This adds the tag `colnames[N]` to the value already tagged with `result[N]`. Alternatively you could call it `Obs` or whatever you want to appear in the table. (You’ll learn how to apply a label to it in the next chapter.)

To better understand what this does, use the layout `(colname) (result[_r_b _r_se N]`. Recall that square brackets allow us to choose which levels of a dimension are included in the table.

``collect layout (colname) (result[_r_b _r_se N])``
``````
Collection: default
Rows: colname
Columns: result[_r_b _r_se N]
Table 1: 5 x 3

---------------------------------------------------------
| Coefficient Std. error Number of observations
----------+----------------------------------------------
Age       |    617.9673   25.37219
Male      |           0          0
Female    |   -17459.19   638.7791
N         |                                         15158
Intercept |       17333   1110.948
---------------------------------------------------------``````

Note how `colname[age]`, `colname[female]`, and `colname[_cons]` (the constant) have values tagged with `result[_r_b]` (“Coefficient”) and `result[_r_se]` (“Std. error”) but not `result[N]`. Likewise, `colname[N]` only has a value tagged with `result[N]`. This obviously doesn’t make for a very good table.

Also note that by default the intercept is always last. To make `colname[N]` the last row you need to specify the order you want the `colname` levels to appear in, again by putting them in brackets.

You can make a very nice table out of this collection by making `result` into rows and `colname` into super rows. This done almost like with the `table` command, but with `collect layout` you put a `#` between the super rows and rows, like specifying an interaction in a regression model. Don’t forget to specify the levels to include and the order to put them in:

``collect layout (colname[age female _cons N]#result[_r_b _r_se N])``
``````
Collection: default
Rows: colname[age female _cons N]#result[_r_b _r_se N]
Table 1: 14 x 1

------------------------------------
Age                      |
Coefficient            |  617.9673
Std. error             |  25.37219
Male                     |
Coefficient            |         0
Std. error             |         0
Female                   |
Coefficient            | -17459.19
Std. error             |  638.7791
Intercept                |
Coefficient            |     17333
Std. error             |  1110.948
N                        |
Number of observations |     15158
------------------------------------``````

Now the missing missing tags don’t matter. You’ll learn how to polish this table in the next chapter.

### Exercise 2

Add log likelihood to to the table. Also, the `colname` level `female` consists of `0.female` and `1.female` but `0.female` is not of interest, so only ask for `1.female`. (Solution)

### 2.3.1 Collect and By

Now consider running separate regressions for males and females. You can easily do that, and create a table with both results, by combining the `collect:` and `by:` prefixes.

``````collect clear
collect: bysort female: reg income age``````
``````

--------------------------------------------------------------------------------
-> female = Male

Source |       SS           df       MS      Number of obs   =     7,771
-------------+----------------------------------   F(1, 7769)      =    431.80
Model |  9.8638e+11         1  9.8638e+11   Prob > F        =    0.0000
Residual |  1.7747e+13     7,769  2.2844e+09   R-squared       =    0.0527
Total |  1.8734e+13     7,770  2.4110e+09   Root MSE        =     47795

------------------------------------------------------------------------------
income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
age |   899.0809   43.26707    20.78   0.000     814.2658     983.896
_cons |   6059.225   1817.915     3.33   0.001     2495.623    9622.828
------------------------------------------------------------------------------

--------------------------------------------------------------------------------
-> female = Female

Source |       SS           df       MS      Number of obs   =     7,387
-------------+----------------------------------   F(1, 7385)      =    170.53
Model |  1.2642e+11         1  1.2642e+11   Prob > F        =    0.0000
Residual |  5.4749e+12     7,385   741357969   R-squared       =    0.0226
Total |  5.6014e+12     7,386   758374186   Root MSE        =     27228

------------------------------------------------------------------------------
income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
age |   327.3002    25.0638    13.06   0.000      278.168    376.4324
_cons |   11559.91     1056.3    10.94   0.000     9489.265    13630.56
------------------------------------------------------------------------------``````

Now examine the dimensions of the resulting collection:

``collect dims``
``````
Collection dimensions
Collection: default
-----------------------------------------
Dimension   No. levels
-----------------------------------------
cmdset   2
coleq   1
colname   2
female   2
program_class   1
result   32
result_type   3
rowname   1

Style only
border_block   4
cell_type   4
-----------------------------------------``````

`colnames` has fewer levels just because we have fewer predictors, but note that `cmdset` now has two levels instead of one. Also note that `female` is still a dimension even though it’s no longer a predictor.

Now try making a table similar to the last one (without including the number of observations):

``collect layout (colname#result[_r_b _r_se])``
``````
Collection: default
Rows: colname#result[_r_b _r_se]

Your layout specification does not uniquely match any items. One or both of the
dimensions cmdset and female might help uniquely match items.``````

This fails because there are now two sets of results: one for each by group. It would make sense to put them in two columns, but we need a column dimension.

The error message helpfully suggests looking at `cmdset` and `female`. Start with the levels of `cmdset`:

``collect levelsof cmdset``
``````
Collection: default
Dimension: cmdset
Levels: 1 2``````

This does look like an identifier, but not a very enlightening one. Check the labels:

``collect label list cmdset``
``````
Collection: default
Dimension: cmdset
Label: Command results index``````

This confirms the purpose of `cmdset`: it does indeed identify (number) the commands run. You could apply labels to the values of `cmdset` to make useful column titles. But instead consider `female`:

``collect levelsof female``
``````
Collection: default
Dimension: female
Levels: 0 1``````
``collect label list female``
``````
Collection: default
Dimension: female
Label: Female
Level labels:
0  Male
1  Female``````

That will make much better column titles:

``collect layout (colname#result[_r_b _r_se]) (female)``
``````
Collection: default
Rows: colname#result[_r_b _r_se]
Columns: female
Table 1: 6 x 2

---------------------------------
|     Male   Female
--------------+------------------
Age           |
Coefficient | 899.0809 327.3002
Std. error  | 43.26707  25.0638
Intercept     |
Coefficient | 6059.225 11559.91
Std. error  | 1817.915   1056.3
---------------------------------``````

So how did this work? The `by:` prefix runs the command after it multiple times, once for each level of the by variable. In this case, the regression was run twice since `female` has two levels. `collect` collected the results of each command, and in addition to all the usual tags tagged them with `cmdset[1]` for the first command run and `cmdset[2]` for the second command run. However, it also understood that these were the result of a `by female:` prefix, so it also tagged the two sets of results with `female[0]` and `female[1]` respectively. Thus we can use either `cmdset` or `female` for columns, but `female` already has nice value labels applied.

### Exercise 3

Regress `income` on `i.edu`, with separate models for Hispanic and Not Hispanic. Just for variety, make separate tables for each model rather than two columns (recall that the third dimension is separate tables). These models will have very different sample sizes so include sample size in the table, but don’t worry about making it last. (Solution)

## 2.4 Collecting Results From Multiple Commands

Up to this point we’ve only used the `collect:` prefix to collect results once per table. But if we run it more than once, the new results will simply be added to the table. Thus we can collect as many results as we need from as many commands as we need.

Suppose you wanted to run a set of nested OLS regression models for income and then display the results in a single table, one model per column. You’ll need a dimension to keep track of the models and label their columns. You could use the `cmdset` dimension that will be created automatically, but it’s clearer if you create your own dimension. You can also set the levels to be what you want to appear in the column headers.

To do so, create a new collection, then run the first model with the `collect:` prefix. Start with `female` and `education` as predictors (both factor variables).

The `collect:` prefix takes a `tag()` option that lets you specify a tag that will be associated with all the results of the following command. Call the dimension `model`. This will be model one, but in the table we want that to show up as `(1)` so make that the name of the level.

``````collect clear
collect, tag(model[(1)]): reg income i.female i.edu``````
``````

Source |       SS           df       MS      Number of obs   =    15,158
-------------+----------------------------------   F(5, 15152)     =    656.63
Model |  4.5380e+12         5  9.0759e+11   Prob > F        =    0.0000
Residual |  2.0943e+13    15,152  1.3822e+09   R-squared       =    0.1781
Total |  2.5481e+13    15,157  1.6811e+09   Root MSE        =     37178

-------------------------------------------------------------------------------
income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
female |
Female  |  -17536.46   605.1548   -28.98   0.000    -18722.64   -16350.28
|
edu |
High Schoo..  |   7976.849   940.9349     8.48   0.000     6132.503    9821.194
Some College  |   14402.38   924.3338    15.58   0.000     12590.58    16214.19
Bachelor's  |   32385.91   1054.114    30.72   0.000     30319.72     34452.1
Advanced D..  |   53206.38   1270.507    41.88   0.000     50716.03    55696.73
|
_cons |   25628.12   790.4475    32.42   0.000     24078.75    27177.49
-------------------------------------------------------------------------------``````

For model two add `age` as a predictor, and collect the results with the tag `model[(2)]`:

``collect, tag(model[(2)]): reg income i.female i.edu age``
``````
Source |       SS           df       MS      Number of obs   =    15,158
-------------+----------------------------------   F(6, 15151)     =    643.63
Model |  5.1756e+12         6  8.6260e+11   Prob > F        =    0.0000
Residual |  2.0305e+13    15,151  1.3402e+09   R-squared       =    0.2031
Total |  2.5481e+13    15,157  1.6811e+09   Root MSE        =     36609

-------------------------------------------------------------------------------
income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
female |
Female  |  -17635.83   595.9085   -29.59   0.000    -18803.88   -16467.77
|
edu |
High Schoo..  |   7789.569   926.5708     8.41   0.000     5973.378    9605.759
Some College  |   14988.99   910.5812    16.46   0.000     13204.14    16773.84
Bachelor's  |   32429.34    1037.98    31.24   0.000     30394.78    34463.91
Advanced D..  |   50349.74   1257.894    40.03   0.000     47884.12    52815.36
|
age |   520.3155   23.85438    21.81   0.000      473.558    567.0729
_cons |   4894.683    1228.56     3.98   0.000     2486.557    7302.809
-------------------------------------------------------------------------------``````

For model three add age squared by replacing `age` with `c.age##c.age`:

``collect, tag(model[(3)]): reg income i.female i.edu c.age##c.age``
``````
Source |       SS           df       MS      Number of obs   =    15,158
-------------+----------------------------------   F(7, 15150)     =    582.65
Model |  5.4048e+12         7  7.7211e+11   Prob > F        =    0.0000
Residual |  2.0076e+13    15,150  1.3252e+09   R-squared       =    0.2121
Total |  2.5481e+13    15,157  1.6811e+09   Root MSE        =     36403

-------------------------------------------------------------------------------
income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
female |
Female  |  -17533.85   592.6066   -29.59   0.000    -18695.43   -16372.27
|
edu |
High Schoo..  |    6753.61   924.7195     7.30   0.000     4941.048    8566.171
Some College  |   13655.69    911.117    14.99   0.000     11869.79    15441.59
Bachelor's  |   30414.44    1043.45    29.15   0.000     28369.15    32459.73
Advanced D..  |   48301.99   1260.472    38.32   0.000     45831.31    50772.67
|
age |   2519.698   153.8792    16.37   0.000     2218.076     2821.32
|
c.age#c.age |  -24.46875   1.860689   -13.15   0.000    -28.11593   -20.82158
|
_cons |  -30905.68   2983.924   -10.36   0.000    -36754.54   -25056.83
-------------------------------------------------------------------------------``````

Before creating the table, look at the dimensions of of the collection and the levels of `model`:

``collect dims``
``````
Collection dimensions
Collection: default
-----------------------------------------
Dimension   No. levels
-----------------------------------------
cmdset   3
coleq   1
colname   10
colname_remainder   1
edu   5
female   2
model   3
program_class   1
result   32
result_type   3
rowname   1

Style only
border_block   4
cell_type   4
-----------------------------------------``````
``collect levelsof model``
``````
Collection: default
Dimension: model
Levels: (1) (2) (3)``````

Finally set the table layout, with same rows as we’ve used previously and `model` for columns:

``collect layout (colname#result[_r_b _r_se]) (model)``
``````
Collection: default
Rows: colname#result[_r_b _r_se]
Columns: model
Table 1: 30 x 3

-----------------------------------------------------
|       (1)       (2)       (3)
----------------------+------------------------------
Male                  |
Coefficient         |         0         0         0
Std. error          |         0         0         0
Female                |
Coefficient         | -17536.46 -17635.83 -17533.85
Std. error          |  605.1548  595.9085  592.6066
Less Than High School |
Coefficient         |         0         0         0
Std. error          |         0         0         0
High School Diploma   |
Coefficient         |  7976.849  7789.569   6753.61
Std. error          |  940.9349  926.5708  924.7195
Some College          |
Coefficient         |  14402.38  14988.99  13655.69
Std. error          |  924.3338  910.5812   911.117
Bachelor's            |
Coefficient         |  32385.91  32429.34  30414.44
Std. error          |  1054.114   1037.98   1043.45
Coefficient         |  53206.38  50349.74  48301.99
Std. error          |  1270.507  1257.894  1260.472
Age                   |
Coefficient         |            520.3155  2519.698
Std. error          |            23.85438  153.8792
Age # Age             |
Coefficient         |                     -24.46875
Std. error          |                      1.860689
Intercept             |
Coefficient         |  25628.12  4894.683 -30905.68
Std. error          |  790.4475   1228.56  2983.924
-----------------------------------------------------``````

### Exercise 4

Add another model (model four) that includes age cubed. But to help you decide if adding yet another term to the model is a good idea or not, add R-squared to the table so you can tell if adding age-cubed increased R-squared significantly. Don’t worry about putting R-squared last. (Solution)