When you create a table using the table command, Stata creates a corresponding collection. A collection is a structured set of results that can be used to make a table. The collect command with its many subcommands acts directly on the collection. You can also use collect to create a new collection, collect the results of Stata commands, and print out the resulting table into it without ever running the table command.
2.1 Collection Properties
A collection is defined by its dimensions. Consider a basic frequency table:
------------------------------------
| Frequency
------------------------+-----------
Educational Attainment |
Less Than High School | 2,488
High School Diploma | 4,205
Some College | 4,667
Bachelor's | 2,492
Advanced Degree | 1,306
Total | 15,158
------------------------------------
You can see the dimensions it contains by running collect dims:
collect dims
Collection dimensions
Collection: Table
-----------------------------------------
Dimension No. levels
-----------------------------------------
Layout, style, header, label
cmdset 1
command 1
edu 6
result 1
statcmd 1
Style only
border_block 4
cell_type 4
-----------------------------------------
Two of the dimensions will be familiar from the last chapter: edu and result. You can see the contents of a dimension with collect levelsof:
The edu dimension contains the levels of the edu variable plus .m for the total. The result dimension contains just frequency because that’s the only statistic this table calculates.
collect levelsof result
Collection: Table
Dimension: result
Levels: frequency
Stata calls a combination of dimension and level a tag. To refer to a specific tag, you give the dimension name followed by the level in brackets, like edu[1] or result[frequency].
The edu variable in the data set has value labels applied to it, so the edu dimension also has value labels. You can see them with:
collect labellist edu
Collection: Table
Dimension: edu
Label: Educational Attainment
Level labels:
.m Total
1 Less Than High School
2 High School Diploma
3 Some College
4 Bachelor's
5 Advanced Degree
The result dimension also has a label that was applied automatically, just to make the resulting table look better:
collect labellist result
Collection: Table
Dimension: result
Label: Result
Level labels:
frequency Frequency
You can see the structure of the table with collect layout. This will also print the table:
collect layout
Collection: Table
Rows: edu
Columns: result
Table 1: 7 x 1
------------------------------------
| Frequency
------------------------+-----------
Educational Attainment |
Less Than High School | 2,488
High School Diploma | 4,205
Some College | 4,667
Bachelor's | 2,492
Advanced Degree | 1,306
Total | 15,158
------------------------------------
Exploring the other dimensions of the collection and their levels is a great way to learn more about how collections work.
Exercise 1
Identify the levels and labels of the dimension command. Can you guess what it means? (Solution)
2.2 Creating Collections
When you run table it creates a collection called Table automatically. You can create a new collection from scratch with collect create. Create one called table2 with:
collect create table2
(current collection is table2)
This also makes table2 the current table, meaning collect commands now act on it. For example, collect dims now returns nothing because table2 does not have any dimensions:
collect dims
You can use collect set to change the current table:
collect set Tablecollect dims
(current collection is Table)
Collection dimensions
Collection: Table
-----------------------------------------
Dimension No. levels
-----------------------------------------
Layout, style, header, label
cmdset 1
command 1
edu 6
result 1
statcmd 1
Style only
border_block 4
cell_type 4
-----------------------------------------
Now you’re back to the original Table.
On the other hand, if you clear the current collection from memory, Stata will automatically create a new one called default as soon as you store some results. So unless you plan to work with more than one table at a time, the easy way to create a new table is to run:
collect clear
2.3 Putting Results in a Collection
If you put the collect prefix in front of a command, the results of the command will be added to the current collection. Similarly, the collect get command will put the results of the previous command in the collection.
Put the results of a basic regression in the default collection with:
collect: reg income age i.female
Source | SS df MS Number of obs = 15,158
-------------+---------------------------------- F(2, 15155) = 667.49
Model | 2.0629e+12 2 1.0314e+12 Prob > F = 0.0000
Residual | 2.3418e+13 15,155 1.5452e+09 R-squared = 0.0810
-------------+---------------------------------- Adj R-squared = 0.0808
Total | 2.5481e+13 15,157 1.6811e+09 Root MSE = 39310
------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | 617.9673 25.37219 24.36 0.000 568.2348 667.6999
|
female |
Female | -17459.19 638.7791 -27.33 0.000 -18711.27 -16207.1
_cons | 17333 1110.948 15.60 0.000 15155.41 19510.59
------------------------------------------------------------------------------
Now take a look at the dimensions of the new collection:
The two most important dimensions are colname and result. To see what they contain, start by listing the levels of colname:
collect levelsof colname
Collection: default
Dimension: colname
Levels: age 0.female 1.female _cons
So colname lists the predictors in the regression model. Of course you’ll usually put them in rows. The dimension is called colname because it contains the column names from matrices of results produced by the regression.
The names of the levels in result are kind of cryptic, so list their labels:
collect labellist result
Collection: default
Dimension: result
Label: Result
Level labels:
F F statistic
N Number of observations
_r_b Coefficient
_r_ci __LEVEL__% CI
_r_df df
_r_lb __LEVEL__% lower bound
_r_p p-value
_r_se Std. error
_r_ub __LEVEL__% upper bound
_r_z t
_r_z_abs |t|
beta Standardized coefficient
cmd Command
cmdline Command line as typed
depvar Dependent variable
df_m Model DF
df_r Residual DF
estat_cmd Program used to implement estat
ll Log likelihood
ll_0 Log likelihood, constant-only model
marginsok Predictions allowed by margins
model Model
mss Model sum of squares
predict Program used to implement predict
properties Command properties
r2 R-squared
r2_a Adjusted R-squared
rank Rank of VCE
rmse RMSE
rss Residual sum of squares
title Title of output
vce SE method
The result dimension contains everything there is to know about the model. Some of the results you probably care about a great deal, like the coefficients, and some are only of interest to Stata itself.
If you’re familiar with how Stata commands store results in the e() and r() vectors, some of these levels will be familiar. But a lot of them have an additional _r_ in front of them, like _r_b. Think of the _r_ as standing for “reported.” If you run the logit command with the or option, it will report odds ratios in in its results but store the coefficients in e(b). However, _r_b will contain the odds ratios because that is what was reported. The collect command always collects whatever was reported.
Before you can print out this collection you need to specify a layout. This is done with the collect layout command followed by specifying the rows and columns just like with the table command. Begin by making colname (the predictors) rows and result columns:
This has more columns than you really want, but still doesn’t include most of the levels of result. What happened to the rest? To see them, make result the rows:
collect layout (result)
Collection: default
Rows: result
Table 1: 23 x 1
-----------------------------------------------------------------
F statistic | 667.4868
Number of observations | 15158
Standardized coefficient | .189671
Command | regress
Command line as typed | regress income age i.female
Dependent variable | income
Model DF | 2
Residual DF | 15155
Program used to implement estat | regress_estat
Log likelihood | -181866.7
Log likelihood, constant-only model | -182506.5
Predictions allowed by margins | XB default
Model | ols
Model sum of squares | 2.06e+12
Program used to implement predict | regres_p
Command properties | b V
R-squared | .0809567
Adjusted R-squared | .0808354
Rank of VCE | 3
RMSE | 39309.62
Residual sum of squares | 2.34e+13
Title of output | Linear regression
SE method | ols
-----------------------------------------------------------------
This gets us all the missing levels, but basics like coefficients and standard errors are not included. What’s going on?
The problem will be familiar to anyone who uses hierarchical data. Results like “Number of observations” have just one value, while results like coefficients have one value per predictor–or in collection terms, one value per colname. The regression coefficient for age, for example, has both the tag colname[age] and the tag result[_r_b]. A table with the layout (colname) (result) will only include elements that have both a colname tag and a result tag, while a table with the layout (result) will only include elements that only have a result tag.
So what do you do if you want to have predictors as rows plus a row for number of observations? Give number of observations a colnames tag with collect addtags:
collect addtags colname[N], fortags(result[N])
(1 items changed in collection default)
This adds the tag colnames[N] to the value already tagged with result[N]. Alternatively you could call it Obs or whatever you want to appear in the table. (You’ll learn how to apply a label to it in the next chapter.)
To better understand what this does, use the layout (colname) (result[_r_b _r_se N]. Recall that square brackets allow us to choose which levels of a dimension are included in the table.
collect layout (colname) (result[_r_b _r_se N])
Collection: default
Rows: colname
Columns: result[_r_b _r_se N]
Table 1: 5 x 3
---------------------------------------------------------
| Coefficient Std. error Number of observations
----------+----------------------------------------------
Age | 617.9673 25.37219
Male | 0 0
Female | -17459.19 638.7791
N | 15158
Intercept | 17333 1110.948
---------------------------------------------------------
Note how colname[age], colname[female], and colname[_cons] (the constant) have values tagged with result[_r_b] (“Coefficient”) and result[_r_se] (“Std. error”) but not result[N]. Likewise, colname[N] only has a value tagged with result[N]. This obviously doesn’t make for a very good table.
Also note that by default the intercept is always last. To make colname[N] the last row you need to specify the order you want the colname levels to appear in, again by putting them in brackets.
You can make a very nice table out of this collection by making result into rows and colname into super rows. This done almost like with the table command, but with collect layout you put a # between the super rows and rows, like specifying an interaction in a regression model. Don’t forget to specify the levels to include and the order to put them in:
Collection: default
Rows: colname[age female _cons N]#result[_r_b _r_se N]
Table 1: 14 x 1
------------------------------------
Age |
Coefficient | 617.9673
Std. error | 25.37219
Male |
Coefficient | 0
Std. error | 0
Female |
Coefficient | -17459.19
Std. error | 638.7791
Intercept |
Coefficient | 17333
Std. error | 1110.948
N |
Number of observations | 15158
------------------------------------
Now the missing missing tags don’t matter. You’ll learn how to polish this table in the next chapter.
Exercise 2
Add log likelihood to to the table. Also, the colname level female consists of 0.female and 1.female but 0.female is not of interest, so only ask for 1.female. (Solution)
2.3.1 Collect and By
Now consider running separate regressions for males and females. You can easily do that, and create a table with both results, by combining the collect: and by: prefixes.
collect clearcollect: bysort female: reg income age
--------------------------------------------------------------------------------
-> female = Male
Source | SS df MS Number of obs = 7,771
-------------+---------------------------------- F(1, 7769) = 431.80
Model | 9.8638e+11 1 9.8638e+11 Prob > F = 0.0000
Residual | 1.7747e+13 7,769 2.2844e+09 R-squared = 0.0527
-------------+---------------------------------- Adj R-squared = 0.0525
Total | 1.8734e+13 7,770 2.4110e+09 Root MSE = 47795
------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | 899.0809 43.26707 20.78 0.000 814.2658 983.896
_cons | 6059.225 1817.915 3.33 0.001 2495.623 9622.828
------------------------------------------------------------------------------
--------------------------------------------------------------------------------
-> female = Female
Source | SS df MS Number of obs = 7,387
-------------+---------------------------------- F(1, 7385) = 170.53
Model | 1.2642e+11 1 1.2642e+11 Prob > F = 0.0000
Residual | 5.4749e+12 7,385 741357969 R-squared = 0.0226
-------------+---------------------------------- Adj R-squared = 0.0224
Total | 5.6014e+12 7,386 758374186 Root MSE = 27228
------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | 327.3002 25.0638 13.06 0.000 278.168 376.4324
_cons | 11559.91 1056.3 10.94 0.000 9489.265 13630.56
------------------------------------------------------------------------------
Now examine the dimensions of the resulting collection:
colnames has fewer levels just because we have fewer predictors, but note that cmdset now has two levels instead of one. Also note that female is still a dimension even though it’s no longer a predictor.
Now try making a table similar to the last one (without including the number of observations):
collect layout (colname#result[_r_b _r_se])
Collection: default
Rows: colname#result[_r_b _r_se]
Your layout specification does not uniquely match any items. One or both of the
dimensions cmdset and female might help uniquely match items.
This fails because there are now two sets of results: one for each by group. It would make sense to put them in two columns, but we need a column dimension.
The error message helpfully suggests looking at cmdset and female. Start with the levels of cmdset:
collect levelsof cmdset
Collection: default
Dimension: cmdset
Levels: 1 2
This does look like an identifier, but not a very enlightening one. Check the labels:
collect labellist cmdset
Collection: default
Dimension: cmdset
Label: Command results index
This confirms the purpose of cmdset: it does indeed identify (number) the commands run. You could apply labels to the values of cmdset to make useful column titles. But instead consider female:
So how did this work? The by: prefix runs the command after it multiple times, once for each level of the by variable. In this case, the regression was run twice since female has two levels. collect collected the results of each command, and in addition to all the usual tags tagged them with cmdset[1] for the first command run and cmdset[2] for the second command run. However, it also understood that these were the result of a by female: prefix, so it also tagged the two sets of results with female[0] and female[1] respectively. Thus we can use either cmdset or female for columns, but female already has nice value labels applied.
Exercise 3
Regress income on i.edu, with separate models for Hispanic and Not Hispanic. Just for variety, make separate tables for each model rather than two columns (recall that the third dimension is separate tables). These models will have very different sample sizes so include sample size in the table, but don’t worry about making it last. (Solution)
2.4 Collecting Results From Multiple Commands
Up to this point we’ve only used the collect: prefix to collect results once per table. But if we run it more than once, the new results will simply be added to the table. Thus we can collect as many results as we need from as many commands as we need.
Suppose you wanted to run a set of nested OLS regression models for income and then display the results in a single table, one model per column. You’ll need a dimension to keep track of the models and label their columns. You could use the cmdset dimension that will be created automatically, but it’s clearer if you create your own dimension. You can also set the levels to be what you want to appear in the column headers.
To do so, create a new collection, then run the first model with the collect: prefix. Start with female and education as predictors (both factor variables).
The collect: prefix takes a tag() option that lets you specify a tag that will be associated with all the results of the following command. Call the dimension model. This will be model one, but in the table we want that to show up as (1) so make that the name of the level.
collect clearcollect, tag(model[(1)]): reg income i.female i.edu
Source | SS df MS Number of obs = 15,158
-------------+---------------------------------- F(5, 15152) = 656.63
Model | 4.5380e+12 5 9.0759e+11 Prob > F = 0.0000
Residual | 2.0943e+13 15,152 1.3822e+09 R-squared = 0.1781
-------------+---------------------------------- Adj R-squared = 0.1778
Total | 2.5481e+13 15,157 1.6811e+09 Root MSE = 37178
-------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------+----------------------------------------------------------------
female |
Female | -17536.46 605.1548 -28.98 0.000 -18722.64 -16350.28
|
edu |
High Schoo.. | 7976.849 940.9349 8.48 0.000 6132.503 9821.194
Some College | 14402.38 924.3338 15.58 0.000 12590.58 16214.19
Bachelor's | 32385.91 1054.114 30.72 0.000 30319.72 34452.1
Advanced D.. | 53206.38 1270.507 41.88 0.000 50716.03 55696.73
|
_cons | 25628.12 790.4475 32.42 0.000 24078.75 27177.49
-------------------------------------------------------------------------------
For model two add age as a predictor, and collect the results with the tag model[(2)]:
collect, tag(model[(2)]): reg income i.female i.edu age
Source | SS df MS Number of obs = 15,158
-------------+---------------------------------- F(6, 15151) = 643.63
Model | 5.1756e+12 6 8.6260e+11 Prob > F = 0.0000
Residual | 2.0305e+13 15,151 1.3402e+09 R-squared = 0.2031
-------------+---------------------------------- Adj R-squared = 0.2028
Total | 2.5481e+13 15,157 1.6811e+09 Root MSE = 36609
-------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------+----------------------------------------------------------------
female |
Female | -17635.83 595.9085 -29.59 0.000 -18803.88 -16467.77
|
edu |
High Schoo.. | 7789.569 926.5708 8.41 0.000 5973.378 9605.759
Some College | 14988.99 910.5812 16.46 0.000 13204.14 16773.84
Bachelor's | 32429.34 1037.98 31.24 0.000 30394.78 34463.91
Advanced D.. | 50349.74 1257.894 40.03 0.000 47884.12 52815.36
|
age | 520.3155 23.85438 21.81 0.000 473.558 567.0729
_cons | 4894.683 1228.56 3.98 0.000 2486.557 7302.809
-------------------------------------------------------------------------------
For model three add age squared by replacing age with c.age##c.age:
collect, tag(model[(3)]): reg income i.female i.edu c.age##c.age
Source | SS df MS Number of obs = 15,158
-------------+---------------------------------- F(7, 15150) = 582.65
Model | 5.4048e+12 7 7.7211e+11 Prob > F = 0.0000
Residual | 2.0076e+13 15,150 1.3252e+09 R-squared = 0.2121
-------------+---------------------------------- Adj R-squared = 0.2117
Total | 2.5481e+13 15,157 1.6811e+09 Root MSE = 36403
-------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------+----------------------------------------------------------------
female |
Female | -17533.85 592.6066 -29.59 0.000 -18695.43 -16372.27
|
edu |
High Schoo.. | 6753.61 924.7195 7.30 0.000 4941.048 8566.171
Some College | 13655.69 911.117 14.99 0.000 11869.79 15441.59
Bachelor's | 30414.44 1043.45 29.15 0.000 28369.15 32459.73
Advanced D.. | 48301.99 1260.472 38.32 0.000 45831.31 50772.67
|
age | 2519.698 153.8792 16.37 0.000 2218.076 2821.32
|
c.age#c.age | -24.46875 1.860689 -13.15 0.000 -28.11593 -20.82158
|
_cons | -30905.68 2983.924 -10.36 0.000 -36754.54 -25056.83
-------------------------------------------------------------------------------
Before creating the table, look at the dimensions of of the collection and the levels of model:
Add another model (model four) that includes age cubed. But to help you decide if adding yet another term to the model is a good idea or not, add R-squared to the table so you can tell if adding age-cubed increased R-squared significantly. Don’t worry about putting R-squared last. (Solution)