clear all
use https://sscc.wisc.edu/sscc/pubs/real_world_tables/table1b
2 Table 1 with Hierarchical Data
This is another Table 1, but the complication this time is that it’s for hierarchical data. The data set consists of children who can have multiple placements into foster homes, and we want summary statistics both by child and by placement.
2.1 Setting up
Load the data with:
Take a moment to look it over and identify the variable names.
2.2 tag
and by
The egen
function tag()
takes an identifier variable as its argument, and then creates a new variable which is 1 for the first observation for that identifier, and zero for all the others. (Note that this has nothing to do with the tags used by collect
.)
This data set has a variable id
that identifies all the placements for a given child, so create a tag for each child with:
egen child = tag(id)
The classic use of a tag like this is in an if
condition. For example, compare:
tab sex
Sex | Freq. Percent Cum.
------------+-----------------------------------
Male | 1,717 58.88 58.88
Female | 1,199 41.12 100.00
------------+-----------------------------------
Total | 2,916 100.00
tab sex if child
Sex | Freq. Percent Cum.
------------+-----------------------------------
Male | 516 51.60 51.60
Female | 484 48.40 100.00
------------+-----------------------------------
Total | 1,000 100.00
The first table gives the frequencies of sex
for all the placements. The second gives the frequencies of sex
for all the children. The percentages are different because in this (made up) data set boys tend to have more placements.
Now create a table of descriptive statistics, adding the by(child)
option:
by(child) dtable i.race i.sex,
-----------------------------------------------------
tag(id)
0 1 Total
-----------------------------------------------------
N 1,916 (65.7%) 1,000 (34.3%) 2,916 (100.0%)
Race
White 675 (35.2%) 340 (34.0%) 1,015 (34.8%)
Black 612 (31.9%) 329 (32.9%) 941 (32.3%)
Hispanic 629 (32.8%) 331 (33.1%) 960 (32.9%)
Sex
Male 1,201 (62.7%) 516 (51.6%) 1,717 (58.9%)
Female 715 (37.3%) 484 (48.4%) 1,199 (41.1%)
-----------------------------------------------------
This gives three sets of descriptive statistics: one for the observations where child
is zero, one for the observations where child
is one, and one (Total) for all observations.
The group where child
is zero contains all the observations other than the first observation for each child. You’re not interested in the descriptive statistics for this group.
The group where child
is one contains one observation per child and thus gives child-level descriptive statistics. (Note how it matches the second set of frequencies calculated earlier.) You want these descriptive statistics.
The descriptive statistics for all observations gives placement-level descriptive statistics. You want these descriptive statistics too.
Take a look at the layout of this collection:
collect layout
Collection: DTable
Rows: var
Columns: child#result
Table 1: 8 x 3
-----------------------------------------------------
tag(id)
0 1 Total
-----------------------------------------------------
N 1,916 (65.7%) 1,000 (34.3%) 2,916 (100.0%)
Race
White 675 (35.2%) 340 (34.0%) 1,015 (34.8%)
Black 612 (31.9%) 329 (32.9%) 941 (32.3%)
Hispanic 629 (32.8%) 331 (33.1%) 960 (32.9%)
Sex
Male 1,201 (62.7%) 516 (51.6%) 1,717 (58.9%)
Female 715 (37.3%) 484 (48.4%) 1,199 (41.1%)
-----------------------------------------------------
You have one column per level of child
per level of result
. Get a list of the levels of child
with:
levelsof child collect
Collection: DTable
Dimension: child
Levels: 0 1 .m
.m
is the Total column. You’re interested in that and the 1
column, so specify that only those levels should be included. Also specify that the number of observations (_N
) should come at the end rather than the beginning by listing the levels of var
in the order you want:
var[i.race i.sex _N]) (child[1 .m]#result) collect layout (
Collection: DTable
Rows: var[i.race i.sex _N]
Columns: child[1 .m]#result
Table 1: 8 x 2
---------------------------------------
tag(id)
1 Total
---------------------------------------
Race
White 340 (34.0%) 1,015 (34.8%)
Black 329 (32.9%) 941 (32.3%)
Hispanic 331 (33.1%) 960 (32.9%)
Sex
Male 516 (51.6%) 1,717 (58.9%)
Female 484 (48.4%) 1,199 (41.1%)
N 1,000 (34.3%) 2,916 (100.0%)
---------------------------------------
Next change the labels for the child
dimension:
label levels child 1 "Child" .m "Placement", modify collect
The label tag(id)
is the variable label egen
created for the child
variable. dtable
used it as the title for the child
header. You don’t really need anything there, so hide that title:
title(hide) collect style header child,
Finally, change the label for _N
from N
to Sample Size
. (This is purely a matter of taste.)
label levels var _N "Sample Size", modify
collect collect preview
----------------------------------------
Child Placement
----------------------------------------
Race
White 340 (34.0%) 1,015 (34.8%)
Black 329 (32.9%) 941 (32.3%)
Hispanic 331 (33.1%) 960 (32.9%)
Sex
Male 516 (51.6%) 1,717 (58.9%)
Female 484 (48.4%) 1,199 (41.1%)
Sample Size 1,000 (34.3%) 2,916 (100.0%)
----------------------------------------
2.3 collect unget
Better, but the percentages on Sample Size
don’t make much sense.
As of Stata 19 you can remove items from the result
dimension with collect unget
. Just specify the result to be removed and for which tags:
percent, fortags(var[_N])
collect unget collect preview
(3 items removed from collection DTable)
-------------------------------------
Child Placement
-------------------------------------
Race
White 340 (34.0%) 1,015 (34.8%)
Black 329 (32.9%) 941 (32.3%)
Hispanic 331 (33.1%) 960 (32.9%)
Sex
Male 516 (51.6%) 1,717 (58.9%)
Female 484 (48.4%) 1,199 (41.1%)
Sample Size 1,000 2,916
-------------------------------------