clear all
use https://sscc.wisc.edu/sscc/pubs/stata_tables/acs
4 Exercise Solutions
4.1 Tables
Exercise 1
Make a table containing frequencies (crosstabs) for hispanic
and race
. Which works best as the row variable and which as the column variable?
table (race) (hispanic)
-------------------------------------------------------------
| Person is Hispanic
| Not Hispanic Hispanic Total
--------------------------+----------------------------------
Race |
White | 10,854 822 11,676
Black | 1,661 30 1,691
American Indian | 86 3 89
Alaska Native | 5 5
Indigenous, Unspecified | 23 5 28
Asian | 503 1 504
Pacific Islander | 23 1 24
Other | 28 794 822
Two or more races | 218 101 319
Total | 13,401 1,757 15,158
-------------------------------------------------------------
tab (hispanic) (race)
Person is | Race
Hispanic | White Black American Alaska Na | Total
-------------+--------------------------------------------+----------
Not Hispanic | 10,854 1,661 86 5 | 13,401
Hispanic | 822 30 3 0 | 1,757
-------------+--------------------------------------------+----------
Total | 11,676 1,691 89 5 | 15,158
Person is | Race
Hispanic | Indigenou Asian Pacific I Other | Total
-------------+--------------------------------------------+----------
Not Hispanic | 23 503 23 28 | 13,401
Hispanic | 5 1 1 794 | 1,757
-------------+--------------------------------------------+----------
Total | 28 504 24 822 | 15,158
Person is | Race
Hispanic | Two or mo | Total
-------------+-----------+----------
Not Hispanic | 218 | 13,401
Hispanic | 101 | 1,757
-------------+-----------+----------
Total | 319 | 15,158
In general, the variable with more levels works best as the row variable.
Exercise 2
Create a crosstab of race
and hispanic
again, but this time only include White, Black, Asian, and Other. Include a total row but not a total column.
You can identify the values associated with those levels with:
tab race
tab race, nolabel
Race | Freq. Percent Cum.
------------------------+-----------------------------------
White | 11,676 77.03 77.03
Black | 1,691 11.16 88.18
American Indian | 89 0.59 88.77
Alaska Native | 5 0.03 88.80
Indigenous, Unspecified | 28 0.18 88.99
Asian | 504 3.32 92.31
Pacific Islander | 24 0.16 92.47
Other | 822 5.42 97.90
Two or more races | 319 2.10 100.00
------------------------+-----------------------------------
Total | 15,158 100.00
Race | Freq. Percent Cum.
------------+-----------------------------------
1 | 11,676 77.03 77.03
2 | 1,691 11.16 88.18
3 | 89 0.59 88.77
4 | 5 0.03 88.80
5 | 28 0.18 88.99
6 | 504 3.32 92.31
7 | 24 0.16 92.47
8 | 822 5.42 97.90
9 | 319 2.10 100.00
------------+-----------------------------------
Total | 15,158 100.00
table (race[1 2 6 8 .m]) (hispanic), total(hispanic)
----------------------------------
| Person is Hispanic
| Not Hispanic Hispanic
--------+-------------------------
Race |
White | 10,854 822
Black | 1,661 30
Asian | 503 1
Other | 28 794
Total | 13,401 1,757
----------------------------------
Exercise 3
Create a similar table with hispanic
for the columns, and summary statistics for age
, income
, and edu
. For income
and age
, report the 10th, 25th, 50th, 75th, and 90th percentiles rather than the mean and standard deviation, but don’t worry about labeling them. (Hint: in the stat()
option, p#
requests the #th percentile.) Report overall frequencies but not percentages.
table (var result) (hispanic), ///
frequency) ///
stat(///
stat(p10 age income) ///
stat(p25 age income) ///
stat(p50 age income) ///
stat(p75 age income) ///
stat(p90 age income) ///
stat(fvfreq edu) table-1) style(
-----------------------------------------------------------
| Person is Hispanic
| Not Hispanic Hispanic Total
------------------------+----------------------------------
| 13,401 1,757 15,158
|
Age | 23 21 23
| 31 26 30
| 41 34 40
| 50 44 50
| 58 53 58
|
Income | 5100 3900 5000
| 12100 9000 12000
| 25600 16000 24400
| 43000 28000 41100
| 68200 45000 65400
|
Educational Attainment |
Less Than High School | 1,703 785 2,488
High School Diploma | 3,770 435 4,205
Some College | 4,273 394 4,667
Bachelor's | 2,391 101 2,492
Advanced Degree | 1,264 42 1,306
-----------------------------------------------------------
4.2 Collections
Exercise 1
Identify the levels and labels of the dimension command
. Can you guess what it means?
levelsof command
collect label list command collect
Collection: Table
Dimension: command
Levels: 0
Collection: Table
Dimension: command
Label: Command option index
Level labels:
0 Statistic
Note that command
has one level called 0
, not zero levels.
The command()
option lets you put the results of running the Stata command of your choice in your table. If you do that, the command
dimension will contain the command or commands you ran. However, if you use the stat()
option, including the implicit stat(freq)
option you get if you don’t specify anything, the command
dimension will just say “Statistic”.
Exercise 2
Add log likelihood to to the table. Also, the colname
level female
consists of 0.female
and 1.female
but 0.female
is not of interest, so only ask for 1.female
.
collect addtags colname[ll], fortags(result[ll])_cons N ll]#result[_r_b _r_se N ll]) collect layout (colname[age 1.female
(level ll of dimension result not found)
(0 items changed in collection Table)
(level N of dimension colname not found)
(level _cons of dimension colname not found)
(level 1.female of dimension colname not found)
(level N of dimension result not found)
(level _r_b of dimension result not found)
(level _r_se of dimension result not found)
Collection: Table
Rows: colname[age 1.female _cons N ll]#result[_r_b _r_se N ll]
Your layout specification does not identify any items.
Exercise 3
Regress income
on i.edu
, with separate models for Hispanic and Not Hispanic. Just for variety, make separate tables for each model rather than two columns (recall that the third dimension is separate tables). These models will have very different sample sizes so include sample size in the table, but don’t worry about making it last.
clear
collect bysort hispanic: reg income i.edu
collect: N], fortags(result[N])
collect addtag colname[N]) () (hispanic) collect layout (colname#result[_r_b _r_se
--------------------------------------------------------------------------------
-> hispanic = Not Hispanic
Source | SS df MS Number of obs = 13,401
-------------+---------------------------------- F(4, 13396) = 487.20
Model | 3.0368e+12 4 7.5921e+11 Prob > F = 0.0000
Residual | 2.0875e+13 13,396 1.5583e+09 R-squared = 0.1270
-------------+---------------------------------- Adj R-squared = 0.1267
Total | 2.3912e+13 13,400 1.7845e+09 Root MSE = 39476
-------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------+----------------------------------------------------------------
edu |
High Schoo.. | 6431.412 1152.558 5.58 0.000 4172.236 8690.589
Some College | 12551.76 1131.253 11.10 0.000 10334.35 14769.18
Bachelor's | 30730.25 1251.713 24.55 0.000 28276.71 33183.78
Advanced D.. | 52532.11 1465.568 35.84 0.000 49659.39 55404.83
|
_cons | 18853.56 956.5787 19.71 0.000 16978.53 20728.59
-------------------------------------------------------------------------------
--------------------------------------------------------------------------------
-> hispanic = Hispanic
Source | SS df MS Number of obs = 1,757
-------------+---------------------------------- F(4, 1752) = 47.81
Model | 1.3000e+11 4 3.2499e+10 Prob > F = 0.0000
Residual | 1.1908e+12 1,752 679697080 R-squared = 0.0984
-------------+---------------------------------- Adj R-squared = 0.0964
Total | 1.3208e+12 1,756 752178737 Root MSE = 26071
-------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------+----------------------------------------------------------------
edu |
High Schoo.. | 6511.621 1558.325 4.18 0.000 3455.248 9567.995
Some College | 9306.481 1609.651 5.78 0.000 6149.442 12463.52
Bachelor's | 31167.33 2755.999 11.31 0.000 25761.94 36572.72
Advanced D.. | 34246.58 4129.058 8.29 0.000 26148.18 42344.98
|
_cons | 16153.76 930.5139 17.36 0.000 14328.72 17978.79
-------------------------------------------------------------------------------
(2 items changed in collection default)
Collection: default
Rows: colname#result[_r_b _r_se N]
Tables: hispanic
Table 1: 20 x 1
Table 2: 20 x 1
Not Hispanic
-----------------------------------
Less Than High School |
Coefficient | 0
Std. error | 0
High School Diploma |
Coefficient | 6431.412
Std. error | 1152.558
Some College |
Coefficient | 12551.76
Std. error | 1131.253
Bachelor's |
Coefficient | 30730.25
Std. error | 1251.713
Advanced Degree |
Coefficient | 52532.11
Std. error | 1465.568
N |
Number of observations | 13401
Intercept |
Coefficient | 18853.56
Std. error | 956.5787
-----------------------------------
Hispanic
-----------------------------------
Less Than High School |
Coefficient | 0
Std. error | 0
High School Diploma |
Coefficient | 6511.621
Std. error | 1558.325
Some College |
Coefficient | 9306.481
Std. error | 1609.651
Bachelor's |
Coefficient | 31167.33
Std. error | 2755.999
Advanced Degree |
Coefficient | 34246.58
Std. error | 4129.058
N |
Number of observations | 1757
Intercept |
Coefficient | 16153.76
Std. error | 930.5139
-----------------------------------
Exercise 4
Add another model (model four) that includes age cubed. But to help you decide if adding yet another term to the model is a good idea or not, add R-squared to the table so you can tell if adding age-cubed increased R-squared significantly. Don’t worry about putting R-squared last.
This only works properly after running the first three models:
clear
collect tag(model[(1)]): reg income i.female i.edu
collect, tag(model[(2)]): reg income i.female i.edu age
collect, tag(model[(3)]): reg income i.female i.edu c.age##c.age collect,
Source | SS df MS Number of obs = 15,158
-------------+---------------------------------- F(5, 15152) = 656.63
Model | 4.5380e+12 5 9.0759e+11 Prob > F = 0.0000
Residual | 2.0943e+13 15,152 1.3822e+09 R-squared = 0.1781
-------------+---------------------------------- Adj R-squared = 0.1778
Total | 2.5481e+13 15,157 1.6811e+09 Root MSE = 37178
-------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------+----------------------------------------------------------------
female |
Female | -17536.46 605.1548 -28.98 0.000 -18722.64 -16350.28
|
edu |
High Schoo.. | 7976.849 940.9349 8.48 0.000 6132.503 9821.194
Some College | 14402.38 924.3338 15.58 0.000 12590.58 16214.19
Bachelor's | 32385.91 1054.114 30.72 0.000 30319.72 34452.1
Advanced D.. | 53206.38 1270.507 41.88 0.000 50716.03 55696.73
|
_cons | 25628.12 790.4475 32.42 0.000 24078.75 27177.49
-------------------------------------------------------------------------------
Source | SS df MS Number of obs = 15,158
-------------+---------------------------------- F(6, 15151) = 643.63
Model | 5.1756e+12 6 8.6260e+11 Prob > F = 0.0000
Residual | 2.0305e+13 15,151 1.3402e+09 R-squared = 0.2031
-------------+---------------------------------- Adj R-squared = 0.2028
Total | 2.5481e+13 15,157 1.6811e+09 Root MSE = 36609
-------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------+----------------------------------------------------------------
female |
Female | -17635.83 595.9085 -29.59 0.000 -18803.88 -16467.77
|
edu |
High Schoo.. | 7789.569 926.5708 8.41 0.000 5973.378 9605.759
Some College | 14988.99 910.5812 16.46 0.000 13204.14 16773.84
Bachelor's | 32429.34 1037.98 31.24 0.000 30394.78 34463.91
Advanced D.. | 50349.74 1257.894 40.03 0.000 47884.12 52815.36
|
age | 520.3155 23.85438 21.81 0.000 473.558 567.0729
_cons | 4894.683 1228.56 3.98 0.000 2486.557 7302.809
-------------------------------------------------------------------------------
Source | SS df MS Number of obs = 15,158
-------------+---------------------------------- F(7, 15150) = 582.65
Model | 5.4048e+12 7 7.7211e+11 Prob > F = 0.0000
Residual | 2.0076e+13 15,150 1.3252e+09 R-squared = 0.2121
-------------+---------------------------------- Adj R-squared = 0.2117
Total | 2.5481e+13 15,157 1.6811e+09 Root MSE = 36403
-------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------+----------------------------------------------------------------
female |
Female | -17533.85 592.6066 -29.59 0.000 -18695.43 -16372.27
|
edu |
High Schoo.. | 6753.61 924.7195 7.30 0.000 4941.048 8566.171
Some College | 13655.69 911.117 14.99 0.000 11869.79 15441.59
Bachelor's | 30414.44 1043.45 29.15 0.000 28369.15 32459.73
Advanced D.. | 48301.99 1260.472 38.32 0.000 45831.31 50772.67
|
age | 2519.698 153.8792 16.37 0.000 2218.076 2821.32
|
c.age#c.age | -24.46875 1.860689 -13.15 0.000 -28.11593 -20.82158
|
_cons | -30905.68 2983.924 -10.36 0.000 -36754.54 -25056.83
-------------------------------------------------------------------------------
Now the exercise itself:
tag(model[(4)]): reg income i.female i.edu c.age##c.age##c.age
collect,
collect addtags colname[r2], fortags(result[r2])model) collect layout (colname#result[_r_b _r_se r2]) (
Source | SS df MS Number of obs = 15,158
-------------+---------------------------------- F(8, 15149) = 510.21
Model | 5.4083e+12 8 6.7603e+11 Prob > F = 0.0000
Residual | 2.0073e+13 15,149 1.3250e+09 R-squared = 0.2122
-------------+---------------------------------- Adj R-squared = 0.2118
Total | 2.5481e+13 15,157 1.6811e+09 Root MSE = 36401
-------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------+----------------------------------------------------------------
female |
Female | -17537.59 592.5788 -29.60 0.000 -18699.12 -16376.06
|
edu |
High Schoo.. | 6793.435 924.9927 7.34 0.000 4980.338 8606.533
Some College | 13680.91 911.1991 15.01 0.000 11894.85 15466.97
Bachelor's | 30520.95 1045.443 29.19 0.000 28471.76 32570.15
Advanced D.. | 48311.77 1260.418 38.33 0.000 45841.2 50782.34
|
age | 1341.909 739.6677 1.81 0.070 -107.9288 2791.747
|
c.age#c.age | 6.032787 18.82848 0.32 0.749 -30.8733 42.93888
|
c.age#c.age#|
c.age | -.246602 .1514814 -1.63 0.104 -.5435237 .0503198
|
_cons | -16935.91 9085.218 -1.86 0.062 -34744.03 872.212
-------------------------------------------------------------------------------
(4 items changed in collection default)
Collection: default
Rows: colname#result[_r_b _r_se r2]
Columns: model
Table 1: 35 x 4
---------------------------------------------------------------
| (1) (2) (3) (4)
----------------------+----------------------------------------
Male |
Coefficient | 0 0 0 0
Std. error | 0 0 0 0
Female |
Coefficient | -17536.46 -17635.83 -17533.85 -17537.59
Std. error | 605.1548 595.9085 592.6066 592.5788
Less Than High School |
Coefficient | 0 0 0 0
Std. error | 0 0 0 0
High School Diploma |
Coefficient | 7976.849 7789.569 6753.61 6793.435
Std. error | 940.9349 926.5708 924.7195 924.9927
Some College |
Coefficient | 14402.38 14988.99 13655.69 13680.91
Std. error | 924.3338 910.5812 911.117 911.1991
Bachelor's |
Coefficient | 32385.91 32429.34 30414.44 30520.95
Std. error | 1054.114 1037.98 1043.45 1045.443
Advanced Degree |
Coefficient | 53206.38 50349.74 48301.99 48311.77
Std. error | 1270.507 1257.894 1260.472 1260.418
Age |
Coefficient | 520.3155 2519.698 1341.909
Std. error | 23.85438 153.8792 739.6677
Age # Age |
Coefficient | -24.46875 6.032787
Std. error | 1.860689 18.82848
Age # Age # Age |
Coefficient | -.246602
Std. error | .1514814
r2 |
R-squared | .1780919 .2031156 .2121091 .2122469
Intercept |
Coefficient | 25628.12 4894.683 -30905.68 -16935.91
Std. error | 790.4475 1228.56 2983.924 9085.218
---------------------------------------------------------------
4.3 Appearance
Exercise
The commands below create a table of t-test results (you can also think of it as a table of summary statistics that happens to include whether the differences between the groups are significant). Make it pretty–but don’t try to add stars for significance.
clear
collect diff=(r(mu_1) - r(mu_2)), tag(var[Age]): ttest age, by(hispanic)
collect diff=(r(mu_1) - r(mu_2)), tag(var[Income]): ttest income, by(hispanic)
collect
collect addtag col[not_hispanic], fortags(result[mu_1 sd_1])
collect addtag col[hispanic], fortags(result[mu_2 sd_2])diff])
collect addtag col[Difference], fortags(result[p])
collect addtag col[pval], fortags(result[row[mean], fortags(result[mu_1 mu_2 diff p])
collect addtag row[sd], fortags(result[sd_1 sd_2])
collect addtag collect layout (var#row) (col)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
Not Hisp | 13,401 40.73838 .1085502 12.56607 40.5256 40.95115
Hispanic | 1,757 35.68754 .2817152 11.80854 35.135 36.24007
---------+--------------------------------------------------------------------
Combined | 15,158 40.15292 .1022155 12.58455 39.95257 40.35328
---------+--------------------------------------------------------------------
diff | 5.050842 .3166677 4.430135 5.671549
------------------------------------------------------------------------------
diff = mean(Not Hisp) - mean(Hispanic) t = 15.9500
H0: diff = 0 Degrees of freedom = 15156
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
Not Hisp | 13,401 35102.85 364.9115 42243.14 34387.57 35818.13
Hispanic | 1,757 22463.13 654.2966 27425.88 21179.85 23746.41
---------+--------------------------------------------------------------------
Combined | 15,158 33637.75 333.0283 41001.73 32984.97 34290.52
---------+--------------------------------------------------------------------
diff | 12639.72 1035.28 10610.44 14668.99
------------------------------------------------------------------------------
diff = mean(Not Hisp) - mean(Hispanic) t = 12.2090
H0: diff = 0 Degrees of freedom = 15156
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000
(4 items changed in collection default)
(4 items changed in collection default)
(2 items changed in collection default)
(2 items changed in collection default)
(8 items changed in collection default)
(4 items changed in collection default)
Collection: default
Rows: var#row
Columns: col
Table 1: 6 x 4
--------------------------------------------------
| not_hispanic hispanic Difference pval
-------+------------------------------------------
Age |
mean | 40.73838 35.68754 5.050842 8.26e-57
sd | 12.56607 11.80854
Income |
mean | 35102.85 22463.13 12639.72 4.03e-34
sd | 42243.14 27425.88
--------------------------------------------------
The results created by ttest
, for reference:
label list result collect
Collection: default
Dimension: result
Label: Result
Level labels:
N_1 Sample size n₁
N_2 Sample size n₂
df_t Degrees of freedom
level Confidence level
mu_1 x₁ mean for population 1
mu_2 x₂ mean for population 2
p Two-sided p-value
p_l Lower one-sided p-value
p_u Upper one-sided p-value
sd Combined std. dev.
sd_1 Standard deviation for first variable
sd_2 Standard deviation for second variable
se Std. error
t t statistic
And now the exercise proper:
label levels col ///
collect "Not Hispanic" hispanic "Hispanic" pval "P-Value", modify
not_hispanic row, level(hide)
collect style header row[sd], sformat("(%s)")
collect style cell row[mean], sformat("%s ")
collect style cell var[Age], nformat(%12.1f)
collect style cell var[Income], nformat(%12.0fc)
collect style cell
collect style cell col[pval], nformat(%8.2f) collect style cell, border(right, pattern(nil))
collect preview
-----------------------------------------------
Not Hispanic Hispanic Difference P-Value
-----------------------------------------------
Age 40.7 35.7 5.1 0.00
(12.6) (11.8)
Income 35,103 22,463 12,640 0.00
(42,243) (27,426)
-----------------------------------------------
Of course your definition of “pretty” may be different, and that’s fine.