Bar Graphs in Stata

Bar graphs are simple but powerful (or rather, powerful because they are simple) tools for conveying information. They can be understood at a glance by both technical and non-technical audiences, and often tell you much more than summary statistics will. This article will show you how to make a variety of useful bar graphs using Stata.

The Example Data Set

SSCC's statistical consultants have been asked to analyze several workplace surveys in recent years, so the example data we'll use has that theme (much of this article came out of our efforts to find ways to present our results to very busy leaders). You can obtain the data by typing, or more likely copying and pasting, the following in a do file:

use http://ssc.wisc.edu/sscc/pubs/bargraphs/bar_example.dta

It contains fictional data with 1,000 observations and five variables:

sat: responses to the question "In general, how satisfied are you with your job?" on a five-point scale ranging from "Very Dissatisfied" to "Very Satisfied."
eng: a numeric measure of employee engagement from 1 to 100.
leave: responses to the question "How likely are you to leave your job in the next year?" on a five-point scale ranging from "Very Likely" to "Very Unlikely."
stay: a binary variable based on leave. It is 0 if the respondent said they were likely to leave and 1 otherwise.
female: a binary variable which is 1 if the respondent is female and 0 if the respondent is male.

Distribution of a Single Variable

The most basic task of a bar graph is to help you understand the distribution of a single categorical variable. Begin with the sat variable (job satisfaction) and the most basic bar graph:

graph bar, over(sat)

The graph bar command tells Stata you want to make a bar graph, and the over() option tells it which variable defines the categories to be described. By default it will tell you the percentage of observations that fall in each category. Unfortunately, the result is not very satisfactory.

The categories are labeled using the value labels of the sat variable, but they're unreadable because they overlap. You can fix this problem easily and naturally by making the whole graph horizontal rather than vertical. Just change graph bar to graph hbar.

The y axis title "percent" is vague. Make it more clear with a ytitle() option. Note that this axis will be horizontal since you're now making a horizontal graph, but it's still referred to as the y axis.

This graph is also in dire need of an overall title, which can be added using the title() option. For graphs describing surveys, the question text is often a useful title. The title text doesn't always need to go in quotes, but this one does because it contains a comma. Without quotes, Stata will think you're trying to set title options.

Stata graph commands often get long; you can make them more readable by splitting them across multiple lines if you use /// to tell Stata the command continues on the next line. For this article, we'll put just one option per line, though some options will soon take more than one line.

graph hbar, ///
over(sat) ///
ytitle("Percent of Respondents") ///
title("In general, how satisfied are you with your job?")

Now the problem is that the text doesn't fit in the graph. You can do several things to fix that:

Reduce the size of the category labels using the label(labsize(small)) option
Reduce the size of the y axis title using the size(small) option
Allow the title to use the space above the axis labels (and be centered across the entire space) using the span option
Reduce the size of the title using the size(medium) option

You could also split the title into multiple lines by putting each line in its own set of quotes, but that won't be necessary here.
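For reference, a minimal sketch of that alternative follows. It assumes the example data set is loaded, and the point at which the title is broken is an arbitrary choice for illustration:

graph hbar, ///
over(sat) ///
ytitle("Percent of Respondents") ///
title("In general, how satisfied" "are you with your job?")

Each quoted string becomes one line of the title.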

Each of the new options goes inside the option for the thing it controls. They are options for options!

graph hbar, ///
over(sat, label(labsize(small))) ///
ytitle("Percent of Respondents", size(small)) ///
title("In general, how satisfied are you with your job?" ///
, span size(medium))

We're getting much closer, but the label "Neither Satisfied nor Dissatisfied" is still being truncated. One good solution would be to use a shorter label. On the other hand, an elegant command called splitvallabels by Nick Winter and Ben Jann will take value labels, split them into multiple lines, and make them available as an `r(relabel)' macro in a form the relabel() option can understand. (The relabel() option allows you to set category labels to whatever you want without setting value labels for the variable, but using value labels is a good practice for many reasons.) You can get splitvallabels by running:

ssc install splitvallabels

You only need to run this once—don't put it in a research do file that you'll run over and over.

splitvallabels sat
graph hbar, ///
over(sat, label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Percent of Respondents", size(small)) ///
title("In general, how satisfied are you with your job?" ///
, span size(medium))

If you are new to macros, note that the character at the left of `r(relabel)' is the left single quote, found on the left side of your keyboard under the tilde, and the character at the right of it is the right single quote, found on the right side of your keyboard under the double quote.

This fixes the truncated label and reduces the amount of space taken up by the labels in general, leaving more space for the graph.

Because the splitvallabels command stores its results in r(), they'll be replaced by any other command that stores results in r(), including graph. If you'll be using a set of labels repeatedly, you could store them in a separate macro:

local relabel `r(relabel)'

and then use `relabel' in subsequent commands rather than `r(relabel)'. For this article we'll instead run the splitvallabels command before each graph, so each graph can be run separately.

This is now a usable graph, but some might complain that it does not have the precision of a table giving the percentages as numbers. No problem: you can have the numbers too by adding a blabel(bar) option, meaning Stata should label each bar with the height of the bar. You'll almost certainly want to control the number of decimal places displayed with a format() option like format(%4.1f). This means the number on each bar should take up four total spaces, including the decimal point, with one number after the decimal point.

One final tweak: if someone prints this graph, the bars will use a lot of toner and, depending on the printer, the ink may streak. You can avoid that by reducing the "intensity" of the colors with the intensity() option, for example intensity(25).
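Putting the last two suggestions together, the percentage graph might look like the sketch below. It assumes the example data set is loaded and splitvallabels is installed; format() goes inside blabel(), and intensity(25) is the value used in the later commands in this article:

splitvallabels sat
graph hbar, ///
over(sat, label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Percent of Respondents", size(small)) ///
title("In general, how satisfied are you with your job?" ///
, span size(medium)) ///
blabel(bar, format(%4.1f)) ///
intensity(25)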

If you want frequencies rather than percentages, tell graph hbar that the thing you want to plot is the (count). You'll also want to change the ytitle() and the format of the bar labels—with integers the default will do so you can just remove the format() option entirely.

splitvallabels sat
graph hbar (count), ///
over(sat, label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Number of Respondents", size(small)) ///
title("In general, how satisfied are you with your job?" ///
, span size(medium)) ///
blabel(bar) ///
intensity(25)

Tip for advanced users: When working with survey data, you may find it useful to put the text for each question in a global macro with the same name as the variable:

global sat "In general, how satisfied are you with your job?"

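Then your graph commands can start with the following (the quotes around the macro are needed because this question's text contains a comma):

graph hbar, over(sat) title("$sat")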

This is highly convenient in loops:

foreach question of varlist q1-q10 {
graph hbar, over(`question') title("$`question'")
}

The macro processor will first replace the local macro `question' with a specific variable name, and then replace the resulting global macro with that question's text.

If you create a do file that defines global macros for all your questions, you can just put:

include question_file.do

early in any do file that will use them and they'll be ready to go.
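As a rough sketch, such a file might contain nothing more than one global macro definition per question. The file name question_file.do comes from the example above; the contents shown here are an assumption, built from the two question texts given for sat and leave in this article:

* question_file.do: one global macro per survey question,
* named after the variable that holds the responses
global sat "In general, how satisfied are you with your job?"
global leave "How likely are you to leave your job in the next year?"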

Bar Graphs vs. Means

We often see people use means to summarize Likert scales and other ordered categorical variables like sat. Means convey useful information at a glance, but they also hide a lot. They can also hide strong implicit assumptions. For example, having one person change from "Neither Satisfied nor Dissatisfied" to "Somewhat Satisfied" will have exactly the same effect on the mean as someone moving from "Very Dissatisfied" to "Somewhat Dissatisfied," which may or may not be equivalent in any meaningful sense.

A bar graph can be taken in at a glance just like a mean, but conveys far more information. Once you're comfortable making and using bar graphs, they're almost as easy to add to a document as a mean.

Relationships between a Categorical Variable and a Quantitative Variable

Bar graphs are also good tools for examining the relationship (joint distribution) of a categorical variable and some other variable.

To create a bar graph where the length of the bar tells you the mean value of a quantitative variable for each category, just tell graph hbar to plot that variable. If you want a different summary statistic, like the median, put that summary statistic in parentheses before the variable name just like you did with (count).
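For instance, a bare-bones sketch of the median version, without any of the formatting options and assuming the example data set is loaded, would be:

graph hbar (median) eng, over(sat)

Leaving out (median) plots the mean, which is the default statistic when a variable is specified.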

You'll also need to change the title and y axis title, and set the formatting of the bar labels.

splitvallabels sat
graph hbar eng, ///
over(sat, label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Mean Engagement", size(small)) ///
title("Mean Engagement by Job Satisfaction" ///
, span size(medium)) ///
blabel(bar, format(%4.1f)) ///
intensity(25)

This suggests a strong but nonlinear relationship.

In this plot, the length of the bar measures how far away from zero the mean is. Sometimes that isn't a meaningful measure: for example, a Likert scale can never be zero. In those cases you might consider using a dot plot instead; just replace hbar with dot:

splitvallabels sat
graph dot eng, ///
over(sat, label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Mean Engagement", size(small)) ///
title("Mean Engagement by Job Satisfaction" ///
, span size(medium)) ///
blabel(bar, format(%4.1f)) ///
intensity(25)

The trouble with this graph is that the dots interfere with the bar labels. You can fix that by putting a white box around each label which will cover up the dots. This is done with the box option and both fcolor(white) and lcolor(white) to set the "fill" (inside) of the box and the line around it to white. The last label (85.8) will look a little funny if its white box covers up the light blue margin around the graph, so increase the range of the y axis to 90 with yscale(range(0 90)) so it no longer crosses over into the margin.
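A sketch of the dot plot with those changes applied is shown below. It assumes the example data set is loaded and splitvallabels is installed, and it carries over every other option unchanged from the previous command:

splitvallabels sat
graph dot eng, ///
over(sat, label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Mean Engagement", size(small)) ///
title("Mean Engagement by Job Satisfaction" ///
, span size(medium)) ///
blabel(bar, format(%4.1f) box fcolor(white) lcolor(white)) ///
intensity(25) ///
yscale(range(0 90))

The box, fcolor(), and lcolor() options go inside blabel() because they describe the label text boxes, while yscale() is an option of the graph as a whole.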

Tip for advanced users: If you'd like to have the frequencies for each category in the graph too, put them in the value labels! The following code loops over the values of sat, stores the label associated with each value, counts how many observations have that value, constructs a new label containing both the old label and the number of observations, then applies it:

levelsof sat, local(vals)
foreach val of local vals {
local oldlabel: label (sat) `val'
count if sat==`val'
label define newsat `val' "`oldlabel' (N=`r(N)')", add
}
label values sat newsat

This relies on the macro extended function label, which allows you to access value labels in various ways and store them in macros.

Just remember to change the labels back to the original when you're done:

label values sat sat

Relationships between a Categorical Variable and a Binary Variable

If you're thinking of the binary variable as an outcome, then the proportion of "successes" (whatever that means in your data set) in each group may be of interest. But, assuming the binary variable is coded such that 1 means "success" and 0 means "failure," the proportion of successes is just the mean of the binary variable and can be plotted just like the mean of eng:

splitvallabels sat
graph hbar stay, ///
over(sat, label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Proportion Staying", size(small)) ///
title("Proportion Unlikely to Leave by Job Satisfaction" ///
, span size(medium)) ///
blabel(bar, format(%4.2f)) ///
intensity(25)

If the binary variable denotes two groups you're comparing, like female, then you should consider frequencies (count) or percentages (the default) for each combination of the two variables. Start with frequencies.

The binary variable to examine will be specified in another over() option, but it makes a big difference which variable you put first. If you put sat first and then female, you'll get:

splitvallabels sat
graph hbar (count), ///
over(sat, label(labsize(small)) relabel(`r(relabel)')) ///
over(female,label(labsize(small))) ///
ytitle("Number of Respondents", size(small)) ///
title("Job Satisfaction by Gender" ///
, span size(medium)) ///
blabel(bar) ///
intensity(25)

If you put female first and then sat, you'll get:

splitvallabels sat
graph hbar (count), ///
over(female,label(labsize(small))) ///
over(sat, label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Number of Respondents", size(small)) ///
title("Job Satisfaction by Gender" ///
, span size(medium)) ///
blabel(bar) ///
intensity(25)

The latter form generally makes for easier comparisons. But in this case the only thing you can easily learn by comparing them is that more females answered the survey than males. It would be much more useful to look at what percentage of males and females are in each satisfaction category. Unfortunately, just telling graph hbar to plot percentages doesn't do that:

splitvallabels sat
graph hbar, /// percent is the default
over(female,label(labsize(small))) ///
over(sat, label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Percent of Respondents", size(small)) ///
title("Job Satisfaction by Gender" ///
, span size(medium)) ///
blabel(bar, format(%4.1f)) ///
intensity(25)

This gives you the percentages calculated across all respondents, not calculated separately for males and females. The graph hbar command does not allow you to control how the percentages are calculated.

Enter the very useful catplot, by Nick Cox. Get it with:

ssc install catplot

The catplot command is a "wrapper" for graph hbar so most of what we've done carries over directly. What it adds (among other things) is a percent() option that allows you to specify what groups percentages will be calculated over, in this case percent(female).

The catplot command does some things differently than graph hbar:

The variables inside the over() options are moved to a variable list directly after the command itself
The options inside the over() options are moved into options called var1opts(), var2opts(), etc. corresponding to the variable order in the variable list

Thus the catplot version of the last command becomes:

splitvallabels sat
catplot female sat, ///
percent(female) ///
var1opts(label(labsize(small))) ///
var2opts(label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Percent of Respondents by Gender", size(small)) ///
title("Job Satisfaction by Gender" ///
, span size(medium)) ///
blabel(bar, format(%4.1f)) ///
intensity(25)

This allows us to see that the relationship between sat and female is complex in this (fictional) data set, with females more likely to be both very satisfied and very dissatisfied.

An alternative form of this graph uses color to distinguish between the groups, adding a legend to define their meanings. You can get this form by adding the asyvars option.
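As a sketch, this is the previous catplot command with asyvars added as the last option (assuming, as before, that the example data set is loaded and both catplot and splitvallabels are installed):

splitvallabels sat
catplot female sat, ///
percent(female) ///
var1opts(label(labsize(small))) ///
var2opts(label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Percent of Respondents by Gender", size(small)) ///
title("Job Satisfaction by Gender" ///
, span size(medium)) ///
blabel(bar, format(%4.1f)) ///
intensity(25) ///
asyvars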

This greatly reduces the clutter on the left of the graph, at the cost of adding some to the bottom and forcing the reader to look in two places to understand what the bars mean. You should consider what the graph will look like to someone who is colorblind, and may need to think about whether the bars will be distinguishable if printed on a black and white printer (the graphs in this article fail that test). For this particular graph you may want to change the default colors to avoid unfortunate associations—you'll soon learn how.

Relationships between Two Categorical Variables

The code for creating graphs that compare two categorical variables is identical to the code for comparing one categorical variable and one binary variable, but the result has a lot more bars. The bar labels will overlap unless you shrink them with a size(vsmall) option.
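A sketch of such a command for leave and sat follows, again assuming the example data set is loaded and catplot and splitvallabels are installed. It uses the default colors and legend, with percentages calculated within each level of sat:

splitvallabels sat
catplot leave sat, ///
percent(sat) ///
var1opts(label(labsize(small))) ///
var2opts(label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Percent of Respondents by Satisfaction", size(small)) ///
title("Probability of Leaving by Job Satisfaction" ///
, span size(medium)) ///
blabel(bar, format(%4.1f) size(vsmall)) ///
intensity(25) ///
asyvars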

If you stare at this for a bit you'll see that higher levels of satisfaction are associated with lower probabilities of leaving, but it's not obvious at a glance. It would help if the graph reflected the fact that the categories of leave are ordered. Right now both the colors used and the arrangement of the categories in the legend feel arbitrary.

To fix the colors, take control of the individual bars with bar() options. Each bar will get its own option, identified by a number inside the parentheses. Then you can use options to control the properties of that bar. For example, bar(1, color(maroon) fintensity(inten80)) means the first bar will be maroon (dark red) with a fill intensity of 80% (inten80 being shorthand for that). We'll make "likely to leave" red (officially maroon), "neither likely nor unlikely" gray, and "unlikely to leave" blue (officially navy), with the degree of likelihood represented by fill intensity.

Next the legend, which is controlled by putting options within the legend() option. The category ordering will be much clearer if all the entries are in a single row, which you can do with rows(1). To make that fit, put the labels underneath the color symbols rather than next to them with stack. Shrink the text with size(small). However, we still need to break the labels into multiple lines. Unfortunately, the syntax for setting legend labels is just different enough that splitvallabels can't help you, so you'll have to do it yourself. The easy way to set a bunch of labels at once is to use the order() option:

order(1 "Very" "likely" 2 "Somewhat" "likely" ///
3 "Neither likely" "nor unlikely" ///
4 "Somewhat" "unlikely" 5 "Very" "unlikely")

This is getting picky, but everything will line up better if you use symplacement(center) to put the color symbols in the center of their space. You can add a title to the legend itself by putting a title() option inside legend(), but you'll want to make it size(small).

The resulting code is long, but each component part is straightforward:

splitvallabels sat
catplot leave sat, ///
percent(sat) ///
var1opts(label(labsize(small))) ///
var2opts(label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Percent of Respondents by Satisfaction", size(small)) ///
title("Probability of Leaving by Job Satisfaction" ///
, span size(medium)) ///
blabel(bar, format(%4.1f) size(vsmall)) ///
intensity(25) ///
asyvars ///
bar(1, color(maroon) fintensity(inten80)) ///
bar(2, color(maroon) fintensity(inten60)) ///
bar(3, color(gray) fintensity(inten40)) ///
bar(4, color(navy) fintensity(inten60)) ///
bar(5, color(navy) fintensity(inten80)) ///
legend(rows(1) stack size(small) ///
order(1 "Very" "likely" 2 "Somewhat" "likely" ///
3 "Neither likely" "nor unlikely" ///
4 "Somewhat" "unlikely" 5 "Very" "unlikely") ///
symplacement(center) ///
title(Likelihood of Leaving, size(small)))

The result is still fairly cluttered, however, and the bars to be compared aren't very close together. An alternative is to stack the bars for each category using the stack option. Unfortunately, getting bar labels to work with stacked bars is not straightforward (there won't always be space for them), so take out the blabel() option.

splitvallabels sat
catplot leave sat, ///
percent(sat) ///
var1opts(label(labsize(small))) ///
var2opts(label(labsize(small)) relabel(`r(relabel)')) ///
ytitle("Percent of Respondents by Satisfaction", size(small)) ///
title("Probability of Leaving by Job Satisfaction" ///
, span size(medium)) ///
intensity(25) ///
asyvars stack ///
bar(1, color(maroon) fintensity(inten80)) ///
bar(2, color(maroon) fintensity(inten60)) ///
bar(3, color(gray) fintensity(inten40)) ///
bar(4, color(navy) fintensity(inten60)) ///
bar(5, color(navy) fintensity(inten80)) ///
legend(rows(1) stack size(small) ///
order(1 "Very" "likely" 2 "Somewhat" "likely" ///
3 "Neither likely" "nor unlikely" ///
4 "Somewhat" "unlikely" 5 "Very" "unlikely") ///
symplacement(center) ///
title(Likelihood of Leaving, size(small)))

This graph often requires some explanation. But once people grasp that what they should be looking for is how the colors move left or right as you go up or down categories, they can see the relationship between the variables very clearly. Compare with:

tab sat leave, row

                      |                         leave
                  sat | Very like  Somewhat   Neither l  Somewhat   Very unli |     Total
----------------------+-------------------------------------------------------+----------
    Very Dissatisfied |         5         22         13          6          0 |        46 
                      |     10.87      47.83      28.26      13.04       0.00 |    100.00 
----------------------+-------------------------------------------------------+----------
Somewhat Dissatisifie |         3         24         28         22          3 |        80 
                      |      3.75      30.00      35.00      27.50       3.75 |    100.00 
----------------------+-------------------------------------------------------+----------
Neither Satisfied nor |         5         43         81         82         20 |       231 
                      |      2.16      18.61      35.06      35.50       8.66 |    100.00 
----------------------+-------------------------------------------------------+----------
   Somewhat Satisfied |         3         31         94        118         52 |       298 
                      |      1.01      10.40      31.54      39.60      17.45 |    100.00 
----------------------+-------------------------------------------------------+----------
       Very Satisfied |         5         37        100        129         74 |       345 
                      |      1.45      10.72      28.99      37.39      21.45 |    100.00 
----------------------+-------------------------------------------------------+----------
                Total |        21        157        316        357        149 |     1,000 
                      |      2.10      15.70      31.60      35.70      14.90 |    100.00

Once you've created your bar graphs, you'll need to save them in a format that you can use. Using Stata Graphs in Documents will help you with that. You may also be interested in An Introduction to Stata Graphics, which will introduce you to many other kinds of graphs, and also more options you can use with bar graphs. A do file that contains all the code for this article is available here.

Last Revised: 5/21/2018