4  Do Files

Do files are simply text files whose names end with .do and which contain Stata commands. Sometimes people call them programs, though Stata uses this term for something else.

Working in Stata involves three different things: the commands you run, the data they act on, and the results they produce. A properly written do file manages all three. It contains all the commands needed to carry out its work. It loads the appropriate data and saves new versions of the data when needed. And it stores all the results in a permanent log file.

Do files are the key to reproducibility: a properly written do file (or set of do files) will reproduce your research at will. But they also make working with Stata much more efficient. At the most basic level, writing a do file ensures you can quit for the night and pick up where you left off the next day without having to start over. If you change your mind about what you want to do, simply change the do file and run it again. If you find you’ve made a mistake, fix the do file and run it again.

Start the Do File Editor by clicking on the button that looks like a pencil writing in a notebook or by typing doedit.

4.1 Anatomy of a Do File

Almost all do files carry out the same basic steps.

4.1.1 Create a Log File to Store Results

The first thing your do file should do is set up a log file which will store its results. Make sure that no previous log files are still open with:

capture log close

This is important because if your do file crashes before it reaches the command that closes its log, the log file stays open, and the next attempt to open a log will fail.

The capture prefix tells Stata to ignore any error messages the following command produces. In this case, we use it because we want the do file to proceed whether there’s an open log to close or not.
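
To see what capture does in isolation, here is a minimal sketch you could run from a scratch do file. It relies on _rc, which holds the return code of the most recent command run under capture; the exact nonzero value does not matter here.

```stata
capture log close     // closes a log if one is open; otherwise the error is suppressed
display _rc           // 0 if a log was closed, nonzero if there was none to close

capture log close     // no log is open at this point, so log close fails quietly
display _rc           // nonzero, yet the do file keeps going
```

Without capture, the second log close would stop the do file with an error.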

Then open a new log file. We suggest giving a log file the same name as the do file whose results it records, so there’s never any confusion about which log goes with which do file. To give your do file a name, press Ctrl-s or click File, Save as and call it first.do, being sure to save it in Stata’s working directory (the directory shown in the lower left corner of the main Stata window). Then go back to the do file itself and type:

log using first.log, replace

The replace option tells Stata it’s okay to replace previous versions of that file. Specifying the .log extension tells Stata you want a plain text log, which can be used by many programs.
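
If you would rather not rely on the file extension, the text option of log using requests the same plain text format. The sketch below is an alternative way to write the command, not something to add to first.do. With neither the .log extension nor the text option, Stata would normally write the log in its own SMCL format instead.

```stata
capture log close
log using first, replace text   // no extension given, so text makes this first.log
log close
```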

4.1.2 Clear Stata’s Memory

Another key to reproducibility is to always start with a blank slate, so the next command should be:

clear all

This clears out any data or stored results from whatever you were doing before running this do file.

4.1.3 Open a Data Set

In general, you’ll load data with the use command. However, since we’re using the auto data set that comes with Stata, you’ll open it with sysuse:

sysuse auto

Every time you run this do file, it will load a fresh copy of the data from disk into memory. This means you don’t have to worry about any mistakes you might have made previously, or keep track of the current state of the data set.
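
For data that does not ship with Stata, the pattern is the same with use in place of sysuse. The sketch below first makes a copy of the auto data so there is a file to load; mycars is a hypothetical file name chosen only for illustration.

```stata
clear all
sysuse auto
save mycars, replace   // mycars.dta is a hypothetical file in the working directory

clear all
use mycars             // the .dta extension is assumed
describe, short        // confirm what was loaded
```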

4.1.4 Do Your Work

You’re now ready to do your work. For now just add:

list make if foreign

Of course real do files will have many more (and much more useful) commands at this point.

4.1.5 Save your Data

Do files that carry out data wrangling change the data set and need to save the new version at the end. This do file does not change the data, but save it anyway for practice:

save autoV2, replace

The replace option again allows Stata to overwrite the output from previous attempts to run the do file.

Never, ever save your output data set over your input data set. (In other words, the starting use command and the ending save command should never act on the same file.) If you do, the data set your do file was written to work with will no longer exist. The do file may not run at all, and if it does it most likely won’t give the same results. If it turns out you made a mistake, you may have to go back to your raw data and start over.
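
Here is a minimal sketch of what goes wrong, assuming a hypothetical working copy called cars.dta. The same three commands are run twice, as they would be if you ran the offending do file twice, and the second run quietly produces different data.

```stata
clear all
sysuse auto
save cars, replace            // hypothetical working copy of the auto data

* First run of a do file that loads and saves the same file
use cars, clear
replace price = price/1000    // convert price to thousands of dollars
save cars, replace

* Second run of the same commands: price is divided by 1000 again
use cars, clear
replace price = price/1000
save cars, replace
summarize price               // no longer in thousands of dollars
```

Had the do file saved its output under a new name, the second run would have started from the original prices and given the same result as the first.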

4.1.6 Close your Log

The last line of the do file will normally be:

log close

If you don’t close the do file’s log, any commands you run after the do file finishes will be recorded in the log. The same thing happens if your do file crashes before reaching the log close command. That’s the reason you started your do file with capture log close: having your do file crash, fixing it, and running it again is a completely normal part of working in Stata, and you don’t want to have to manually close the log every time.

4.2 Running a Do File

The easiest way to run a do file is to press Ctrl-d in the Do File Editor, or click the icon on the far right that looks like a “play” button over some code. If you first select just part of the do file then only that part will be run.

Running parts of your code rather than the entire do file can be useful, but code taken out of context won’t always work. For example, if you run a command that creates a variable x, realize you made a mistake, and then fix it, you can’t simply select the command that creates x and run it again because x already exists. You could manually drop the existing version of x, but now you’re doing things in a non-reproducible way. Running the entire do file will eliminate this problem because it reloads the data from disk every time. If you find yourself getting confused by these kinds of issues, run the entire do file rather than a selection.
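
The situation described above can be sketched as follows. The variable x and the formulas are made up for illustration; capture noisily shows the error message but lets the sketch continue.

```stata
clear all
sysuse auto
generate x = price*2                   // oops: meant price/2

* Re-running only the corrected line fails because x is still in memory
capture noisily generate x = price/2
display _rc                            // nonzero

* Running everything from the top works, since sysuse reloads the data
clear all
sysuse auto
generate x = price/2
```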

You can also tell Stata to run a do file with the do command followed by the name of the do file to run. This means do files can run other do files. It can be very helpful to write a “master” do file that runs all the other do files associated with a project, in order.
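
A master do file can be as short as the sketch below. It assumes first.do is in the working directory; second.do is a hypothetical later step, left commented out because it does not exist yet.

```stata
* master.do: runs every do file in the project, in order
do first        // the .do extension is assumed
* do second     // hypothetical next step; uncomment once second.do exists
```

Because each do file opens and closes its own log, the master do file does not need one.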

At this point you should have the following in the do file editor, saved as first.do. Copy-and-paste if needed, then run the do file by pressing Ctrl-d.

capture log close
log using first.log, replace

clear all
sysuse auto

list make if foreign

save autoV2, replace
log close

-------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /home/r/rdimond/kb/stata_intro/first.log
  log type:  text
 opened on:  26 Dec 2024, 12:38:32
(1978 automobile data)

     +----------------+
     | make           |
     |----------------|
 53. | Audi 5000      |
 54. | Audi Fox       |
 55. | BMW 320i       |
 56. | Datsun 200     |
 57. | Datsun 210     |
     |----------------|
 58. | Datsun 510     |
 59. | Datsun 810     |
 60. | Fiat Strada    |
 61. | Honda Accord   |
 62. | Honda Civic    |
     |----------------|
 63. | Mazda GLC      |
 64. | Peugeot 604    |
 65. | Renault Le Car |
 66. | Subaru         |
 67. | Toyota Celica  |
     |----------------|
 68. | Toyota Corolla |
 69. | Toyota Corona  |
 70. | VW Dasher      |
 71. | VW Diesel      |
 72. | VW Rabbit      |
     |----------------|
 73. | VW Scirocco    |
 74. | Volvo 260      |
     +----------------+
file autoV2.dta saved
      name:  <unnamed>
       log:  /home/r/rdimond/kb/stata_intro/first.log
  log type:  text
 closed on:  26 Dec 2024, 12:38:32
-------------------------------------------------------------------------------

4.3 Output Files

Because you did not tell Stata where to put first.log or autoV2.dta, Stata saved them in its working directory (the directory in the lower left corner of the main Stata window). Go to that location and open the first.log file and you should see everything that your do file put in the Results window, but stored as a permanent file. (Hopefully your computer will open it in some text editor automatically, but if needed start Notepad or TextEdit and then use it to open the log.) While autoV2.dta is just a copy of auto.dta, if your do file had improved the data set the new version would be ready for use as the input for your next do file.
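
You can also check on the output files without leaving Stata. This is a small sketch that assumes first.do has already been run from the current working directory.

```stata
pwd                       // display the working directory
confirm file first.log    // stops with an error if the file is not there
confirm file autoV2.dta
type first.log            // print the contents of the log in the Results window
```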

4.4 How Long Should a Do File Be?

For data preparation work, it’s easy to “daisy-chain” do files: dofile1 loads dataset1, modifies it, and saves it as dataset2; dofile2 loads dataset2, modifies it, and saves it as dataset3, etc. When you’re done, a master do file can run them all. Thus there’s very little downside to breaking up one long do file into two or more short do files. Our suggestion is that you keep your do files short enough that when you’re working on one of them you can easily wrap your head around it. You also want to keep do files short so they run as quickly as possible: working on a do file usually requires running it repeatedly, so moving any code that you consider “done” to a different do file will save time.
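
A daisy chain might look like the sketch below. The file names, data set names, and new variables are all hypothetical, and the log commands are left out to keep it short. The divider comments mark where each separate do file would begin; the code is shown as one block only so it can be run as is.

```stata
* ----- step1.do: loads the raw data, saves auto_step1 -----
clear all
sysuse auto
generate price_k = price/1000          // price in thousands of dollars
save auto_step1, replace

* ----- step2.do: loads auto_step1, saves auto_step2 -----
clear all
use auto_step1
generate expensive = price_k > 6       // 1 if price is above $6,000, 0 otherwise
save auto_step2, replace
```

Once step1.do is finished, you only need to rerun step2.do while you work on it.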

4.5 Comments

Comments are text included in a do file for the benefit of human readers, not for Stata. Comments can explain what the do file does and why, and if anyone else ever needs to read and understand your do file they’ll be very grateful for good comments. But you are the most likely beneficiary of your comments, when you have to figure out how your do file works months or years after writing it. Sadly, your past self is a bad collaborator who never answers emails.

You don’t need to comment every command—most Stata code is fairly easy to read. But be sure to comment any code that required particular cleverness to write, or you’ll need to be just as clever to figure out what it does later.

Comments need to be marked as such so that Stata will not try to execute them. /* means Stata should ignore everything until it sees */, while // means Stata should ignore the rest of that line. Here’s an example of commenting code:

// make a list of cars I might be interested in buying
list make price mpg rep78 if price<4000 | (price<5000 & rep78>3)

/*
I'm mostly interested in cheap cars,
but I'll pay more for a car with a good repair record
*/

     +--------------------------------------+
     | make             price   mpg   rep78 |
     |--------------------------------------|
  3. | AMC Spirit       3,799    22       . |
  7. | Buick Opel       4,453    26       . |
 14. | Chev. Chevette   3,299    29       3 |
 18. | Chev. Monza      3,667    24       2 |
 19. | Chev. Nova       3,955    19       3 |
     |--------------------------------------|
 20. | Dodge Colt       3,984    30       5 |
 24. | Ford Fiesta      4,389    28       4 |
 29. | Merc. Bobcat     3,829    22       4 |
 34. | Merc. Zephyr     3,291    20       3 |
 38. | Olds Delta 88    4,890    18       4 |
     |--------------------------------------|
 43. | Plym. Champ      4,425    34       5 |
 51. | Pont. Phoenix    4,424    19       . |
 57. | Datsun 210       4,589    35       5 |
 62. | Honda Civic      4,499    28       4 |
 63. | Mazda GLC        3,995    30       4 |
     |--------------------------------------|
 65. | Renault Le Car   3,895    26       3 |
 66. | Subaru           3,798    35       5 |
 68. | Toyota Corolla   3,748    31       5 |
 72. | VW Rabbit        4,697    25       4 |
     +--------------------------------------+

A useful programmer’s trick is to “comment out” code you don’t want to run right now but don’t want to delete entirely. For example, if you temporarily wanted to focus on just the cars that meet the price<4000 condition, you could change that command to:

list make price mpg rep78 if price<4000 // | (price<5000 & rep78>3)

     +--------------------------------------+
     | make             price   mpg   rep78 |
     |--------------------------------------|
  3. | AMC Spirit       3,799    22       . |
 14. | Chev. Chevette   3,299    29       3 |
 18. | Chev. Monza      3,667    24       2 |
 19. | Chev. Nova       3,955    19       3 |
 20. | Dodge Colt       3,984    30       5 |
     |--------------------------------------|
 29. | Merc. Bobcat     3,829    22       4 |
 34. | Merc. Zephyr     3,291    20       3 |
 63. | Mazda GLC        3,995    30       4 |
 65. | Renault Le Car   3,895    26       3 |
 66. | Subaru           3,798    35       5 |
     |--------------------------------------|
 68. | Toyota Corolla   3,748    31       5 |
     +--------------------------------------+

When you’re ready to return to the original command, just remove the comment markers.
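
The same trick works for several lines at once with /* and */. In this sketch the two commands between the markers are skipped and only count runs; the choice of commands is just for illustration.

```stata
sysuse auto, clear

/*
list make price mpg rep78 if price<4000
summarize price if price<4000
*/

count if price<4000       // 11 cars, matching the list above
```

Note that when // follows a command on the same line, it must be preceded by at least one space.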

Three forward slashes (///) mean that the current command is continued on the next line. This allows you to break up commands over multiple lines for readability:

list make price mpg rep78 ///
    if price<4000 | (price<5000 & rep78>3)

     +--------------------------------------+
     | make             price   mpg   rep78 |
     |--------------------------------------|
  3. | AMC Spirit       3,799    22       . |
  7. | Buick Opel       4,453    26       . |
 14. | Chev. Chevette   3,299    29       3 |
 18. | Chev. Monza      3,667    24       2 |
 19. | Chev. Nova       3,955    19       3 |
     |--------------------------------------|
 20. | Dodge Colt       3,984    30       5 |
 24. | Ford Fiesta      4,389    28       4 |
 29. | Merc. Bobcat     3,829    22       4 |
 34. | Merc. Zephyr     3,291    20       3 |
 38. | Olds Delta 88    4,890    18       4 |
     |--------------------------------------|
 43. | Plym. Champ      4,425    34       5 |
 51. | Pont. Phoenix    4,424    19       . |
 57. | Datsun 210       4,589    35       5 |
 62. | Honda Civic      4,499    28       4 |
 63. | Mazda GLC        3,995    30       4 |
     |--------------------------------------|
 65. | Renault Le Car   3,895    26       3 |
 66. | Subaru           3,798    35       5 |
 68. | Toyota Corolla   3,748    31       5 |
 72. | VW Rabbit        4,697    25       4 |
     +--------------------------------------+
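
Anything after /// on the same line is ignored, so a continued command can carry a comment on each line. The sketch below is the same list command with the condition split in two; it is meant to be run from a do file, since /// is not understood in the Command window.

```stata
sysuse auto, clear
list make price mpg rep78          ///
    if price<4000                  /// cars under $4,000
    | (price<5000 & rep78>3)       // or under $5,000 with rep78 above 3
```

As with //, the /// marker needs at least one space before it.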

From now on we’ll do everything using do files.