8 Project Management
Most Stata work takes place in the context of a research project. In a typical project, you have a research question you want to answer and a dataset that you think will answer it, but the dataset isn’t yet in a form that can actually answer the question. Good project management will help you get from raw data to completed analysis efficiently and reproducibly.
8.1 Simple Best Practices
Books have been written about how to manage research projects properly. While we won’t go into that level of detail here, we will suggest a few simple best practices that can save a tremendous amount of time and reduce the probability of making serious mistakes.
8.1.1 Don’t Skip the First Steps
In First Steps with Your Data, we laid out a process for both cleaning up your dataset and gaining a deep understanding of it. Do not skip these steps! They will save you time down the road. A particularly painful scenario they can help you avoid is discovering after months of work that your dataset cannot actually answer your research question. Yes, I have seen this happen.
8.1.2 Begin with the End in Mind
Before you write any code, think through what form the data needs to be in so you can analyze it. What should an observation represent? What variables will each observation need to contain? The answers to these questions will most likely be determined by the statistical techniques you plan to use.
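For example, suppose your analysis requires one observation per person per year, but the raw data has one observation per person with a separate income variable for each year. Then you know a reshape is in your future. A minimal sketch, with hypothetical variable and file names:

    * Hypothetical raw data: one row per person, with income2019-income2021
    use raw_persons, clear
    reshape long income, i(person_id) j(year)
    * Now each observation represents a person-year,
    * with variables person_id, year, and income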
8.1.3 Don’t Try to Do Everything at Once
Once the goal is clear in your mind, don’t try to write one massive do file that gets you there in one step, running it for the first time only when it’s “done.” If you do, the do file will most likely contain a large number of bugs. Worse, you may find that to make one part work you need to do something differently than you originally planned, and then you’ll have to change everything that follows.
It’s far better to write a bit of code, test and debug it, then write a little more, test and debug it, and so forth. How much to write at a time depends on how difficult the code is for you. If you’re very confident in what you’re doing, go ahead and write ten or twenty lines at a time. If what you’re doing is brand new to you, write a single line and then test it, or even work it out interactively in the Command window and copy it into your do file once it works.
8.1.4 Split Your Code into Multiple Do Files
If a do file gets too long, the write-test-debug cycle bogs down: you’ll spend a lot of time waiting for code you know is good to run before Stata gets to the code you just added and need to test. More generally, a do file should be short enough that while you’re working on it you can remember everything it does.
To break up a long do file into smaller pieces, just pick a logical stopping point, have the do file save the dataset at that point, then create a new do file that uses that dataset as its starting point, as in the sketch below. Just remember: never save your output dataset over your input dataset.
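For example, the end of one do file and the beginning of the next might look like this (the file names are hypothetical):

    * End of clean_demographics.do: save results under a new name
    save demographics_clean, replace

    * Beginning of merge_wages.do: pick up where the last do file left off
    use demographics_clean, clear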
Avoid the practice of running parts of a do file as a substitute for breaking the do file into multiple pieces. Running part of a do file can be useful, but it’s inherently not reproducible because it depends on clicking on the right thing. It can also introduce errors, such as code that crashes because it depends on prior code that was not run or because it was run more than once.
8.1.5 Put Code for Different Purposes in Different Do Files
While data wrangling is a linear process with each step depending on what came before, exploratory analysis often has multiple independent branches as you try various things. Then, when you’ve identified the results you want to report or publish, you want the code that produces them to be as clean, clear, and concise as possible. Thus it’s best to have separate do files for each of these purposes.
For most projects there should be a “final” dataset that’s used for all analysis. (Just don’t call it “final.” That practically guarantees you’ll need to make changes to it later.) Then you can open it up interactively and try things, write do files that analyze it in different ways, and generally experiment at will without running the risk of forgetting that, for example, the do file that ran the linear regressions also did a bit more recoding. It also means you can ask someone else about your statistical analysis (say, an SSCC statistical consultant) without making them wait ten minutes for your data cleaning code to run first.
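Each analysis do file can then start from the same place. A sketch, with hypothetical file and variable names:

    * regressions.do: one branch of the analysis
    use analysis_data, clear
    regress outcome treatment age i.education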
8.1.6 Check Your Work
Programming errors can be subtle and very difficult to catch by just staring at your code. It’s usually more effective to spend your time comparing your results to what they should be. Of course this depends on having some sense of what they should be: be constantly on the lookout for information you can use to check your work.
Examine summary statistics and frequencies regularly as you carry out data preparation, especially when you create new variables or change the structure of your data. See if what you get is plausible. If the results change, be sure you can explain why.
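For example (the variable names are hypothetical):

    * Quick checks after creating or recoding variables
    summarize income, detail             // plausible mean, minimum, maximum?
    tabulate employment_status, missing  // expected categories and counts?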
Spend even more time looking at individual cases. Use the browse command, often specifying a subset of the data so you can focus on what’s currently relevant, and compare what your do file did to individual cases with what you meant it to do. If you have different types of cases, be sure to look at samples of each.
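For example, to focus on an unusual group of cases (hypothetical names again):

    * Inspect just the cases relevant to the current check
    browse id wage hours if hours > 60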
If you do find problems, looking at cases is the best way to solve them. What kinds of cases get the wrong answers? How exactly are they wrong? Figuring out those details will point you to the particular commands that need to be corrected.
8.1.7 Make Your Project Reproducible
With proper organization, you should be able to reproduce your entire project at will.
Start with the data as you obtained it. Your first do file will read it in, make some changes, and save the results in a different file. Your second do file will read in the output from the first do file, make further changes, and then save its results in another separate file. Repeat until your data wrangling is complete. Then all your analysis do files will read the same final dataset and analyze it in various ways.
If you discover errors or need to make changes, having a well-organized and reproducible project will save you significant amounts of time. To track down an error, run your do files one by one, checking the results after each, until the error appears. Then you’ll know which do file needs to be fixed. Once the error is corrected or the change is made, consider whether subsequent do files also need to be changed. Once all the needed changes are made, simply rerun all your do files.
Write a “manager” do file that runs all the do files required by the project, in the proper order (recall that one do file can run another simply by running the command do other_do_file). Your goal, which is usually easy to achieve, is to be able to re-run your entire project simply by re-running the manager do file. This will be very valuable to anyone else who has to work with your code, but also to the future you who has to try to remember how it all worked months or years later. Your past self is a bad collaborator who never answers emails.
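A manager do file can be as simple as a list of do commands (the do file names here are hypothetical):

    * master.do: reproduce the entire project from the original data
    clear all
    do clean_demographics   // read raw data, initial cleaning
    do merge_wages          // combine sources, save analysis_data
    do descriptives         // summary statistics and tables
    do regressions          // main models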
8.2 Case Studies
Two stories that illustrate the importance of proper project management:
One day a professor and her research assistant came to me for assistance. They were working with census data from multiple countries over many years, so a lot of data wrangling was required to make the various datasets compatible and then combine them. The RA had been working on this data wrangling for about six months.
Then the professor decided to run some basic frequencies on the dataset they had. The results were clearly wrong. The RA must have made a mistake at some point, and they came to me hoping I’d be able to fix the problem. However, I learned that the RA had been doing all his work interactively. He would open a dataset, do things to it, and then save it. He had only a general recollection of what he had done, and had no do files, logs or intermediate datasets to fall back on. Since everything he had created was useless, the project had to be started again from the original data.
The next time I saw her, the professor introduced me to her new RA and had me teach that new RA how to work reproducibly.
On a happier note, a grad student once came to me because, in preparing to present her research, she discovered that the values of one variable for three observations had somehow been corrupted (I have never seen that happen before or since). There was no way to know how that had affected her results.
Fortunately she had done everything using reproducible do files. We got the data from the source again, checked that it was intact this time, and then she re-ran all her code. Months of work were replicated in less than 15 minutes, and she was able to proceed with her presentation.
Far more could be said about project management (we haven’t even mentioned collaborating with others). You might find J. Scott Long’s The Workflow of Data Analysis Using Stata helpful.
Review the headings of this chapter and think about your current research project. How can you better implement the best practices described?
If you don’t have a research project yet, plan how you will implement those best practices when you do.
If you have a current research project with a significant amount of code, try to re-run all your code. Does it work? Is it easy to do? If not, make it work, and make it easy to do.