Converting Stata Loops to Parallel Loops Using Slurm (The Easy Way)
So your research requires carrying out a task multiple times, and you’ve written a Stata loop to do it. Great! That saved you a lot of time writing code–but it doesn’t speed up running the code at all. But what if instead of completing the first task, then the next, then the next, all of them could be run at once? That’s absolutely possible if you send them to the SSCC’s Slurm cluster.
This article assumes you know Stata well enough to write loops, but don’t know a lot about Slurm–in fact this article will cover everything you need to know about Slurm. Thus we won’t use advanced features of Slurm like job arrays, and instead use Stata to manage submitting jobs to Slurm.
Example: Converting CSV Files to Stata Data Sets
Suppose you needed to convert ten CSV files, file1.csv through file10.csv, into Stata data sets. That’s easy to do using a forvalues loop:

forvalues i = 1/10 {
    clear
    import delimited file`i'.csv
    save file`i', replace
}
If each file takes ten seconds to import, then the whole process will be done in one hundred seconds (one minute and forty seconds) and it’s probably not worth spending any more of your time on the problem. But if the files are enormous and take ten minutes to import, importing the ten files in parallel instead of one at a time would make this a ten minute job rather than a one hour and forty minute job.
Workers and Their Manager
To do so, you’ll split the loop into two do files. First, you need a worker do file that imports just one file (it will also need a way to know which file it is to import). More generally, the worker do file will carry out whatever task is inside your loop, but only once. Second, you need a manager do file that carries out the loop, but instead of executing the code inside the loop itself, it will submit a new worker to Slurm each time it goes through the loop.
For this example, the worker do file will look like the following:
args i
import delimited file`i'.csv
save file`i', replace
The args command copies arguments from the command line used to run the do file into macros. If you run:

stata -b do worker 1

then the macro i will contain 1. That’s how the manager do file will tell this worker which file it is assigned to import.
And that’s it! Note that we don’t need to clear the memory each time because each worker starts in a fresh instance of Stata.
The manager do file will look like the following:
forvalues i = 1/10 {
    shell ssubmit --cores=1 --mem=5g "stata -b do worker `i'"
}
The Stata shell command tells Linux to run the command that follows. In this case ssubmit submits a job to Slurm. The --cores=1 argument tells Slurm each job needs just one core, and --mem=5g tells Slurm each job needs 5GB of memory. We’ll talk more about identifying how many cores and how much memory your workers need shortly.
The part in double quotes is the command to be executed: stata -b do worker runs worker.do in batch mode, and the `i' at the end sets up the command line argument that worker.do uses to identify which file it should read.
While the worker do file should not do anything except the one task assigned to it, the manager can be just part of a larger do file. However, the shell command does not wait until the command it sends to Linux is finished before telling Stata it’s ready to run the next command in the do file. In this case, that means the do file will run any code that comes after the loop before the imported data sets have been created. If your next step is to append the ten data sets into a single data set, you should put that in a separate do file that you run after you know all the jobs that were submitted to Slurm are complete. If you want to have Slurm send you an email when each job is complete, add --email=your_email_address to your ssubmit command, where your_email_address should be replaced by your actual email address.
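For example, the ssubmit call inside the manager’s loop might look something like this (a sketch combining the flags discussed above; substitute your actual email address):

shell ssubmit --cores=1 --mem=5g --email=your_email_address "stata -b do worker `i'"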
If you’re thinking “It would be nice if the process knew when all the Slurm jobs were complete so it could start the next step automatically” you’re right, and Slurm has tools for doing that. But that’s outside the scope of this article.
Worker Resources
Identifying the Computing Resources Used by a Linux Job talks about tools you can use to identify the computing resources your worker needs to run successfully. But here are some additional considerations for parallelizing a loop.
The Slurm cluster will run as many of your workers as it can, but if it runs out of resources, workers will wait in the queue until resources become available. So the more resources you assign to each worker, the fewer workers it can run at once.
Cores
The SSCC has Stata MP licensed for 32 cores on all our Linux servers. However, having multiple cores work on the same task always involves some overhead. You should experiment, but if you have many tasks to carry out it’s likely that 32 tasks using one core each but running all at the same time will get work done faster than running one task at a time using 32 cores. That’s why in the example we only asked for one core.
Memory
Stata jobs normally need just a little more memory than the size of the data set they work with. Workers will crash if they run out of memory, so you can’t skimp here. But don’t use (much) more than you need. Most of the servers in the Slurm cluster have 384GB of memory and 44 cores, so about 8.7GB per core. If your workers need more memory than that per core, then memory will limit the number of workers Slurm can run at the same time rather than cores.
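One way to gauge this, assuming you can open Stata interactively and load one representative file first, is to import it and look at Stata’s memory report:

import delimited file1.csv
memory

The memory allocated to data, plus a comfortable margin, is a reasonable starting point for --mem.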
Reading and Writing Files
Each Slurm server has just one network connection to the file server, and there’s just one file server. If your workers spend a lot of their time reading and writing files, running too many workers on one server could overwhelm that server’s network connection, and running too many total workers could overwhelm the file server itself. In that case, you’ll get better performance by reducing the number of workers you run at the same time.
More on Running Workers in Parallel
Running many workers at the same time can cause some complications.
Logs and Other Files
When you run stata -b do worker, Stata automatically creates a log file called worker.log. However, all the workers will try to write to this log file at the same time, so it will usually get garbled. If you need your workers to create log files for debugging purposes, put an explicit log using command in the worker do file and include the job identifier in the log name. In our example, that might be:

log using worker`i'.log, replace

That way you’ll get ten different log files (worker1.log, worker2.log, etc.), one for each worker, and they won’t be garbled.
The same applies to any other files the worker creates, such as data sets.
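Putting these pieces together, a version of the example worker do file that keeps its own uniquely named log (a sketch using the same file names as above) might look like:

args i
log using worker`i'.log, replace
import delimited file`i'.csv
save file`i', replace
log close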
Random Seeds
If your worker does anything random (simulation, bootstrapping, multiple imputation, etc.) then you need to be careful about setting seeds.
Every worker should get the same seed every time it is run for reproducibility. However, different workers should never get the same seed. Also, you don’t want to reuse seeds across projects. An easy way to accomplish all these goals is to set the seed equal to the job identifier times an arbitrary number that’s different for every project:
set seed `=798153*`i''
Efficiency
Your worker do file will be run many times (perhaps many, many times) so do not have it do anything that it doesn’t absolutely need to do.
When you first wrote your loop you may have loaded a data set, cleaned it up a bit, and then started the loop. The worker do file will have to load the data it needs, but don’t have every worker repeat that data cleaning. Instead, do the data cleaning once and save the result as a data set the workers can use immediately. And if the data set contains twenty variables and the worker only needs to use five, consider only including those five in the data set the worker needs to load so it loads faster. It won’t matter much if you’re only going to run tens of workers, but if you will run tens of thousands of workers every little bit counts.
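As a sketch of that one-time preparation step (the file and variable names here are placeholders, not part of the example above), it might look like:

* placeholder names: rawdata, workerdata, and the variable list are illustrative
use rawdata, clear
* do the cleaning here once so workers don't repeat it
keep id x1 x2 x3 x4
save workerdata, replace

Each worker can then load workerdata and go straight to its task.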
Reading and writing to disk is slow, so avoid it whenever possible. Don’t have workers write log files once you’re done debugging. If you need to switch between data sets, use frames rather than saving one data set and loading the other.
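If you haven’t used frames (available in Stata 16 and later), here is a minimal sketch of keeping a second data set in memory rather than saving one and loading the other (the frame and file names are illustrative):

* placeholder names: lookup and lookupdata are illustrative
frame create lookup
frame lookup: use lookupdata
* the main data stays in the default frame; switch with: frame change lookup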