Converting Python Loops to Parallel Loops Using Slurm (The Easy Way)
So your research requires carrying out a task multiple times, and you’ve written a Python loop to do it. Great! That saved you a lot of time writing code–but it doesn’t speed up running the code at all. But what if, instead of completing the first task, then the next, then the next, all of them could be run at once? That’s absolutely possible if you send them to the SSCC’s Slurm cluster.
This article assumes you know Python well enough to write loops, but don’t know a lot about Slurm–in fact, this article will cover everything you need to know about it. Thus we won’t use advanced Slurm features like job arrays, and will instead use Python to manage submitting jobs to Slurm.
Example: Converting CSV Files to Python Data Sets
Suppose you needed to convert ten CSV files, file1.csv through file10.csv, into Python data sets stored in Pickle format. That’s easy to do using a for loop:

import pandas as pd

for i in range(1, 11):
    dataset = pd.read_csv('file' + str(i) + '.csv')
    dataset.to_pickle('file' + str(i) + '.pickle')
If each file takes ten seconds to import, then the whole process will be done in one hundred seconds (one minute and forty seconds) and it’s probably not worth spending any more of your time on the problem. But if the files are enormous and take ten minutes to import, importing the ten files in parallel instead of one at a time would make this a ten minute job rather than a one hour and forty minute job.
Workers and Their Manager
To do so, you’ll split the loop into two Python scripts. First, you need a worker script that imports just one file (it will also need a way to know which file it is to import). More generally, the worker script will carry out whatever task is inside your loop, but only once. Second, you need a manager script that carries out the loop, but instead of executing the code inside the loop itself, it will submit a new worker to Slurm each time it goes through the loop.
For this example, the worker script will look like the following:
import sys
import pandas as pd

i = sys.argv[1]
dataset = pd.read_csv('file' + str(i) + '.csv')
dataset.to_pickle('file' + str(i) + '.pickle')
sys.argv is a list containing the arguments of the command used to run your script, starting with the name of the script itself. If you run:

python worker.py 1

then sys.argv[1] will contain 1. That’s how the manager script will tell this worker which file it is assigned to import.
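Note that command line arguments always arrive as strings. As a purely illustrative check, you could add a couple of print statements to the worker:

# If the worker was started with: python worker.py 7
print(sys.argv)        # ['worker.py', '7']
print(sys.argv[1])     # '7' is a string; use int(sys.argv[1]) if you need a number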
And that’s it!
The manager script will look like the following:
import os

for i in range(1, 11):
    os.system('ssubmit --cores=1 --mem=5g "python worker.py ' + str(i) + '"')
The os.system() function tells Linux to run a command. In this case ssubmit submits a job to Slurm. The --cores=1 argument tells Slurm each job needs just one core, and --mem=5g tells Slurm each job needs 5GB of memory. We’ll talk more about identifying how many cores and how much memory your workers need shortly.
The part in double quotes is the command to be executed: python worker.py runs worker.py, and adding str(i) at the end sets up the command line argument that worker.py uses to identify which file it should read.
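For example, on the third pass through the loop (when i is 3), the string passed to os.system() is:

ssubmit --cores=1 --mem=5g "python worker.py 3"

so Slurm starts a one-core, 5GB job that runs worker.py and tells it to import file3.csv.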
While the worker script should not do anything except the one task assigned to it, the manager can be just part of a larger script. However, the os.system() function does not wait until the command it sends to Linux is finished before telling Python to run the next command in the script. In this case, that means the script will run any code that comes after the loop before the imported data sets have been created. If your next step is to append the ten data sets into a single data set, you should put that in a separate script that you run after you know all the jobs that were submitted to Slurm are complete. If you want to have Slurm send you an email when each job is complete, add --email=your_email_address to your ssubmit command, where your_email_address should be replaced by your actual email address.
If you’re thinking “It would be nice if the process knew when all the Slurm jobs were complete so it could start the next step automatically” you’re right, and Slurm has tools for doing that. But that’s outside the scope of this article.
Worker Resources
Identifying the Computing Resources Used by a Linux Job talks about tools you can use to identify the computing resources your worker needs to run successfully. But here are some additional considerations for parallelizing a loop.
The Slurm cluster will run as many of your workers as it can, but if it runs out of resources, the remaining workers will wait in the queue until resources become available. So the more resources you assign to each worker, the fewer workers the cluster can run at once.
Cores
Most of the Slurm servers have 44 cores and you are welcome to use all of them. However, having multiple cores work on the same task always involves some overhead. You should experiment, but if you have many tasks to carry out it’s likely that 44 tasks using one core each but running all at the same time will get work done faster than running one task at a time using 44 cores. That’s why in the example we only asked for one core.
Memory
Python jobs normally need just a little more memory than the size of all the data sets they work with. Workers will crash if they run out of memory, so you can’t skimp here. But don’t use (much) more than you need. Most of the servers in the Slurm cluster have 384GB of memory and 44 cores, so about 8.7GB per core. If your workers need more memory than that per core, then memory will limit the number of workers Slurm can run at the same time rather than cores.
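One quick way to gauge how much memory a data set takes up once loaded is to ask pandas directly. This is just a check you might run interactively before settling on --mem; the file name is the example’s:

import pandas as pd

dataset = pd.read_csv('file1.csv')
# Total in-memory size in GB, with string columns counted fully
print(dataset.memory_usage(deep=True).sum() / 1e9)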
Reading and Writing Files
Each Slurm server has just one network connection to the file server, and there’s just one file server. If your workers spend a lot of their time reading and writing files, running too many workers on one server could overwhelm that server’s network connection, and running too many total workers could overwhelm the file server itself. In that case, you’ll get better performance by reducing the number of workers you run at the same time.
More on Running Workers in Parallel
Running many workers at the same time can cause some complications.
Output Files
If the workers create output files, make sure they do not all try to write to the same output file. This applies to any kind of output, not just data sets. An easy way to do that is to put the job identifier in the file name.
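For example, a worker that writes its own results might do something like the following, where the results data frame and the file name pattern are purely illustrative:

import sys
import pandas as pd

i = sys.argv[1]
results = pd.DataFrame({'value': [1, 2, 3]})   # stand-in for whatever the worker produces

# Including the job identifier in the file name keeps workers from overwriting each other
results.to_csv('results' + str(i) + '.csv', index=False)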
Random Seeds
If your worker does anything random (simulation, bootstrapping, multiple imputation, etc.) then you need to be careful about setting seeds.
Every worker should get the same seed every time it is run for reproducibility. However, different workers should never get the same seed. Also, you don’t want to reuse seeds across projects. An easy way to accomplish all these goals is to set the seed equal to the job identifier times an arbitrary number that’s different for every project:
import random

random.seed(78921*int(i))   # i is the job identifier from sys.argv, so convert it to an integer
Efficiency
Your worker script will be run many times (perhaps many, many times) so do not have it do anything that it doesn’t absolutely need to do.
When you first wrote your loop, you may have loaded a data set, cleaned it up a bit, and then started the loop. The worker script will have to load the data it needs, but don’t have every worker repeat that data cleaning. Instead, do the data cleaning once and save the result as a Pickle file the workers can use immediately. And if the data set contains twenty variables but the worker only needs five of them, consider including only those five in the file the workers load so it loads faster. It won’t matter much if you’re only going to run tens of workers, but if you will run tens of thousands of workers, every little bit counts.
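As a sketch of that workflow (the raw file name, variable names, and cleaning step below are all made up for illustration), the one-time preparation script might look like this, after which every worker starts with pd.read_pickle('prepped.pickle') instead of repeating the cleaning:

import pandas as pd

# Run once, before submitting any workers
raw = pd.read_csv('raw_survey.csv')                 # hypothetical raw data file
raw = raw.dropna(subset=['income'])                 # hypothetical cleaning step
keep = ['id', 'age', 'income', 'state', 'year']     # only the variables the workers actually use
raw[keep].to_pickle('prepped.pickle')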
Reading and writing to disk is slow, so avoid it whenever possible.