5 Slurm
When you submit a job to Slurm, you tell Slurm how many cores and how much memory you need and then it finds a server in its cluster that has those resources available and runs your job on it. Your job gets exclusive use of those resources. If no server has the resources you need available, your job will wait in the queue until they become available. Slurm Status will show you what resources are immediately available.
Submitting a job to Slurm is very easy; figuring out what resources it needs can be a bit trickier. The more accurate your estimates of what your job needs, the more efficiently Slurm can use the servers, but please don’t let figuring out what you need prevent you from using Slurm. Make your best estimate, guess low if you’re not sure, and put Slurm to work for you.
The SSCC Slurm clusters include some servers that were purchased by the Department of Economics, the Wisconsin School of Business, the School of Medicine and Public Health, or individual faculty members (many with matching funds from the Office of the Vice Chancellor of Research and Graduate Education through a Research Core Revitalization Program grant). Members of those groups have priority on their servers. See the section on Partitions and Priorities for details.
5.1 Using Slurm
5.1.1 Submitting Jobs
To submit a job to Slurm or SlurmSilo, log into Linstat or LinSilo and use the ssubmit command.
Slurm Assistant is a web form that will ask you questions about your job and craft an appropriate ssubmit command for you. You can then copy the command and paste it into your Linstat session. We highly recommend it if you’re new to Slurm, want to use options you haven’t used before, or are putting together a complicated submission. Note that Slurm Assistant was designed for Slurm, and the command it gives you may need some tweaking before it will work in SlurmSilo.
On the other hand, the ssubmit command is not complicated. The basic syntax is:
ssubmit --cores=C --mem=Mg "command"
Where C is the number of cores or threads to use, M is the amount of memory to use in gigabytes, and command is the command you’d normally use to run the job directly on the server. See the Program Information chapter for sample ssubmit commands for various programs. For example:
ssubmit --cores=1 --mem=10g "R CMD BATCH --no-save run_model_1.R"
will run the R script run_model_1.R using one core and 10GB of memory.
5.1.1.1 Partitions
To use a partition other than the default sscc partition, add --partition= or just --part= followed by the name of the partition you want to use. Partitions and Priorities describes the various partitions and when you might want to use them, but the default partition will be fine for most jobs.
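For example, to send the job from the earlier example to the short partition (the resource numbers are just illustrative):
ssubmit --cores=1 --mem=10g --part=short "R CMD BATCH --no-save run_model_1.R"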
5.1.1.2 Email Notifications
By default Slurm will send an email message to your preferred SSCC email address when your job finishes. That email will also tell you how much of the resources you reserved were actually used by your job. You can tell Slurm not to send you email with --noemail. This is highly recommended if you’re submitting large numbers of jobs at once.
You can tell Slurm to send notifications to a different email address with --email=address, where address should be replaced by the email address you want to use. You can also change your preferred SSCC email address.
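For example, to send the notification for the earlier R job to a different address (the address here is just a placeholder), or to turn off email entirely:
ssubmit --cores=1 --mem=10g --email=bbadger@wisc.edu "R CMD BATCH --no-save run_model_1.R"
ssubmit --cores=1 --mem=10g --noemail "R CMD BATCH --no-save run_model_1.R"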
5.1.1.3 Preemption
Jobs on some servers may be preempted by higher priority jobs and have to start over; see Partitions and Priorities for details. You can require that your job run on a server where it cannot be preempted by adding --nopreempt. This may be useful for long jobs that would lose a lot of work if preempted, but if none of those servers are available your job will have to wait in the queue until one becomes available. We anticipate that jobs will rarely be preempted. Note that not all partitions have servers where jobs cannot be preempted.
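For example, to make sure a long-running version of the earlier R job cannot be preempted (the script name here is hypothetical):
ssubmit --cores=1 --mem=10g --nopreempt "R CMD BATCH --no-save long_model.R"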
5.1.1.4 GPUs
There are three servers in the Slurm and SlurmSilo clusters with GPUs:
- slurm138 has two A100 GPUs with 80GB of memory each
- slurm137 has two A100 GPUs with 40GB of memory each
- slurmsilo139 has two L40S GPUs with 48GB of memory each
Two servers with T4 GPUs are temporarily being used for other tasks but will be returned to Slurm soon.
To Slurm, a GPU is a “generic resource” or gres. To use a GPU, add the following to your ssubmit command:
--gres=gpu
To use two GPUs, add the following:
--gres=gpu:2
To use an A100 GPU specifically, add the following:
--gres=gpu:a100
Right now this is redundant, but it will be important again when the older and smaller T4 GPUs are back in Slurm.
To use two A100s, add:
--gres=gpu:a100:2
To use an A100 with 80GB of memory specifically, add:
--gres=gpu:a100-80gb
Slurm will create an environment variable called CUDA_VISIBLE_DEVICES that contains the number of the GPU assigned to your job. Your job should look up the value of CUDA_VISIBLE_DEVICES so it knows which GPU to use, though many programs do that automatically. (If you request both GPUs, CUDA_VISIBLE_DEVICES will be a comma-separated list of numbers, but in that case you don’t need to look at it since it will always be the same.)
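For example, the following asks for one A100 plus some cores and memory for a hypothetical Python training script (adjust the resources to match your job):
ssubmit --cores=8 --mem=64g --gres=gpu:a100 "python train_model.py"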
5.1.1.5 Multiple Servers
You can run jobs that will use multiple servers with the --nodes= switch. The --cores= and --mem= switches then specify the number of cores and amount of memory to use on each server. However, depending on how your code does its parallelization, you may also need to specify either the total number of tasks to run with --ntasks= or the number of tasks to run on each server with --ntasks-per-node=.
The newer servers in the cluster have high-speed RoCE networking designed for MPI. If your job uses MPI or a similar protocol for communicating between processes running in parallel, you should require that all of the processes run on servers with RoCE by adding --constraint=roce.
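As a sketch, the following asks for two RoCE servers, running eight MPI tasks on each with one core per task; the program name is hypothetical, and the right combination of switches depends on how your code parallelizes:
ssubmit --nodes=2 --ntasks-per-node=8 --cores=1 --mem=64g --constraint=roce "mpirun ./my_mpi_program"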
5.1.1.6 Output that Goes to the Screen
Slurm will send any text output that would normally go to your screen to a file called slurm-J.out, where J is the JOBID of the job. If this output is important, you may want to send it to a file of your choice instead by adding >outputfile to the end of your command, where outputfile should be replaced by the desired file name. For example:
matlab -nodisplay < my_matlab_script.m > my_matlab_script.log
Usually error messages are sent to the Slurm output file, but with some programs they are lost. In those cases you can send them to a separate error file by adding the following to your ssubmit command:
--error="slurm-%j.err"
This will give the error file the same name as the output file, but with the extension .err. This can also be useful if you just want to look at the error messages separately.
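Putting these together, a full submission for the MATLAB example above might look like this (the cores and memory are illustrative):
ssubmit --cores=1 --mem=20g --error="slurm-%j.err" "matlab -nodisplay < my_matlab_script.m > my_matlab_script.log"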
5.1.1.7 sbatch and srun
ssubmit is a wrapper for the standard Slurm sbatch command, and you can use any of its switches as well. Switches always go before the command in quotes.
You may choose to use sbatch or srun instead of ssubmit. Some things to be aware of if you do:
- The --cores= switch used by ssubmit translates to --cpus-per-task= in standard Slurm.
- The --nopreempt switch translates to --constraint=sscc-nopreempt if you are submitting your job to a general partition. If you are submitting to an Econ faculty partition it translates to --constraint=econ-fac-nopreempt, and if you are submitting to an Econ grad student partition it translates to --constraint=econ-grad-nopreempt.
- ssubmit automatically adds --prefer=sscc-nopreempt, or its equivalent for Econ partitions, to jobs that don’t specify --nopreempt. This tells Slurm to run the job on a server where it cannot be preempted if one is available, but to use other servers rather than waiting. You may want to do the same. (We’ve found that if you use --prefer with a job array it becomes a requirement rather than a preference.)
- If you want email notifications, use --mail-type=fail,end --mail-user=address, where address can be either your SSCC username, in which case notifications will go to your preferred SSCC email address, or an actual email address.
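Putting those translations together, a rough sbatch equivalent of the earlier ssubmit example might look like the following; the username is a placeholder, and --wrap is the standard sbatch way to submit a single command rather than a script:
sbatch --cpus-per-task=1 --mem=10g --prefer=sscc-nopreempt --mail-type=fail,end --mail-user=bbadger --wrap="R CMD BATCH --no-save run_model_1.R"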
5.1.2 Managing Jobs
To check on your job or see what other jobs are in the queue, type:
squeue -a
(-a means to show all jobs, including jobs in partitions you can’t use.) In the output, JOBID tells you the internal identifier Slurm assigned to your job, and NODELIST tells you which node(s) your job is running on. If it says (Resources) under NODELIST, that means the computing resources needed to run your job are not available at the moment. Your job will wait in the queue until other jobs finish, and then start as soon as possible. (BeginTime) means your job was preempted by a higher priority job and put back in the queue; Slurm will run it again shortly if resources are available.
To cancel a job use:
scancel JOBID
where JOBID should be replaced by the identifier of the job you want to cancel. You can cancel multiple jobs by replacing JOBID with a range specified as {start..end}, where start and end should be replaced by numbers.
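For example, the following cancels jobs 123456 through 123460 (the job IDs are made up; the range uses Bash brace expansion):
scancel {123456..123460}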
You can get a useful overview of the cluster with Slurm Status, which lists all the servers in the cluster and how much of their resources are free.
sinfo will summarize the partitions and servers in the cluster that you can use, and their current state. For information about all the partitions and servers, add -a.
5.2 About Slurm
5.2.1 Partitions & Priorities
When you submit a job to Slurm, you submit it to a partition. Partitions can have different priorities, time limits, and other properties. While a job is only submitted to one partition, servers can belong to (take jobs from) multiple partitions.
Some of the Slurm servers were purchased by particular groups and their affiliates have priority on those servers. Others were purchased by individual faculty members, and they have priority on those. High priority users can claim that priority by submitting to an appropriate partition, as described below.
A job submitted to a high priority partition will preempt a job running in a lower priority partition if it cannot run otherwise. A preempted job will go back in the queue and be run again as soon as the resources it needs are available, but it will have to start over. Some partitions (notably the default sscc partition) contain a mix of servers where jobs may be preempted and servers where jobs will not be preempted because no one else has priority on them. By default, jobs submitted to these partitions with ssubmit will run on the servers where they cannot be preempted if those servers are available, but you can use --nopreempt to specify that your job should only run on a server where it cannot be preempted. This is useful for long jobs where you’ll lose a lot of work if they are preempted. However, your job may have to wait in the queue until one of those servers becomes available. In practice, jobs are rarely preempted.
Jobs submitted to partitions with equal priority will not preempt each other. In choosing which job to run next, Slurm considers the amount of Slurm computing time each user has used recently (past use decays exponentially with a half-life of one week) and how long each job has been waiting to run. If someone has recently submitted a large number of jobs, their priority will be low: a job you submit will not preempt their jobs, but it is likely to be chosen to run next rather than theirs.
If you submit a job without specifying a partition, it goes to the sscc partition. You can specify a different partition with the --partition= or just --part= switch.
Slurm Status will show you what resources are immediately available and what partitions they belong to.
5.2.1.1 General Partitions
Partition Name | Servers | Max Job Length | Notes |
---|---|---|---|
sscc | slurm[001-005], slurm[101-135], slurm[137-138] | 10 Days | Default partition |
short | slurm[101-138] | 6 Hours | For short jobs |
long | slurm135 | 30 Days | For long jobs |
slurm136 is reserved for short jobs (less than 6 hours). To use it, submit your jobs to one of the short partitions; they may also run on other servers if those are available.
slurm135 runs long jobs (up to 30 days) submitted to the long partition, but it also takes jobs from other partitions. If you need to run jobs that will take longer than 30 days, contact the Help Desk.
slurm[131-136] were purchased by the SSCC and jobs run on them will never be preempted by higher priority jobs. You can require that your job run on one of them by adding --nopreempt to your ssubmit command. If you use srun or sbatch, add --constraint=nopreempt.
5.2.1.2 Economics Partitions
The following partitions are available to affiliates of the Department of Economics:
Partition Name | Servers | Max Job Length | Notes |
---|---|---|---|
econ | slurm[101-130] | 10 Days | |
econ-fac | slurm[101-121], slurm[128-130] | 10 Days | For Econ faculty, limit of 256 cores |
econ-fac-short | slurm[101-123], slurm[128-130] | 2 Hours | For short jobs |
econ-grad | slurm[101-126], slurm[128-130] | 10 Days | For Econ grad students |
econ-grad-short | slurm[124-130] | 6 Hours | For short jobs |
Jobs submitted to these partitions will preempt jobs submitted to the general partitions. Jobs submitted to the faculty or grad student partitions will preempt jobs submitted to the general econ partition. Jobs submitted to the faculty partitions will preempt jobs submitted to the grad student partitions, but note that four servers (slurm[124-127]) are reserved for grad students. The Economics partitions include some servers purchased by individual faculty members, and their jobs preempt all others on their servers. The econ partition allows you to use both the faculty and grad student servers, but at the risk of being preempted.
The servers purchased by the Economics Department, and most of those purchased by faculty in Economics, have 256GB of memory. If you need more memory than that, you’ll want to submit your jobs to the general partitions, where there are servers with 768GB or 1,024GB of memory.
Jobs submitted to a faculty partition and run on slurm[114-123] will never be preempted. You can require that your job run on one of them by adding --nopreempt to your ssubmit command. Use --constraint=econ-fac-nopreempt with srun or sbatch.
Jobs submitted to a grad student partition and run on slurm[124-127] will never be preempted. You can require that your job run on one of them by adding --nopreempt to your ssubmit command. Use --constraint=econ-grad-nopreempt with srun or sbatch.
5.2.1.3 Wisconsin School of Business Partition
The wsb-gpu partition is available to affiliates of the Wisconsin School of Business:
Partition Name | Servers | Max Job Length | Notes |
---|---|---|---|
wsb-gpu | slurm[137-138] | 10 Days | For jobs that need a GPU |
Jobs submitted to it will preempt jobs submitted to the general partitions. Please use this partition only for jobs that require the A100 GPUs these servers contain. We appreciate WSB contributing these servers with their powerful GPUs, and WSB affiliates have access to many non-GPU servers in return. We would hate to see non-GPU jobs submitted by WSB affiliates preempt GPU jobs submitted by other SSCC members.
5.2.1.4 School of Medicine and Public Health Partition
The smph partition is available to affiliates of the School of Medicine and Public Health:
Partition Name | Servers | Max Job Length |
---|---|---|
smph | slurm[001-005] | 10 Days |
Jobs submitted to it will preempt jobs submitted to the general partitions. These servers have 44 cores and 384GB of memory. Use the sscc partition if you need more of either.
5.2.1.5 Individual Partitions
Faculty who purchased servers have their own partition containing those servers. Faculty who purchased a share of a server have a partition containing that server where their usage is limited to their share. Your partition name is your last name (in all lower case). Jobs submitted to these partitions take priority over all other jobs and may run for up to 90 days (the time limit allows SSCC staff to maintain the servers).
5.2.1.6 SlurmSilo Partitions
Partition Name | Servers | Max Job Length | Notes |
---|---|---|---|
sscc | slurmsilo[001-004], slurmsilo011 | 7 Days | Default partition |
short | slurmsilo[001-005], slurmsilo011 | 6 Hours | For short jobs |
smph | slurmsilo[001-004], slurmsilo011 | 7 Days | High priority partition for SMPH researchers |
smph-short | slurmsilo[001-005], slurmsilo011 | 6 Hours | High priority partition for SMPH researchers running short jobs |
gpu | slurmsilo[139] | 7 Days | GPU server |
vikas | slurmsilo[139] | 7 Days | High priority partition for members of Vikas Singh’s research group |
slurmsilo005 is reserved for short jobs. To use it, submit your jobs to one of the short partitions; they may also run on other servers if those are available.
Most of the SlurmSilo servers were purchased by SMPH, so SMPH researchers have priority on all of them. They can claim that priority by submitting their jobs to an smph partition.
slurmsilo139 was purchased by Vikas Singh’s research group and members of that group have priority on it. They can claim that priority by submitting their jobs to the vikas partition.
5.2.2 Cluster Specifications
The Slurm (non-Silo) cluster currently consists of the following servers:
Servers | Cores | Memory (GB) | GPU |
---|---|---|---|
slurm[130-136] | 128 AMD | 1024 | |
slurm138 | 128 AMD | 512 | 2x Nvidia A100, 80GB memory each |
slurm137 | 128 AMD | 512 | 2x Nvidia A100, 40GB memory each |
slurm[128-129] | 128 AMD | 512 | |
slurm[101-127] | 128 AMD | 256 | |
slurm[001-005] | 44 Intel | 384 | |
The SlurmSilo cluster currently consists of the following servers:
Servers | Cores | Memory (GB) | GPU |
---|---|---|---|
slurmsilo011 | 80 Intel | 768 | |
slurmsilo[001-005] | 44 Intel | 384 | |
slurmsilo139 | 64 Intel | 512 | 2x Nvidia L40S, 48GB memory each |
5.3 Identifying Your Job’s Resource Needs
Slurm will give your job exactly the resources you reserve for it. If you reserve fewer cores than your job can use, it will run more slowly than it could have. If you reserve less memory than it needs, it will crash. On the other hand, if you reserve more cores and memory than your job will actually use, the excess will sit idle and no one else can use it until your job is done.
Your estimates of your job’s resource needs will always be imperfect unless you’re running lots of identical jobs, and that’s okay. Just do your best. When in doubt, guess low.
Be mindful that other people need to use the Slurm cluster, but think in terms of taking turns rather than sharing. If your job will run fastest using all the cores in a server, then use all the cores in a server and get out of the next person’s way, especially if your job uses lots of memory.
Memory tends to be the bottleneck in the SSCC Slurm cluster rather than cores. If you need a lot of memory (>100GB, either for a single job or across a batch of jobs that run at the same time), be sure your code uses memory efficiently. Delete dataframes that are no longer needed, for example. Then pay close attention to how much memory your jobs actually use. Keep in mind that the Slurm cluster has many servers with 256GB of memory, so you can run many more jobs if you limit your memory usage to less than that. Also keep in mind how many jobs can run per server: a server with 1024GB of memory can run three jobs that need 340GB each, but only two jobs that need 350GB each.
On the other hand, if you reserve all the cores in a server, you can just reserve all its memory too rather than trying to pin down exactly how much you need: no one else can use a server’s memory if you’re using all its cores.
To identify your job’s resource needs, first think through the job itself. If you told the program a number of cores to use, tell Slurm to reserve the same number. If it asked for a number of threads, reserve one core per thread. You’ll need at least enough memory to load your data (unless you’re running SAS). The Program Information chapter has some notes about typical resource usage for some programs. But you can get a much better sense of what a job needs by first starting it on Linstat for a few minutes and seeing what it uses before you submit it to Slurm, and then by reading the efficiency report in the email Slurm will send you when your job is complete.
5.3.1 Monitoring a Job
To start a job in batch mode so you can monitor its resource usage, use the command given for your program in the Program Information chapter. It’s the same command that you’re putting in quotes for ssubmit, but with an ampersand (&) at the end.
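For example, to run the R script from earlier directly on Linstat in the background:
R CMD BATCH --no-save run_model_1.R &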
Then run top to see the top jobs on the server. The output will look something like this (usernames have been hidden for privacy, and the list has been truncated):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
911054 user1 20 0 22.6g 21.9g 4384 R 4288 2.9 81566:20 a.out
275522 user2 20 0 3738648 999460 55320 R 314.5 0.1 42163:08 python
1028196 user3 20 0 10.4g 1.4g 333524 S 99.3 0.2 1484:59 MATLAB
1028236 user3 20 0 10.4g 1.3g 332884 S 99.3 0.2 1487:13 MATLAB
1028412 user3 20 0 10.4g 1.3g 335700 S 99.3 0.2 1487:55 MATLAB
1028484 user3 20 0 10.4g 1.3g 334660 S 99.3 0.2 1480:15 MATLAB
1028543 user3 20 0 10.4g 1.3g 334216 S 99.3 0.2 1477:58 MATLAB
1028728 user3 20 0 10.3g 1.4g 333268 S 99.3 0.2 1478:36 MATLAB
1028312 user3 20 0 10.2g 1.3g 332724 S 98.7 0.2 1486:45 MATLAB
1028198 user3 20 0 10.4g 1.3g 334844 S 98.4 0.2 1485:36 MATLAB
The most important columns are RES (resident memory) and %CPU.
A ‘g’ at the end of the RES column indicates that the memory is given in gigabytes; if there is no suffix, the value is in kilobytes and the amount used is comparatively small. Thus the job being run by user1 is using 21.9GB of memory, while the job being run by user2 is using 999,460 kilobytes, or roughly 1GB. Round memory up when you submit your job to Slurm so you have a bit of a safety margin.
%CPU is measured in terms of cores, so a %CPU of 100 means a process is using one core all of the time. However, you’ll rarely see 100% because processes frequently have to wait for data to be retrieved from memory or disk. Also, if a server is busy, processes will have to share cores.
Some jobs that use multiple cores will show up as multiple processes, each using a single core. The job being run by user3 is using 8 cores, though each process is spending roughly 1% of its time waiting for data. Each of those processes is using 1.3-1.4GB of memory, so the job’s total memory usage is about 10.6GB.
Other jobs that use multiple cores show up as a single process with more than 100% CPU utilization (often much more). Divide by 100 to estimate the number of cores used, but this will probably be low. For example user1’s job is probably running on more than 43 cores, but not getting exclusive use of all of them. Most likely it would use all the cores in the server if it could.
Let your job run long enough that it can finish loading data and start on its main task, and then note its usage. Also note its PID (process ID). Then press q to quit top and end the job by typing:
kill PID
where PID is the process ID you got from top. You’re now ready to submit the job to Slurm.
5.3.2 Efficiency Reports
When your job is finished you’ll get an email telling you so. It will also include an efficiency report which will tell you if you can request fewer resources when running similar jobs in the future. You can get the same report for any job you’ve run by running:
seff JOBID
where JOBID should be replaced by the identifier of the job you ran.
“CPU Efficiency” tells you what percent of the CPU time your job could have used was actually used. For example, if your job requested four cores and ran for one hour (“wall time,” meaning time as measured by a clock on the wall), it could have used four core-hours of CPU time. If it only used two core-hours, then it is 50% efficient. That could mean that it only used two cores for the full hour, or it could mean it spent part of the time on tasks that can only use one core and the rest on tasks that can use four cores. If the percent utilization is roughly 100 divided by the number of cores you reserved, then your job can actually use just one core. Low levels of CPU efficiency suggest your job would not lose much, if any, performance if it were given fewer cores.
“Memory Efficiency” is more straightforward because it only looks at the maximum amount of memory the job ever used. (If the job ever runs out of memory it will crash, so maximum usage is what matters.) While you want to maintain a safety margin, if your job’s Memory Efficiency is less than about 90% you can reduce its memory reservation in the future.
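For reference, the report looks roughly like the following; the exact fields vary by Slurm version, and the numbers here are purely illustrative (a job that reserved four cores and 10GB of memory, used about half of its available CPU time, and peaked at 8GB of memory):
Job ID: 123456
State: COMPLETED (exit code 0)
Cores: 4
CPU Utilized: 02:01:00
CPU Efficiency: 50.00% of 04:02:00 core-walltime
Job Wall-clock time: 01:00:30
Memory Utilized: 8.00 GB
Memory Efficiency: 80.00% of 10.00 GB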