5  Slurm

When you submit a job to Slurm, you tell Slurm how many cores and how much memory you need and then it finds a server in its cluster that has those resources available and runs your job on it. Your job gets exclusive use of those resources. If no server has the resources you need available, your job will wait in the queue until they become available. Slurm Status will show you what resources are immediately available.

Submitting a job to Slurm is very easy; figuring out what resources it needs can be a bit trickier. The more accurate your estimates of what your job needs, the more efficiently Slurm can use the servers, but please don’t let figuring out what you need prevent you from using Slurm. Make your best estimate, guess low if you’re not sure, and put Slurm to work for you.

The SSCC Slurm clusters include some servers that were purchased by the Department of Economics, the Wisconsin School of Business, the School of Medicine and Public Health, or individual faculty members (many with matching funds from the Office of the Vice Chancellor of Research and Graduate Education through a Research Core Revitalization Program grant). Members of those groups have priority on their servers. See the section on Partitions and Priorities for details.

5.1 Using Slurm

5.1.1 Submitting Jobs

To submit a job to Slurm or SlurmSilo, first log into Linstat or LinSilo and then use the ssubmit command.

Slurm Assistant is a web form that will ask you questions about your job and craft an appropriate ssubmit command for you. You can then copy the command and paste it into your Linstat session. We highly recommend it if you’re new to Slurm, want to use options you haven’t used before, or are putting together a complicated submission. Note that Slurm Assistant was designed for Slurm, and the command it gives you may need some tweaking before it will work in SlurmSilo.

On the other hand, the ssubmit command is not complicated. The basic syntax is:

ssubmit --cores=C --mem=Mg "command"

Where C is the number of cores or threads to use, M is the amount of memory to use in gigabytes, and command is the command you’d normally use to run the job directly on the server. See the Program Information for sample ssubmit commands for various programs. For example:

ssubmit --cores=1 --mem=10g "R CMD BATCH --no-save run_model_1.R"

will run the R script run_model_1.R using one core and 10GB of memory.

5.1.1.1 Partitions

To use a partition other than the default sscc partition, add --partition= or just --part= followed by the name of the partition you want to use. Partitions and Priorities describes the various partitions and when you might want to use them, but the default partition will be fine for most jobs.
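
For example (quick_check.R is a placeholder script name), a small job could be sent to the short partition described below with:

ssubmit --cores=1 --mem=5g --part=short "R CMD BATCH --no-save quick_check.R"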

5.1.1.2 Email Notifications

By default Slurm will send an email message to your preferred SSCC email address when your job finishes. That email will also tell you how much of the resources you reserved were actually used by your job. You can tell Slurm not to send you email with --noemail. This is highly recommended if you’re submitting large numbers of jobs at once.

You can tell Slurm to send notifications to a different email address with --email=address, where address should be replaced by the email address you want to use. You can also change your preferred SSCC email address.
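
For example (the address is just a placeholder):

ssubmit --cores=1 --mem=10g --email=bbadger@wisc.edu "R CMD BATCH --no-save run_model_1.R"
ssubmit --cores=1 --mem=10g --noemail "R CMD BATCH --no-save run_model_1.R"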

5.1.1.3 Preemption

Jobs on some servers may be preempted by higher priority jobs and have to start over; see Partitions and Priorities for details. You can require that your job run on a server where it cannot be preempted by adding --nopreempt. This may be useful for long jobs that will lose a lot of work if preempted, but if none of those servers are available your job will have to wait in the queue until one becomes available. We anticipate that jobs will rarely be preempted. Note that not all partitions have servers where jobs cannot be preempted.
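
For example (long_model.R is a placeholder script name):

ssubmit --cores=16 --mem=50g --nopreempt "R CMD BATCH --no-save long_model.R"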

5.1.1.4 GPUs

One of the Slurm servers (slurm138) has two powerful Nvidia A100 GPUs with 80GB of memory each. (Other servers with GPUs are in temporary roles outside the Slurm cluster right now.)

To Slurm, a GPU is a “generic resource” or gres. To use one of the A100 GPUs add the following to your ssubmit command:

--gres=gpu:a100:1

To use both GPUs, add:

--gres=gpu:a100:2

This will tell Slurm that your job must run on slurm138, and reserve one or both of its GPUs for your job’s use.
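
For example (train_model.py is a placeholder, and the core and memory numbers are just for illustration), a job that needs one A100 might be submitted with:

ssubmit --cores=8 --mem=64g --gres=gpu:a100:1 "python train_model.py"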

Slurm will create an environment variable called CUDA_VISIBLE_DEVICES that contains the number of the GPU assigned to your job. Your job should look up the value of CUDA_VISIBLE_DEVICES so it knows which GPU to use.

(If you request both GPUs CUDA_VISIBLE_DEVICES will be a comma-separated list of numbers, but in that case you don’t need to look at it since it will always be the same.)
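
As a minimal sketch (run_gpu_job.sh and train_model.py are placeholder names), a small shell script submitted as your job could record the assignment before starting the real work:

#!/bin/bash
# run_gpu_job.sh: print the GPU(s) Slurm assigned to this job, then start the real work
echo "Slurm assigned GPU(s): $CUDA_VISIBLE_DEVICES"
python train_model.py

You would then submit it with the --gres= option shown above and bash run_gpu_job.sh as the quoted command.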

5.1.1.5 Multiple Servers

You can run jobs that will use multiple servers with the --nodes= switch. The --cores= and --mem= switches then specify the number of cores and the amount of memory to use on each server. However, depending on how your code does its parallelization, you may also need to specify either the total number of tasks to run with --ntasks= or the number of tasks to run on each server with --ntasks-per-node=.

The newer servers in the cluster have high-speed RoCE networking designed for MPI. If your job uses MPI or similar protocols for communicating between processes running in parallel, you should require that all of the processes run on servers with RoCE with --constraint=roce.
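
For example (a sketch only: mpi_model.py is a placeholder, and the right combination of nodes, tasks, and cores depends on how your code parallelizes, so check the efficiency report in the completion email to confirm you got what you expected):

ssubmit --nodes=2 --cores=8 --mem=64g --ntasks-per-node=8 --constraint=roce "mpirun python mpi_model.py"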

5.1.1.6 Output that Goes to the Screen

Slurm will send any text output that would normally go to your screen to a file called slurm-N.out, where N is the JOBID of the job. If this output is important, we recommend you send it to a file of your choice instead by adding >outputfile to the end of your command, where outputfile should be replaced by the desired file name. For example:

matlab -nodisplay < my_matlab_script.m > my_matlab_script.log

Some programs send error messages separately from regular output (i.e. output goes to stdout and error messages go to stderr). You can send both to a single file with |& cat >. For example:

julia my_julia_script.jl |& cat > my_julia_script.log

5.1.1.7 sbatch and srun

ssubmit is a wrapper for the standard Slurm sbatch command, and you can use any of its switches as well. Switches always go before the command in quotes.

You may also use sbatch or srun instead of ssubmit, especially if you are interested in using Slurm job arrays. Some things to be aware of if you do:

  • The --cores= switch used by ssubmit translates to --cpus-per-task= in standard Slurm.
  • The --nopreempt switch translates to --constraint=sscc-nopreempt if you are submitting your job to a general partition. If you are submitting to an Econ faculty partition it translates to --constraint=econ-fac-nopreempt, and if you are submitting to an Econ grad student partition it translates to --constraint=econ-grad-nopreempt.
  • ssubmit automatically adds --prefer=sscc-nopreempt, or its equivalent for the Econ partitions, to jobs that don’t specify --nopreempt. This tells Slurm to run the job on a server where it cannot be preempted if one is available, but to use other servers rather than waiting. You may want to do the same. (We’ve found that if you use --prefer with a job array it becomes a requirement rather than a preference.)
  • If you want email notifications, use --mail-type=fail,end --mail-user=address, where address can be either your SSCC username, in which case mail will go to your preferred SSCC email address, or an actual email address.
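
For example (a sketch only; bbadger is a placeholder username), the earlier ssubmit example would translate to something like:

sbatch --cpus-per-task=1 --mem=10g --prefer=sscc-nopreempt \
    --mail-type=fail,end --mail-user=bbadger \
    --wrap="R CMD BATCH --no-save run_model_1.R"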

5.1.2 Managing Jobs

To check on your job or see what other jobs are in the queue, type:

squeue -a

(-a means to show all jobs, including jobs in partitions you can’t use.) In the output, JOBID tells you the internal identifier Slurm assigned to your job, and NODELIST tells you which node(s) your job is running on. If it says (Resources) under NODELIST that means that the computing resources needed to run your job are not available at the moment. Your job will wait in the queue until other jobs finish, and then start as soon as possible. (BeginTime) means your job was preempted by a higher priority job and put back in the queue; Slurm will run it again shortly if resources are available.

To cancel a job use:

scancel JOBID

where JOBID should be replaced by the identifier of the job you want to cancel. You can cancel multiple jobs by replacing JOBID with a range specified as {start..end} where start and end should be replaced by numbers.
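
For example, with hypothetical job IDs:

scancel 1234567
scancel {1234560..1234569}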

You can get a useful overview of the cluster with Slurm Status, which lists all of the servers in the cluster and the resources that are currently free on each of them.

sinfo will summarize the partitions and servers in the cluster, and their current state.

5.2 About Slurm

5.2.1 Partitions & Priorities

When you submit a job to Slurm, you submit it to a partition. Partitions can have different priorities, time limits, and other properties. As usage of Slurm increases, we will continuously evaluate these settings. While a job is only submitted to one partition, servers can belong to (take jobs from) multiple partitions.

Some of the Slurm servers were purchased by particular groups and their affiliates have priority on those servers. Others were purchased by individual faculty members, and they have priority on those. High priority users can claim that priority by submitting to an appropriate partition, as described below.

A job submitted to a high priority partition will preempt a job running in a lower priority partition if it cannot run otherwise. A preempted job will go back in the queue and be run again as soon as the resources it needs are available; however, it will have to start over. Some partitions (notably the default sscc partition) contain a mix of servers where jobs may be preempted and servers where jobs will not be preempted because no one else has priority on them. By default, jobs submitted to these partitions with ssubmit will run on the servers where they cannot be preempted if those servers are available, but you can use --nopreempt to specify that your job should only run on a server where it cannot be preempted. This is useful for long jobs where you’ll lose a lot of work if they are preempted. However, your jobs may have to wait in the queue until one of those servers becomes available. We anticipate that preemption will be rare.

Jobs submitted to partitions with equal priority will not preempt each other. In choosing which job to run next, Slurm considers the amount of Slurm computing time each user has used recently (past use decays exponentially with a half life of one week) and how long the job has been waiting to run.

If you submit a job without specifying a partition, it goes to the sscc partition. You can specify a different partition with the --partition= or just --part= switch.

Slurm Status will show you what resources are immediately available and what partitions they belong to.

5.2.1.1 General Partitions

Partition Name | Servers                                        | Max Job Length | Notes
sscc           | slurm[001-005], slurm[101-135], slurm[137-138] | 10 Days        | Default partition
short          | slurm[013,015-016], slurm[101-138]             | 6 Hours        | For short jobs
long           | slurm135                                       | 30 Days        | For long jobs

slurm[013,015-016] and slurm136 are reserved for short jobs (less than 6 hours). To use them, submit your jobs to one of the short partitions; short jobs may still run on other servers if they are available.

slurm135 runs long jobs (up to 30 days) submitted to the long partition, but it also takes jobs from other partitions. If you need to run jobs that will take longer than 30 days, contact the Help Desk.

slurm[013,015-016] and slurm[131-136] were purchased by the SSCC and jobs run on them will never be preempted by higher priority jobs. You can require that your job run on one of them by adding --nopreempt to your ssubmit command. If you use srun or sbatch, add --constraint=sscc-nopreempt instead.

5.2.1.2 Economics Partitions

The following partitions are available to affiliates of the Department of Economics:

Partition Name  | Servers                        | Max Job Length | Notes
econ            | slurm[101-130]                 | 10 Days        |
econ-fac        | slurm[101-121], slurm[128-130] | 10 Days        | For Econ faculty, limit of 256 cores
econ-fac-short  | slurm[101-123], slurm[128-130] | 2 Hours        | For short jobs
econ-grad       | slurm[101-126], slurm[128-130] | 10 Days        | For Econ grad students
econ-grad-short | slurm[124-130]                 | 6 Hours        | For short jobs

Jobs submitted to these partitions will preempt jobs submitted to the general partitions. Jobs submitted to the faculty or grad student partitions will preempt jobs submitted to the general econ partition. Jobs submitted to the faculty partitions will preempt jobs submitted to the grad student partitions, but note that four servers (slurm[124-127]) are reserved for grad students. The Economics partitions include some servers purchased by individual faculty members, and their jobs preempt all others on their servers.

The servers purchased by the Economics Department, and most of those purchased by faculty in Economics, have 256GB of memory. If you need more memory than that, you’ll want to submit your jobs to the general partitions where there are servers with 768GB or 1,024GB of memory.

You can access a tremendous amount of computing power by submitting jobs to econ (and even more by submitting to sscc). We suggest thinking of the other partitions as ways to get a smaller amount of computing power on short notice even if econ is full.

Jobs submitted to a faculty partition and run on slurm[114-123] will never be preempted. You can require that your job run on one of them by adding --nopreempt to your ssubmit command. Use --constraint=econ-fac-nopreempt with srun or sbatch.

Jobs submitted to a grad student partition and run on slurm[124-127] will never be preempted. You can require that your job run on one of them by adding --nopreempt to your ssubmit command. Use --constraint=econ-grad-nopreempt with srun or sbatch.

5.2.1.3 Wisconsin School of Business Partition

The wsb-gpu partition is available to affiliates of the Wisconsin School of Business:

Partition Name | Servers        | Max Job Length | Notes
wsb-gpu        | slurm[137-138] | 10 Days        | For jobs that need GPUs

Jobs submitted to it will preempt jobs submitted to the general partitions. Please use this partition only for jobs that require the Nvidia A100 GPUs these servers contain. We appreciate WSB contributing these servers with their powerful GPUs, and WSB affiliates have access to many non-GPU servers in return. We would hate to see non-GPU jobs submitted by WSB affiliates preempt GPU jobs submitted by other SSCC members.

5.2.1.4 School of Medicine and Public Health Partition

The smph partition is available to affiliates of the School of Medicine and Public Health:

Partition Name | Servers        | Max Job Length
smph           | slurm[001-005] | 10 Days

These servers have 44 cores and 384GB of memory. Use the sscc partition if you need more of either.

5.2.1.5 Individual Partitions

Faculty who purchased servers have their own partition containing those servers. Faculty who purchased a share of a server have a partition containing that server where their usage is limited to their share. Your partition name is your last name (in all lower case). Jobs submitted to these partitions take priority over all other jobs and may run for up to 90 days (the time limit allows SSCC staff to maintain the servers).
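
For example, a faculty member with the (hypothetical) last name Badger would submit to their own servers with something like:

ssubmit --cores=64 --mem=250g --part=badger "R CMD BATCH --no-save run_model_1.R"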

5.2.1.6 SlurmSilo Partitions

Partition Name | Servers                          | Max Job Length | Notes
sscc           | slurmsilo[001-004], slurmsilo011 | 7 Days         | Default partition
short          | slurmsilo[001-005], slurmsilo011 | 6 Hours        | For short jobs
smph           | slurmsilo[001-004], slurmsilo011 | 7 Days         | High priority partition for SMPH researchers
smph-short     | slurmsilo[001-005], slurmsilo011 | 6 Hours        | High priority partition for SMPH researchers running short jobs

slurmsilo005 is reserved for short jobs. To use it, submit your jobs to one of the short partitions; short jobs may still run on other servers if they are available.

All of the SlurmSilo servers were purchased by SMPH so SMPH researchers have priority on all of them. They can claim that priority by submitting their jobs to an smph partition.

5.2.2 Cluster Specifications

The Slurm (non-Silo) cluster currently consists of the following servers:

Servers        | Cores    | Memory (GB) | GPU
slurm[130-136] | 128 AMD  | 1024        |
slurm138       | 128 AMD  | 512         | 2x Nvidia A100, 80GB memory each
slurm137       | 128 AMD  | 512         | 2x Nvidia A100, 40GB memory each
slurm[128-129] | 128 AMD  | 512         |
slurm[101-127] | 128 AMD  | 256         |
slurm013       | 36 Intel | 768         |
slurm[015-016] | 48 Intel | 768         | Nvidia T4, 16GB memory
slurm[001-005] | 44 Intel | 384         |

The SlurmSilo cluster currently consists of the following servers:

Servers            | Cores    | Memory (GB)
slurmsilo011       | 80 Intel | 768
slurmsilo[001-005] | 44 Intel | 384

All of these servers were purchased by SMPH and SMPH researchers have priority on them.

5.3 Identifying Your Job’s Resource Needs

Slurm will give your job exactly the resources you reserve for it. If you reserve fewer cores than your job can use, it will run more slowly than it could have. If you reserve less memory than it needs, it will crash. On the other hand, if you reserve more cores and memory than your job will actually use, the excess will sit idle and no one else can use it until your job is done.

Your estimates of your job’s resource needs will always be imperfect unless you’re running lots of identical jobs, and that’s okay. Just do your best. When in doubt, guess low.

Be mindful that other people need to use the Slurm cluster, but think in terms of taking turns rather than sharing. If your job will run fastest using all the cores in a server, then use all the cores in a server and get out of the next person’s way, especially if your job uses lots of memory.

Memory tends to be the bottleneck in the SSCC Slurm cluster rather than cores. If you need a lot of memory (>100GB, either for a single job or across a batch of jobs that run at the same time), be sure your code uses memory efficiently. Delete data frames that are no longer needed, for example. Then pay close attention to how much memory your jobs actually use. Keep in mind that the Slurm cluster has many servers with 256GB of memory, so you can run many more jobs if you limit your memory usage to less than that. Also keep in mind how many jobs can fit on a server: a server with 1024GB of memory can run three jobs that need 340GB each, but only two jobs that need 350GB each.

On the other hand, if you reserve all the cores in a server, you can just reserve all its memory too rather than trying to pin down exactly how much you need: no one else can use a server’s memory if you’re using all its cores.

To identify your job’s resource needs, first think through the job itself. If you told the program a number of cores to use, tell Slurm to reserve the same number. If it asked for a number of threads, reserve one core per thread. You’ll need at least enough memory to load your data (unless you’re running SAS). The Program Information chapter has some notes about typical resource usage for some programs. But you can get a much better sense of what a job needs by first starting it on Linstat for a few minutes and seeing what it uses before you submit it to Slurm, and then by reading the efficiency report in the email Slurm will send you when your job is complete.

5.3.1 Monitoring a Job

To start a job in batch mode so you can monitor its resource usage, use the command given for your program in the Program Information chapter. It’s the same command that you’re putting in quotes for ssubmit, but with an ampersand (&) at the end.
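
For example, to start the R script used earlier in this chapter (the trailing ampersand runs it in the background so you get your prompt back):

R CMD BATCH --no-save run_model_1.R &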

Then run top to see the top jobs on the server. The output will look something like this (usernames have been hidden for privacy, and the list has been truncated):

   PID  USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
 911054 user1     20   0   22.6g  21.9g   4384 R  4288   2.9  81566:20 a.out    
 275522 user2     20   0 3738648 999460  55320 R 314.5   0.1  42163:08 python   
1028196 user3     20   0   10.4g   1.4g 333524 S  99.3   0.2   1484:59 MATLAB   
1028236 user3     20   0   10.4g   1.3g 332884 S  99.3   0.2   1487:13 MATLAB   
1028412 user3     20   0   10.4g   1.3g 335700 S  99.3   0.2   1487:55 MATLAB   
1028484 user3     20   0   10.4g   1.3g 334660 S  99.3   0.2   1480:15 MATLAB   
1028543 user3     20   0   10.4g   1.3g 334216 S  99.3   0.2   1477:58 MATLAB   
1028728 user3     20   0   10.3g   1.4g 333268 S  99.3   0.2   1478:36 MATLAB   
1028312 user3     20   0   10.2g   1.3g 332724 S  98.7   0.2   1486:45 MATLAB   
1028198 user3     20   0   10.4g   1.3g 334844 S  98.4   0.2   1485:36 MATLAB   

The most important columns are RES (resident memory) and %CPU.

A ‘g’ at the end of the RES column indicates that the memory is given in gigabytes; values without a suffix are in kilobytes (KiB). Thus the job being run by user1 is using 21.9GB of memory, while the job being run by user2 is using 999,460 KiB, or roughly 1GB. Round memory up when you submit your job to Slurm so you have a bit of a safety margin.

%CPU is measured in terms of cores, so a %CPU of 100 means a process is using one core all of the time. However, you’ll rarely see 100% because processes frequently have to wait for data to be retrieved from memory or disk. Also, if a server is busy processes will have to share cores.

Some jobs that use multiple cores will show up as multiple processes, each using a single core. The job being run by user3 is using 8 cores, though each process is spending roughly 1% of its time waiting for data. Each of those processes is using 1.3-1.4GB of memory, so the job’s total memory usage is about 10.6GB.

Other jobs that use multiple cores show up as a single process with more than 100% CPU utilization (often much more). Divide %CPU by 100 to estimate the number of cores used, but keep in mind this estimate will probably be low. For example, user1’s job is probably running on more than 43 cores, but not getting exclusive use of all of them. Most likely it would use all the cores in the server if it could.

Let your job run long enough that it can finish loading data and start on its main task, and then note its usage. Also note its PID (process ID). Then press q to quit top and end the job by typing:

kill PID

where PID is the process ID you got from top. You’re now ready to submit the job to Slurm.

5.3.2 Efficiency Reports

When your job is finished you’ll get an email telling you so. It will also include an efficiency report, which will tell you whether you can request fewer resources when running similar jobs in the future. You can get the same report for any job you’ve run by running:

seff JOBID

where JOBID should be replaced by the identifier of the job you ran.

“CPU Efficiency” tells you what percent of the CPU time your job could have used was actually used. For example, if your job requested four cores and ran for one hour (“wall time,” meaning time as measured by a clock on the wall), it could have used four core-hours of CPU time. If it only used two core-hours, then it is 50% efficient. That could mean that it only used two cores for the full hour, or it could mean it spent part of the time on tasks that can only use one core and the rest on tasks that can use four cores. If the percent utilization is roughly 100 divided by the number of cores you reserved, then your job can actually use just one core. Low levels of CPU efficiency suggest your job would not lose much, if any, performance if it were given fewer cores.

“Memory Efficiency” is more straightforward because it only looks at the maximum amount of memory the job ever used. (If the job ever runs out of memory it will crash, so maximum usage is what matters.) While you want to maintain a safety margin, if your job’s Memory Efficiency is less than about 90% you can reduce its memory request in the future.