A lot of people have asked me how I use compute clusters for research. It's one of those things that looks intimidating at first, but once you understand the basic idea, it clicks pretty fast. This post walks through SLURM — the scheduler used on most HPC systems — from scratch, with real examples you can actually run.
1. What Is SLURM, and Why Does It Exist?
Imagine a shared supercomputer with hundreds of CPU and GPU nodes. Dozens of researchers all want to run jobs at the same time. Without some kind of traffic controller, it would be chaos — people stepping on each other's resources, no fairness, no accounting.
SLURM (Simple Linux Utility for Resource Management) is that traffic controller. You describe what resources you need and what you want to run, and SLURM figures out when and where to actually execute it. You don't log into a compute node directly — you hand your job to SLURM and it handles the rest.
The mental model is simple: SLURM is a queue manager. You submit jobs, they wait in line, and when resources free up, your job runs. The scheduler tries to be fair and efficient across all users on the system.
2. Core Concepts Worth Knowing
Nodes
Each physical machine in the cluster is a node. Some nodes have only CPUs, others have GPUs attached. You generally never interact with them directly — SLURM abstracts them away and decides where your job lands.
Partitions
A partition is a named group of nodes with its own rules: maximum job duration, allowed users, resource limits, and so on. Think of it like different lanes on a highway. You pick a partition based on what kind of job you're running — a quick test, a multi-day training run, or something that needs a GPU.
Before writing any script, run this command to see what partitions exist on your cluster:
sinfo
The output will look something like this:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 3-00:00:00 4 mix gpu[01-04]
short up 04:00:00 8 idle node[01-08]
convergence* up 15-00:00:00 9 mix node[01-09]
The TIMELIMIT column tells you the hard cap for jobs in each partition. Never request more time than that limit — your submission will be rejected outright.
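Those TIMELIMIT strings use SLURM's days-hours:minutes:seconds notation, which is awkward to compare at a glance. A small helper (a sketch, not part of SLURM itself; it assumes a numeric limit, though partitions can also report "infinite") converts them to plain hours:

```python
def timelimit_to_hours(s: str) -> float:
    """Convert a SLURM time limit like '3-00:00:00' or '04:00:00' to hours."""
    days = 0
    if "-" in s:
        day_part, s = s.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in s.split(":")]
    while len(parts) < 3:          # pad missing fields, e.g. '30:00' -> 0:30:00
        parts.insert(0, 0)
    h, m, sec = parts
    return days * 24 + h + m / 60 + sec / 3600

print(timelimit_to_hours("3-00:00:00"))   # 72.0
print(timelimit_to_hours("04:00:00"))     # 4.0
```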
Jobs, Tasks, and Steps
A job is the unit you submit to SLURM. A running job is divided into steps (usually individual srun calls), and each step launches one or more tasks (separate processes). For most research workloads — running a Python script, fine-tuning a model, processing a dataset — you'll have one task per job. The distinction matters more for MPI workloads where you want hundreds of processes coordinating across nodes.
3. The Commands You'll Use Every Day
# See all partitions and node states
sinfo
# See all jobs currently running or queued
squeue
# See only your own jobs
squeue -u $USER
# Get detailed info about a specific job
scontrol show job <job_id>
# See what hardware (GPUs etc.) nodes have
scontrol show nodes | grep -i gres
# Cancel a job
scancel <job_id>
# Cancel all your jobs at once
scancel -u $USER
Run sinfo and scontrol show nodes | grep -i gres before you write your first script. Between the two you'll know what partitions exist, their time limits, and the exact names of any GPUs available — all of which you need to fill in correctly before your job can be scheduled.
4. Writing Your First Batch Script
The standard way to run something on SLURM is with a batch script: a shell script with special #SBATCH comment lines at the top that tell SLURM what resources to allocate.
#!/bin/bash
#SBATCH --job-name=my_first_job # Name shown in squeue
#SBATCH --partition=short # Which queue to use
#SBATCH --nodes=1 # Number of machines
#SBATCH --ntasks=1 # Number of processes
#SBATCH --cpus-per-task=4 # CPU cores per process
#SBATCH --mem=16G # RAM for the job (per node)
#SBATCH --time=02:00:00 # Wall time limit HH:MM:SS
#SBATCH --output=%x-%j.out # stdout log (%x=jobname, %j=jobid)
#SBATCH --error=%x-%j.err # stderr log
# Load software environment
module load python/3.10
# Run your code
python my_script.py
Submit it with:
sbatch job.sh
# Submitted batch job 123456
You'll get back a job ID. Hold onto it — you'll need it to check status or cancel. After submission, watch your job with:
watch -n 5 squeue -u $USER
This refreshes the view every 5 seconds. Once the job starts running, the ST column changes from PD (pending) to R (running).
%x expands to your job name and %j to the job ID. A job named my_first_job with ID 123456 produces my_first_job-123456.out. This avoids overwriting logs from previous runs and makes it easy to match output files to their corresponding submissions.
5. Key Parameters and Common Mistakes
| Parameter | What it controls | Common mistake |
|---|---|---|
| --time | Maximum wall time (HH:MM:SS). Job is killed if it exceeds this. | Setting it absurdly large (e.g. 20000:00:00) — the scheduler deprioritises these jobs and they often never run. |
| --mem | RAM for the job (per node). | Requesting 512G when no node has that much — job can never be placed. |
| --cpus-per-task | CPU cores per task. | Leaving it at 1 while running PyTorch with multi-worker DataLoaders — the GPU starves waiting for data and utilisation collapses. |
| --gres | Generic resources — most commonly GPUs. | Using a GPU name that doesn't match your cluster. Always verify with scontrol show nodes \| grep Gres first. |
| --partition | Which queue to submit to. | Not specifying it and ending up in a CPU-only default partition when you needed a GPU node. |
A note on --time specifically
This catches more people than anything else. If you estimate a job takes two days, add a reasonable buffer:
# Estimated ~2 days, request 3 days
#SBATCH --time=72:00:00
What you should never write is --time=20000:00:00 hoping the job runs as long as it needs. SLURM's backfill scheduler can fit short jobs into gaps in the queue — jobs with enormous time requests can't be backfilled, so they just sit there. You'll often wait much longer for a 10-day time slot than for a 3-day one, even for the exact same job.
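If you'd rather not do the buffer arithmetic by hand, a tiny helper can turn a runtime estimate into SLURM's D-HH:MM:SS notation. This is a sketch; the 50% default buffer is just an assumption to adjust for your own jobs:

```python
def slurm_time(hours: float, buffer: float = 0.5) -> str:
    """Format an estimated runtime in hours, plus a safety buffer, as D-HH:MM:SS."""
    total = int(hours * (1 + buffer) * 3600)   # total seconds to request
    days, rem = divmod(total, 86400)
    h, rem = divmod(rem, 3600)
    m, s = divmod(rem, 60)
    return f"{days}-{h:02d}:{m:02d}:{s:02d}"

print(slurm_time(48))   # estimated ~2 days + 50% buffer -> 3-00:00:00
```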
A note on --mem
Request roughly what your job actually uses, not a round number that sounds safe. You can profile memory usage on a small run first:
# After a completed job, check actual memory used
sacct -j <job_id> --format=JobID,MaxRSS,Elapsed
MaxRSS shows peak memory consumption, reported in kilobytes by default (sacct appends a K suffix). Use that to set a realistic --mem value with a 20–30% buffer.
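As a sketch of how you might automate that calibration step: the helper below assumes MaxRSS comes back with a one-letter unit suffix (K, M, G, or T), which is the common default but can vary with cluster configuration:

```python
def mem_request(maxrss: str, buffer: float = 0.25) -> str:
    """Turn a sacct MaxRSS value (e.g. '41234567K') into a --mem request with a buffer."""
    units = {"K": 1, "M": 1024, "G": 1024**2, "T": 1024**3}   # all in kilobytes
    kilobytes = float(maxrss[:-1]) * units[maxrss[-1].upper()]
    gigabytes = kilobytes * (1 + buffer) / 1024**2
    return f"--mem={max(1, round(gigabytes))}G"

print(mem_request("41234567K"))   # ~39 GB peak + 25% buffer -> --mem=49G
```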
6. GPU Jobs
For GPU work, the critical first step is finding out what GPUs your cluster has and what they're called. The naming is not standardised — different clusters use different conventions.
scontrol show nodes | grep -i gres
You might see something like:
Gres=gpu:a100:4
# or
Gres=gpu:v100:2
# or on some clusters using MIG slices
Gres=gpu:a100_3g.40gb:8
That gpu:a100:4 means the node has 4 A100 GPUs, and the resource name to use in your script is a100. Your batch script then becomes:
#!/bin/bash
#SBATCH --job-name=gpu_train
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:a100:1
#SBATCH --time=24:00:00
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
module load cuda/12.8
module load cudnn/9.5.0_cuda12
source activate myenv
python train.py
As before: sinfo tells you which partitions have GPU nodes, and scontrol show nodes | grep Gres gives the exact GPU names to plug into --gres.
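If you want to pull those GPU names out programmatically, a rough parser might look like this. It assumes the simple Gres=type:name:count form shown above; some clusters append extra annotations (such as socket bindings) that would need additional handling:

```python
def parse_gres(gres: str):
    """Split a Gres string like 'gpu:a100:4' into (resource, name, count)."""
    parts = gres.removeprefix("Gres=").split(":")
    if len(parts) == 3:
        resource, name, count = parts
        return resource, name, int(count)
    resource, count = parts            # untyped form, e.g. 'gpu:2'
    return resource, None, int(count)

print(parse_gres("Gres=gpu:a100:4"))   # ('gpu', 'a100', 4)
```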
How many CPU cores per GPU?
For PyTorch training, the DataLoader workers run entirely on CPU. If you don't give them enough cores, your GPU sits idle waiting for the next batch. A practical rule: set num_workers in your DataLoader to match --cpus-per-task, and use 4–8 cores per GPU as a starting point.
from torch.utils.data import DataLoader
# Match num_workers to your --cpus-per-task value
loader = DataLoader(dataset, batch_size=64, num_workers=8, pin_memory=True)
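Rather than hard-coding the worker count, you can read it from the job environment. SLURM exports the --cpus-per-task value as SLURM_CPUS_PER_TASK (when the flag was set), so the script stays in sync with the batch script automatically; the fallback is an arbitrary default that keeps it runnable outside SLURM:

```python
import os

# SLURM exports --cpus-per-task as SLURM_CPUS_PER_TASK in the job environment
# (only when the flag was set); the fallback keeps this runnable on a laptop.
num_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "4"))
print(f"Using {num_workers} DataLoader workers")
```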
7. Array Jobs — Running Many Tasks at Once
Array jobs are one of SLURM's most useful features. Instead of manually submitting the same script dozens of times with different parameters, you submit once and SLURM creates N sub-jobs automatically, each with a unique $SLURM_ARRAY_TASK_ID.
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-10 # Creates 10 sub-jobs, IDs 1 through 10
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=logs/%A_%a.out # %A = parent job ID, %a = array index
#SBATCH --error=logs/%A_%a.err
python run_experiment.py --seed $SLURM_ARRAY_TASK_ID
Sub-jobs run in parallel as resources become available. Their IDs in squeue appear as 123456_1, 123456_2, and so on, making them easy to track and cancel individually.
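Seeds are the simplest case; the same index can also drive a hyperparameter sweep. The grid below is a hypothetical example, but the pattern is general: each array task deterministically picks one combination, so one submission covers the whole grid:

```python
import itertools
import os

# Hypothetical sweep grid: each (learning rate, batch size) pair becomes one task.
LRS = [1e-4, 3e-4, 1e-3]
BATCH_SIZES = [32, 64]
GRID = list(itertools.product(LRS, BATCH_SIZES))   # 6 combos -> #SBATCH --array=1-6

task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))   # array IDs start at 1
lr, batch_size = GRID[task_id - 1]
print(f"Task {task_id}: lr={lr}, batch_size={batch_size}")
```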
You can also use the array index to slice a dataset. Say you have a large dataset split into 10 chunks:
# process_chunk.py
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--chunk-id', type=int, required=True)
args = parser.parse_args()
CHUNK_SIZE = 100_000
start = (args.chunk_id - 1) * CHUNK_SIZE + 1
end = args.chunk_id * CHUNK_SIZE
result = sum(i * i for i in range(start, end + 1))
with open(f'results/chunk_{args.chunk_id}.txt', 'w') as f:
    f.write(f"Chunk {args.chunk_id}: {start}–{end}, sum = {result}\n")
print(f"Chunk {args.chunk_id} complete.")
#!/bin/bash
#SBATCH --array=1-10
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=00:30:00
#SBATCH --output=logs/%A_%a.out
mkdir -p results
python process_chunk.py --chunk-id $SLURM_ARRAY_TASK_ID
Ten chunks run in parallel, each writing its own output file. The entire dataset is processed in roughly the time it takes for one chunk — this is the core benefit of parallel computing.
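Once every chunk has finished, you'll usually want to stitch the outputs back together. A minimal merge step, assuming the results/ layout written by process_chunk.py above:

```python
from pathlib import Path

def merge_chunks(results_dir="results", out_name="combined.txt"):
    """Concatenate chunk_*.txt files in numeric order into one output file."""
    results = Path(results_dir)
    results.mkdir(exist_ok=True)
    chunks = sorted(results.glob("chunk_*.txt"),
                    key=lambda p: int(p.stem.split("_")[1]))
    (results / out_name).write_text("".join(c.read_text() for c in chunks))
    return len(chunks)

print(f"Merged {merge_chunks()} chunk files into results/combined.txt")
```

Sorting on the numeric part of the filename matters: a plain lexicographic sort would put chunk_10 before chunk_2.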
A few more array job tricks worth knowing:
# Only run 4 tasks at a time (useful if you have limited licenses or quota)
#SBATCH --array=1-100%4
# Run specific indices, not a continuous range
#SBATCH --array=1,3,7,15
# Cancel just one sub-job
scancel 123456_3
# Cancel all sub-jobs in an array
scancel 123456
8. Job Dependencies
Sometimes job B should only start after job A finishes. SLURM handles this with --dependency, which is useful for building pipelines where preprocessing, training, and evaluation are separate steps.
# Submit preprocessing, capture its job ID
preprocess_id=$(sbatch --parsable preprocess.sh)
echo "Preprocessing job: $preprocess_id"
# Submit training, only starts if preprocessing exits 0
train_id=$(sbatch --parsable --dependency=afterok:$preprocess_id train.sh)
echo "Training job: $train_id"
# Submit evaluation, only starts if training succeeds
sbatch --dependency=afterok:$train_id evaluate.sh
Dependency types you'll actually use:
- afterok:<id> — start only if the dependency exited successfully (code 0)
- afterany:<id> — start regardless of how the dependency exited
- afternotok:<id> — start only if the dependency failed (useful for cleanup or alert scripts)
- afterok:<id1>:<id2> — start only after multiple jobs all succeed
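The shell snippet above translates naturally into Python if you're orchestrating pipelines from a script. The sketch below builds the same linear afterok chain; the submit argument is injectable purely so the chaining logic can be exercised without a cluster:

```python
import subprocess

def submit_chain(scripts, submit=None):
    """Submit scripts as a linear afterok pipeline and return their job IDs.
    'submit' runs one sbatch command and returns the job ID it prints."""
    if submit is None:
        submit = lambda cmd: subprocess.run(
            cmd, capture_output=True, text=True, check=True).stdout.strip()
    job_ids = []
    for script in scripts:
        cmd = ["sbatch", "--parsable"]
        if job_ids:
            # Each job waits for the previous one to exit successfully.
            cmd.append(f"--dependency=afterok:{job_ids[-1]}")
        cmd.append(script)
        job_ids.append(submit(cmd))
    return job_ids
```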
9. Interactive Sessions
Sometimes you need to debug on a compute node directly. Batch jobs are too slow a feedback loop for this — every iteration requires submitting, waiting, and reading a log file. Interactive sessions via srun solve this:
srun --pty --ntasks=1 --cpus-per-task=4 --mem=16G --time=01:00:00 --partition=short bash
This puts you in a real shell on a compute node where you can run commands directly, check that your environment loads correctly, and test whether your script produces sensible output before committing to a full run.
For GPU debugging:
srun --pty --ntasks=1 --cpus-per-task=4 --mem=32G --gres=gpu:1 --partition=gpu --time=01:00:00 bash
# Once inside, check the GPU is visible
nvidia-smi
# Quick Python check
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
10. A Real Training Script
Putting it all together, here's what I'd actually use for a multi-day training run:
#!/bin/bash
#SBATCH --job-name=finetune_v2
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:a100:1
#SBATCH --time=72:00:00
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your@email.com
# Log start time and node
echo "========================================="
echo "Job: $SLURM_JOB_ID"
echo "Node: $SLURMD_NODENAME"
echo "Start: $(date)"
echo "========================================="
# Load modules
module load cuda/12.8
module load cudnn/9.5.0_cuda12
# Activate environment
source activate myenv
# Confirm GPU is visible
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
# Run — time it automatically
time python finetune.py \
--config configs/base.yaml \
--output-dir checkpoints/run_$SLURM_JOB_ID
echo "Finished: $(date)"
A few things worth noting. The echo lines at the start let you see exactly when and where the job ran, which is invaluable if something goes wrong or if you need to report an issue to the sysadmins. --mail-type=END,FAIL means you get an email when the job finishes or crashes — you don't need to keep checking squeue manually. Passing $SLURM_JOB_ID to the output directory means each run's checkpoints land in their own folder, so you never accidentally overwrite a good checkpoint.
11. Debugging When Things Go Wrong
Jobs fail. Here's a quick checklist, ordered by how often each issue actually comes up.
Job stays PD (Pending) indefinitely
Check the REASON column in squeue:
squeue -u $USER -o "%.18i %.9P %.30j %.8u %.8T %.10M %.9l %.6D %R"
- Resources — normal, waiting for resources to free up
- ReqNodeNotAvail — your resource request can't be satisfied by any node; reduce memory, time, or fix the GPU name
- QOSMaxJobsPerUserLimit — you've hit a per-user job limit; wait for existing jobs to finish
- Priority — other users have higher priority; your job will run eventually
Job starts but immediately exits with an error
Check the .err file first — the answer is almost always in there:
cat logs/finetune_v2-123456.err
Common causes: forgot to activate conda environment, wrong Python path, CUDA version mismatch, or a missing output directory that your script assumes exists.
"Requested node configuration is not available"
Your resource combination doesn't exist. Common culprits: GPU name typo, memory larger than any single node, or requesting a GPU in a CPU-only partition. Fix: run scontrol show nodes | grep Gres to get the exact GPU name, and check per-partition limits with sinfo -l.
Job runs but produces wrong results
For array jobs, double-check that $SLURM_ARRAY_TASK_ID is wired up correctly and that output files aren't overwriting each other. Print the ID at the start of your script to confirm:
echo "Array task ID: $SLURM_ARRAY_TASK_ID"
Out of memory (OOM) kill
The job dies silently or you see Killed in the log. Check with:
sacct -j <job_id> --format=JobID,State,MaxRSS,Elapsed
If State shows OUT_OF_MEMORY, increase --mem and resubmit.
12. Habits Worth Developing Early
Always test with a short time limit first. Before submitting a 3-day job, run the same script with --time=00:15:00 on a small subset of data. You'll catch environment issues, path errors, and obvious bugs before wasting queue time on a job that crashes after 10 minutes.
Keep your resource requests honest. Asking for 512G of RAM when your job uses 40G takes resources away from other users and makes your own job harder to schedule. Most clusters have accounting — admins notice patterns, and users who consistently over-request resources often find themselves deprioritised.
Save working scripts as templates. Once you have a script that submits cleanly, runs correctly, and uses reasonable resources, keep it. You'll run similar jobs dozens of times across a project. Having a known-good template prevents a whole class of bugs.
Use --mail-type=END,FAIL. There's no reason to watch squeue manually. Let SLURM email you when something finishes or breaks. You can then check results and resubmit immediately rather than noticing 8 hours later that your job crashed.
Check sacct after every successful job. sacct -j <job_id> --format=JobID,MaxRSS,Elapsed tells you actual memory usage and wall time. Use this to calibrate future resource requests — it's the only reliable way to know what your code actually needs rather than what you guessed.