HPC High Performance Computing: 5.1. Cluster queues, resources and limits

Cluster queues

A cluster queue (called a partition in Slurm) is a resource that accepts and executes user jobs. Depending on its demands, a job runs on one queue or another. Every queue has its own limits, behavior and default values. The Slurm cluster currently has four queues, shown in the following table:

Queue name   Allowed use
short        Time limit: 2 hours
medium       Time limit: 8 hours
high         Time limit: 14 days
high-cpu     Time limit: 14 days (on nodes without GPU)
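
As an illustration of how a job's demands map to a queue, the following minimal sketch picks the shortest queue whose time limit covers a requested run time in whole hours. The function name pick_queue is hypothetical and not a cluster tool, and the limits are hard-coded on the assumption that the table above is current (2 hours, 8 hours, 14 days). It does not consider high-cpu, which has the same limit as high but runs on nodes without GPU.

```shell
# pick_queue HOURS: print the shortest queue whose time limit covers HOURS.
# Hypothetical helper; the limits are copied from the queue table
# (short: 2 hours, medium: 8 hours, high: 14 days).
pick_queue() {
    if [ "$1" -le 2 ]; then
        echo short
    elif [ "$1" -le 8 ]; then
        echo medium
    elif [ "$1" -le $((14 * 24)) ]; then
        echo high
    else
        echo "no queue allows $1 hours" >&2
        return 1
    fi
}
```

For example, pick_queue 5 prints medium, since 5 hours exceeds the limit of short but fits within the limit of medium.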

All queues are defined with some common parameters. Unless specified otherwise, these parameters are inherited by all the jobs that run on these queues. This imposes limits, for example, on time or consumed resources for the jobs that run inside a given queue. Let's see, for example, the configuration of the queue short:

test@login01:~$ scontrol show partition short
PartitionName=short
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=02:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[001-018,020]
   PriorityJobFactor=100 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 PreemptMode=OFF
   State=UP TotalCPUs=896 TotalNodes=19 SelectTypeParameters=NONE
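
When only one or two of these parameters are of interest, the output can be filtered. The sketch below is one possible way to do it: partition_field is a hypothetical helper (not part of Slurm) that reads the scontrol output on standard input and prints the value of a single KEY=VALUE pair, assuming the space-separated layout shown above.

```shell
# partition_field KEY: read `scontrol show partition` output on stdin and
# print the value of the first KEY=VALUE pair whose key is exactly KEY.
# Hypothetical helper; assumes pairs are separated by spaces or newlines.
partition_field() {
    tr -s ' ' '\n' |
        awk -v key="$1" '
            # Match only tokens that start with "KEY=" and keep the first one.
            !found && index($0, key "=") == 1 {
                print substr($0, length(key) + 2)
                found = 1
            }'
}
```

A typical use would be scontrol show partition short | partition_field MaxTime, which for the output above prints 02:00:00.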

Cluster limits

When a user submits a job to the scheduler, limits are applied. If the job's requirements exceed the resources currently available, the job waits in the queue until the resources become free. But if the job's requirements exceed the limits, the job cannot be registered. The limits are set up at three different levels: user, research group and queue.
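
Because a job that exceeds a limit is refused, it can be useful to compare a requested wall time against a queue's time limit before submitting. The following sketch does this for the queue level only. The helpers to_seconds and fits_limit are hypothetical, and they handle only times written as [days-]hours:minutes:seconds, not the value UNLIMITED or the shorter Slurm time formats.

```shell
# to_seconds TIME: convert [days-]hours:minutes:seconds to seconds.
# Hypothetical helper; other Slurm time formats are not handled.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{
        days = 0
        # A leading "D-" prefix carries the number of days.
        if (split($1, d, "-") == 2) { days = d[1]; $1 = d[2] }
        print days * 86400 + $1 * 3600 + $2 * 60 + $3
    }'
}

# fits_limit REQUESTED LIMIT: succeed if REQUESTED does not exceed LIMIT.
fits_limit() {
    [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}
```

For instance, fits_limit 01:30:00 02:00:00 && sbatch --partition=short --time=01:30:00 job.sh submits the (hypothetical) script job.sh only when the requested time fits within the 2 hour limit of short. User and research group limits are not covered by this check.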

Partition name   Default   Wall time   Priority
short            yes       2 hours     100
medium           no        8 hours     75
high             no        unlimited   40
high-cpu         no        unlimited   50