Guies BibTIC: HPC High Performance Computing: 5.1. Cluster queues, resources and limits


Cluster queues

A cluster queue is a resource that can handle and execute user jobs. Depending on the job's demands, the job will be executed on one queue or another. Every queue has its own limits, behavior and default values. The Slurm cluster currently has four queues, shown in the following table:

Queue name	Allowed use
short	Time limit: 2 hours
medium	Time limit: 8 hours
high	Time limit: 14 days
high-cpu	Time limit: 14 days (in nodes without GPU)
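
As a rough illustration of how a job's demands map to a queue, the sketch below picks the smallest queue whose time limit covers a requested wall time. The function name `pick_queue` is hypothetical and not part of Slurm, and the thresholds are simply the time limits from the table above (2 hours, 8 hours, 14 days). It never returns high-cpu, which has the same limit as high but runs on nodes without GPU.

```shell
#!/bin/sh
# Hypothetical helper: map a requested wall time, given in minutes, to the
# smallest queue from the table above whose time limit covers it.
pick_queue() {
    if   [ "$1" -le 120 ];   then echo short     # up to 2 hours
    elif [ "$1" -le 480 ];   then echo medium    # up to 8 hours
    elif [ "$1" -le 20160 ]; then echo high      # up to 14 days (14 * 24 * 60 minutes)
    else
        echo "no queue allows $1 minutes" >&2
        return 1
    fi
}

# Example (job.sh is a placeholder for your own job script):
#   sbatch --partition="$(pick_queue 300)" --time=05:00:00 job.sh
```

Here `--partition` and `--time` are the standard `sbatch` options for selecting the queue and the requested wall time.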

All queues (called partitions in Slurm) are defined with some common parameters. Unless specified otherwise, every job that runs on a queue inherits that queue's parameters. This imposes limits, for example, on the run time or the resources that a job in that queue may consume. As an example, this is the configuration of the short queue:

test@login01:~$ scontrol show partition short
PartitionName=short
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=02:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[001-018,020]
   PriorityJobFactor=100 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 PreemptMode=OFF
   State=UP TotalCPUs=896 TotalNodes=19 SelectTypeParameters=NONE
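
When only one of these parameters is of interest, the output can be filtered. The following is a minimal sketch, assuming the `KEY=VALUE` layout shown above; `partition_field` is a hypothetical helper name, not a Slurm command.

```shell
#!/bin/sh
# Hypothetical helper: read `scontrol show partition` output on standard input
# and print the value of the first field whose key matches $1.
partition_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")                      # position of the first "="
            if (n > 0 && substr($i, 1, n - 1) == key) {
                print substr($i, n + 1)             # everything after "="
                exit
            }
        }
    }'
}

# Example, on a login node:
#   scontrol show partition short | partition_field MaxTime
```

With the configuration shown above, the example would print the value of the MaxTime field for the short queue.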

Cluster limits

When a user submits a job to the scheduler, limits are applied. If the job's requirements exceed the resources that are currently available, the job waits in the queue until the resources become free. If the job's requirements exceed the limits, the job cannot be registered at all. The limits are set up at three different levels: user, research group and queue.

partition name	default	wall time	priority
short	yes	2 hours	100
medium	no	8 hours	75
high	no	unlimited	40
high-cpu	no	unlimited	50
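
Since a job that asks for more wall time than its queue allows cannot be registered, it can help to compare the two before submitting. The sketch below is one possible way to do that check locally. The names `to_minutes` and `fits_limit` are hypothetical, and the code assumes times written as `HH:MM:SS` or `D-HH:MM:SS` only (it does not handle `UNLIMITED` or the other time formats Slurm accepts).

```shell
#!/bin/sh
# Hypothetical helper: convert HH:MM:SS or D-HH:MM:SS to whole minutes,
# rounding any leftover seconds up to the next minute.
to_minutes() {
    echo "$1" | awk -F'[-:]' '
        NF == 3 { d = 0;  h = $1; m = $2; s = $3 }
        NF == 4 { d = $1; h = $2; m = $3; s = $4 }
        { print d * 1440 + h * 60 + m + (s + 0 > 0 ? 1 : 0) }'
}

# Succeeds when the requested time ($1) does not exceed the limit ($2).
fits_limit() {
    [ "$(to_minutes "$1")" -le "$(to_minutes "$2")" ]
}

# Example: a 3 hour request against the 2 hour wall time of the short queue.
if fits_limit 03:00:00 02:00:00; then
    echo "request fits the short queue"
else
    echo "request exceeds the short queue limit"
fi
```

The scheduler remains the authority on whether a job is accepted, because the user and research group limits also apply and are not covered by this check.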