A cluster queue is a resource that handles and executes user jobs. Depending on its requirements, a job is dispatched to one queue or another. Every queue has its own limits, behavior and default values. Currently, the Slurm cluster has four different queues, shown in the following table:
Queue name | Time limit |
---|---|
short | 2 hours |
medium | 8 hours |
high | 14 days |
high-cpu | 14 days (nodes without GPU) |
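Since short is the default queue, a job lands there unless another partition is requested explicitly. As an illustration (job.sh is a placeholder for a real batch script), a job can be directed to the medium queue at submission time:

```
# Submit job.sh to the medium queue, asking for 4 hours of wall time
# (within medium's 8-hour limit).
sbatch --partition=medium --time=04:00:00 job.sh
```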
All queues are defined with some common parameters. Unless specified otherwise, these parameters are inherited by every job that runs on the queue, which imposes limits on, for example, run time or consumed resources. Let's look, for example, at the configuration of the short queue:
```
test@login01:~$ scontrol show partition short
PartitionName=short
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=02:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[001-018,020]
   PriorityJobFactor=100 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   PreemptMode=OFF State=UP
   TotalCPUs=896 TotalNodes=19 SelectTypeParameters=NONE
```
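To compare these limits across all partitions at a glance, sinfo's format options come in handy; a minimal sketch (this format string is just one possible choice):

```
# One line per partition: name, time limit, node count, CPUs per node.
sinfo --format="%P %l %D %c"
```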
When a user registers a job on the scheduler, limits are applied. If the job's requirements exceed the available resources, the job waits in the queue until resources become free. But if the job's requirements exceed the limits, the job cannot be registered at all. The limits are set up at three different levels: user, research group and queue.
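The queue-level limits are the ones shown by scontrol above; user- and research-group-level limits live in the accounting database and can be inspected with sacctmgr. A minimal sketch (the exact fields available depend on the site's accounting configuration):

```
# Show the submitting user's association limits; the Account column
# corresponds to the research group.
sacctmgr show associations user=$USER format=User,Account,MaxJobs,MaxSubmit
```

The queue-level settings of the four partitions are summarized below: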
Partition name | Default | Wall time | Priority |
---|---|---|---|
short | yes | 2 hours | 100 |
medium | no | 8 hours | 75 |
high | no | unlimited | 40 |
high-cpu | no | unlimited | 50 |
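Putting it all together, a minimal batch script that stays within the medium queue's limits might look like this (the job name and workload are illustrative):

```
#!/bin/bash
#SBATCH --job-name=demo        # illustrative job name
#SBATCH --partition=medium     # 8-hour wall-time limit, priority 75
#SBATCH --time=06:00:00        # must not exceed the partition's MaxTime
#SBATCH --ntasks=1

srun hostname                  # replace with the real workload
```

If --time requested more than the partition's MaxTime, the scheduler would refuse to register the job, which is exactly the behavior described above.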