HPC High Performance Computing: 4. Slurm Job Priorities


Slurm Priority Multifactor

Slurm is configured with the priority/multifactor plugin, which schedules jobs based on several weighted factors.

The cluster operates on a basic multifactor priority, based on first-in, first-out (FIFO) scheduling. The fair-share hierarchy represents the portion of the computing resources that has been allocated to different projects, and these allocations are assigned to accounts. There can be multiple levels of allocation, since the allocation of a given account can be further divided among sub-accounts, in this case the UPF Research Groups.

The fair-share factor under an account hierarchy works as follows: the priority of a user's job is calculated from the portion of the machine allocated to the user and from the historical usage of all the jobs run by that user under a specific account. If an account has two members and one of them has run many jobs under that account, the priority of a job submitted by the user who has not run any jobs will also be negatively affected. This ensures that the combined usage charged to an account matches the portion of the machine that is allocated to that account.
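
You can inspect how accounts, sub-accounts and their users are organised on the cluster by listing the Slurm associations as a tree; the exact columns available may vary with the Slurm version, but a query along these lines should work:

$ sacctmgr show associations tree format=Account,User,Fairshare

The sshare command shown in the command line examples below displays the same hierarchy together with the usage and fair-share values.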

The scheduler takes several considerations into account when making scheduling decisions. Jobs are selected to be evaluated in the following order:

  1. Jobs that can preempt (not in use in ohpc).
  2. Jobs with an advanced reservation (not in use in ohpc).
  3. Partition PriorityTier.
  4. Job priority.
  5. Job submit time.
  6. Job ID.

This is important to keep in mind because the job with the highest priority may not be the first to be evaluated by the scheduler. The job priority is considered when there are multiple jobs that can be evaluated at once, such as jobs requesting partitions with the same PriorityTier.
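
To check which PriorityTier each partition has on the cluster, you can filter the partition listing, for example:

$ scontrol show partition | grep -E 'PartitionName|PriorityTier'

The factors that influence the job priority are the following: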

  • Age the length of time a job has been waiting in the queue, eligible to be scheduled
  • Association a factor associated with each association
  • Fair-share the difference between the portion of the computing resource that has been promised and the amount of resources that has been consumed
  • Job size the number of nodes or CPUs a job is allocated
  • Nice a factor that can be controlled by users to prioritize their own jobs (see the example after this list)
  • Partition a factor associated with each node partition
  • Quality of Service (QOS) a factor associated with each Quality Of Service
  • Site a factor dictated by an administrator or a site-developed job_submit or site_factor plugin
  • TRES each TRES Type has its own factor for a job which represents the number of requested/allocated TRES Type in a given partition.
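
As an illustration of the Nice factor, a user can voluntarily lower the priority of their own jobs by giving them a positive nice value (negative values, which raise the priority, are restricted to administrators); myjob.sh is just a placeholder script name:

$ sbatch --nice=100 myjob.sh                 # submit the job with a reduced priority
$ scontrol update JobId=<jobid> Nice=100     # lower the priority of an already queued job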

The job's priority at any given time will be a weighted sum of all the factors that have been enabled in the slurm.conf file. Job priority can be expressed as:

Job_priority =
    site_factor +
    (PriorityWeightAge) * (age_factor) +
    (PriorityWeightAssoc) * (assoc_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize) * (job_size_factor) +
    (PriorityWeightPartition) * (priority_job_factor) +
    (PriorityWeightQOS) * (QOS_factor) +
    SUM(TRES_weight_cpu * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>,
        ...)
    - nice_factor

All of the factors in this formula are floating point numbers that range from 0.0 to 1.0. The weights are unsigned 32-bit integers. The job's priority is an integer that ranges between 0 and 4294967295. The larger the number, the higher the job will be positioned in the queue and the sooner it will be scheduled.
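
To see how these factors combine for your own pending jobs, you can use sprio (described in more detail in the command line examples below); the -n option prints the factors on the normalized 0.0-1.0 scale:

$ sprio -u $USER        # weighted contribution of each factor
$ sprio -n -u $USER     # the same factors normalized to 0.0-1.0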

The job priority parameters in the ohpc.s.upf.edu cluster are set as follows.

# scontrol show config | grep ^Priority

PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 30-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = SMALL_RELATIVE_TO_TIME,DEPTH_OBLIVIOUS,NO_FAIR_TREE
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = QUARTERLY
PriorityType            = priority/multifactor
PriorityWeightAge       = 0
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 10
PriorityWeightQOS       = 1000
PriorityWeightTRES      = (null)

  • PriorityType Set this value to "priority/multifactor" to enable the Multifactor Job Priority Plugin.
  • PriorityDecayHalfLife This determines the contribution of historical usage to the composite usage value. The larger the number, the longer past usage affects fair-share (see the illustration after this list). If set to 0, no decay will be applied; this is helpful if you want to enforce hard time limits per association, but in that case PriorityUsageResetPeriod must be set to some interval. The value is a time string (e.g. min, hr:min:00, days-hr:min:00, or days-hr). The default value is 7-0 (7 days).
  • PriorityUsageResetPeriod At this interval the usage of associations will be reset to 0. This is used to enforce hard limits of time usage per association. If PriorityDecayHalfLife is set to 0, no decay will happen and this is the only way to reset the usage accumulated by running jobs. By default this is turned off, and it is advised to use the PriorityDecayHalfLife option instead to avoid ending up with nothing able to run on the cluster; but if your scheme only allows certain amounts of time on the system, this is the way to enforce it. Applicable only if PriorityType=priority/multifactor. Possible values are NONE, NOW, DAILY, WEEKLY, MONTHLY, QUARTERLY and YEARLY. The default is NONE.
    • QUARTERLY Cleared on the first day of each quarter at time 00:00 (the value configured on this cluster).
  • PriorityFlags Flags to modify priority behavior. Applicable only if PriorityType=priority/multifactor.
    • DEPTH_OBLIVIOUS If set, priority will be calculated in a way similar to the normal multifactor calculation, but the depth of an association in the tree does not adversely affect its priority. This option automatically enables NO_FAIR_TREE.
    • NO_FAIR_TREE Disables the "fair tree" algorithm, and reverts to "classic" fair share priority scheduling.
    • SMALL_RELATIVE_TO_TIME If set, the job's size component will be based not on the job size alone, but on the job's size divided by its time limit.
  • PriorityWeightAge weight applied to the time that the job has been waiting in the queue (set to 0 in this cluster).
  • PriorityWeightFairshare weight applied to the fair-share factor, which reflects the user's past usage of the cluster.
  • PriorityWeightJobSize an unsigned integer that scales the job size factor, i.e. the amount of resources (memory, nodes, cores, tasks and GPUs) a job requests.
  • PriorityWeightPartition weight applied to the factor of the partition to which a job is submitted.
  • PriorityWeightQOS weight applied to the Quality of Service associated with the job (not in use in ohpc).
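
As a rough illustration of the PriorityDecayHalfLife=30-00:00:00 setting above: usage recorded D days ago contributes to the fair-share calculation with an approximate weight of 0.5^(D/30) (Slurm applies the decay incrementally every PriorityCalcPeriod, so this is only an approximation), until the QUARTERLY reset clears all accumulated usage:

$ awk 'BEGIN { for (d = 0; d <= 90; d += 30) printf "%3d days ago -> weight %.3f\n", d, 0.5 ^ (d / 30) }'
  0 days ago -> weight 1.000
 30 days ago -> weight 0.500
 60 days ago -> weight 0.250
 90 days ago -> weight 0.125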

Command line examples

1. Displaying the shares and fair-share information of your user within your account.

$ sshare -l -U -o Account,User,NormShares,RawUsage,NormUsage,EffectvUsage,FairShare,TRESRunMins%100
Account              User     NormShares    RawUsage   NormUsage   EffectvUsage   FairShare                               TRESRunMins
-------------------- ------ ------------ ----------- ----------- -------------- ----------- -----------------------------------------
ResearchGroup1       <user>      0.000705      100942    0.000081       0.000003    0.997186   cpu=127851,mem=465054140,gres/gpu=27134

2. Displaying the FairShare information of all users of your account

$ sshare -a --accounts=<account>
Account              User      RawShares   NormShares    RawUsage   EffectvUsage   FairShare
-------------------- --------- ---------- ------------ ----------- -------------- -----------
ResearchGroup1                          1     0.023256      117684       0.000094    0.997199
 ResearchGroup1      <user1>            1     0.000705       16765       0.000003    0.997193
 ResearchGroup1      <user2>            1     0.000705      100918       0.000003    0.997187
ResearchGroup2
...

  • Account the Slurm Account.
  • User the Slurm User.
  • Raw Shares the raw number of shares assigned to the user or account (in this cluster every account and every user has 1 share).
  • Norm Shares the shares assigned to the user or account divided by the total number of assigned shares; in the example, 0.023256 per account.
  • Raw Usage the number of TRES-seconds of all the jobs charged to the account or user. This number will decay over time when PriorityDecayHalfLife is defined.
  • Norm Usage The Raw Usage normalized to the total number of tres-seconds of all jobs run on the cluster, subject to the PriorityDecayHalfLife decay when defined.
  • Effectv Usage the usage, normalized from 0.0 to 1.0, of a user out of all the users in the system.
  • FairShare calculated from the Raw Shares and the Raw Usage; a low value for an account or user indicates that it has consumed more resources than its share, so other users' jobs will get higher priority.
  • TRESRunMins used to limit the combined total number of TRES minutes used by all jobs running under this account. It takes into consideration the time limit of running jobs and consumes it; if the limit is reached, no new jobs are started until other jobs finish and free up time (see the example after this list).
    • TRES is a combination of a Type and a Name, current TRES can be: CPU, Energy, FS (filesystem), GRES (NvidiaGPU), License, Mem (Memory), Node, Pages and VMem (Virtual Memory/Size)
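
If a TRES run-minutes limit (GrpTRESRunMins) has been set for your account, you can check it through the association settings; replace <account> with your Slurm account (the column is empty when no limit is configured):

$ sacctmgr show associations where account=<account> format=Account,User,GrpTRESRunMins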


3. Displaying a summary of the configured factors that comprise each job's scheduling priority; this is for information purposes only.

The sprio -w option displays the weights (PriorityWeightAge, PriorityWeightFairshare, etc.) for each factor as it is currently configured.

$ sprio -w
  JOBID PARTITION   PRIORITY       SITE  FAIRSHARE    JOBSIZE  PARTITION        QOS
Weights                               1      10000       1000         10       1000

4. Displaying the priority list of the pending jobs. Job 528955 is the highest-priority job in the PENDING state: (Partition) 10 + (Fairshare) 9973 = 9983. You can check the fair-share usage with the sshare utility described in the previous section, which shows all fair-shares organised in a tree structure (accounts/users); use the --accounts option to filter by your account.

$ sprio -l
  JOBID PARTITION     USER   PRIORITY       SITE        AGE      ASSOC  FAIRSHARE    JOBSIZE  PARTITION        QOS
 528955     short   user_1       9983          0          0          0       9973          0         10          0
 528069      high   user_2          4          0          0          0          0          0          4          0
 527946      high   user_2          4          0          0          0          0          0          4          0
 528947      high   user_3          4          0          0          0          0          0          4          0

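Another way to see where pending jobs sit in the queue is to ask squeue to print and sort by the priority column, for example:

$ squeue --state=PENDING --sort=-p -o "%.10i %.9P %.8u %.10Q %.2t %.20r"

Here %Q prints the same integer priority shown by sprio and %r shows the reason why the job is still pending.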

Summary

The main points of this page are summarized below.

  • The cluster works in Slurm's basic multifactor mode: jobs are given a strictly first-arrived, first-served (FIFO) priority.
  • Each DTIC Research Group is a Slurm Account.
  • Every Account has the same number of Shares (1 / TotalAccounts).
  • FairShare indicates the priority factor applied to a user's jobs, based on the portion of the machine allocated to the account and its historical usage.
  • Active users in a specific account can negatively affect the priority of users in the same account who have not run any jobs.
  • The historical usage used for priorities is cleared on the first day of each quarter at 00:00.
  • The weights for the job priority calculation are FairShare (10000), JobSize (1000) and Partition (10).