Submitting basic jobs

The basic command to submit a job is sbatch:

test@login01:~$ sbatch sleep.sh
Submitted batch job 175

Once the job is registered, the scheduler reports its job ID. Keep it handy: if anything goes wrong, you will need this value to debug what happened.

Although you can specify the options directly when calling sbatch, when submitting a job it is strongly advised that all options be provided within a job definition file (in these examples the file will be called "job.sh"). This file contains the command you wish to execute together with the SLURM resource request options that you need.

#!/bin/bash
#SBATCH -J prova_uname10
#SBATCH -p short
#SBATCH -N 1
#SBATCH -n 2 
#SBATCH --chdir=/homedtic/test/slurm_jobs
#SBATCH --time=2:00
#SBATCH -o %N.%J.out # STDOUT
#SBATCH -e %N.%j.err # STDERR
 
ps -ef | grep slurm
uname -a >> /homedtic/test/uname.txt
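All of these options can also be passed on the sbatch command line. As a sketch, the call below requests the same resources as the directives in job.sh above (command-line options override the ones in the script):

sbatch -J prova_uname10 -p short -N 1 -n 2 --time=2:00 --chdir=/homedtic/test/slurm_jobs job.sh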

The "-J" option sets the name of the job. This name is used to create the output log files for the job. We recommend using a capital letter for the job name is order to distinguish these log files from the other files in your working directory. This makes it easier to delete the log files later.

The "-p" option requests the queue in which the job should run. 

The "-N" option Request that a minimum of nodes be allocated to this job.

The "-n" option Request the number of tasks per node.

The “–time” option specify a a limit on the total run time of the job allocation.

The “–chdir” option  Set the working directory of the batch script to directory before it is executed.

The "-o" option instruct Slurm to connect the batch script's standard output directly to the file name specified in the "filename pattern".

The "-e" option instruct Slurm to connect the batch script's standard error directly to the file name specified in the "filename pattern".

We can monitor how our job is doing with the scontrol show job command: 

test@login01:~$ scontrol show job 173

   JobId=173 JobName=prova_uname10
   UserId=test(1039) GroupId=info_users(10376) MCS_label=N/A
   Priority=6501 Nice=0 Account=info QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=08:00:00 TimeMin=N/A
   SubmitTime=2017-11-27T16:37:47 EligibleTime=2017-11-27T16:37:47
   StartTime=2017-11-27T16:50:21 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=medium AllocNode:Sid=node009:5743
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=node[004,020]
   NumNodes=2-2 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=4096,node=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/homedtic/test/sleep.sh
   WorkDir=/homedtic/test
   StdErr=/homedtic/test/slurm.%N.173.err
   StdIn=/dev/null
   StdOut=/homedtic/test/slurm.%N.173.out
   Power=

The scontrol show job output shows the ID of our job (JobId), its priority (Priority), who launched it (UserId), the state of the job (JobState), when it was submitted and to which partition (SubmitTime and Partition), and how many resources it has been allocated (NumNodes, NumCPUs and TRES).
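For a quick overview of all your jobs at once, squeue is also useful. The listing below is only illustrative (the exact columns and widths depend on the site configuration), showing the same pending job as above:

test@login01:~$ squeue -u test
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               173    medium prova_un     test PD       0:00      2 (Resources)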

Job run time

Each queue has specific policies that enforce how long jobs are allowed to execute: short* queues allow up to 2 hours, medium* queues allow up to 8 hours, and high* queues have no runtime limit. When you submit a job, you are either implicitly or explicitly indicating how long the job is expected to run. The simplest way to indicate the maximum runtime explicitly is the --time option:

#!/bin/bash
#SBATCH -J prova_uname10
#SBATCH -p short
#SBATCH --time=2:00
 
ps -ef | grep slurm
srun -n8 uname -a >> $HOME/uname.txt

Here we have an example of a time limit: this script will be stopped when it reaches 2 minutes of execution. If the requested time limit exceeds the partition's time limit, the job will be left in a PENDING state (possibly indefinitely).
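The --time option accepts the standard Slurm time formats, for example:

#SBATCH --time=30             # 30 minutes
#SBATCH --time=2:00           # 2 minutes (minutes:seconds)
#SBATCH --time=02:00:00       # 2 hours (hours:minutes:seconds)
#SBATCH --time=1-12:00:00     # 1 day and 12 hours (days-hours:minutes:seconds)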

Redirection output and error files

By default, if not told otherwise, the scheduler redirects the output (and error) of any job you launch to a file named slurm-<jobid>.out, placed in the directory from which you submitted the job. After a few executions and tests, your $HOME will probably look something like this (the listing below is illustrative; the job IDs are made up):
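test@login01:~$ ls $HOME
job.sh  sleep.sh  slurm-175.out  slurm-176.out  slurm-177.out  slurm-178.out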

As a general rule, you are advised to use the following flags to redirect the output and error files:

Flag Request Comment
-e <path>/<filename> Redirect the error file The system will create the given file at the specified path and redirect the job's standard error to it. If no name is specified, the default name applies.
-o <path>/<filename> Redirect the output file The system will create the given file at the specified path and redirect the job's standard output to it. If no name is specified, the default name applies.
--chdir=<path> Change the working directory The batch script runs in the given directory, and output and error files given with relative names are created there instead of in the directory from which 'sbatch' was called.

Of course, we can place these options in our job definition file:

#!/bin/bash
#SBATCH -J prova_dani_uname10
#SBATCH -p short
#SBATCH -N 1
#SBATCH -n 2 # number of cores
#SBATCH --chdir=/homedtic/test/slurm_jobs
#SBATCH --time=2:00
#SBATCH -o slurm.%N.%J.%u.out # STDOUT
#SBATCH -e slurm.%N.%J.%u.err # STDERR
 
ps -ef | grep slurm

When the job is launched, we will see the output files created in /homedtic/test/slurm_jobs. If no error is reported, the error file is still created but stays empty.

test@node009:~/slurm_jobs$ ls
slurm.node005.206.test.err  slurm.node005.206.test.out
test@node009:~/slurm_jobs$ 
test@node009:~/slurm_jobs$ 
test@node009:~/slurm_jobs$ cat slurm.node005.206.test.out 
root      1138     1  0 Nov24 ?        00:00:00 /usr/sbin/slurmd
root      6714     1  0 20:04 ?        00:00:00 slurmstepd: [206]   
test   6719  6714  0 20:04 ?        00:00:00 /bin/bash /var/spool/slurmd/job00206/slurm_script
test   6724  6719  0 20:04 ?        00:00:00 grep slurm
test@node009:~/slurm_jobs$ 

Deleting and modifying jobs

We can modify the requirements of a job while it is waiting to be processed, and we can delete it at any point:

Command Request Comment
scancel <job_id> Delete the job The system will remove the job and all its dependencies from the queues and the execution hosts.
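Deleting and modifying can both be done from the command line. The sketch below uses a made-up job ID; note that scontrol update works on jobs that are still pending, and that on most clusters regular users may lower a job's time limit but not raise it:

scancel 175                                      # delete the job
scontrol update JobId=175 TimeLimit=30:00        # lower the pending job's time limit to 30 minutes
scontrol update JobId=175 Partition=short        # move the pending job to another partition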

Below is a table of the job reason codes that the scheduler reports; an example of querying the reason for a specific job follows the table.

JOB REASON CODES

AssociationJobLimit

The job's association has reached its maximum job count.

AssociationResourceLimit

The job's association has reached some resource limit.

AssociationTimeLimit

The job's association has reached its time limit.

BadConstraints

The job's constraints can not be satisfied.

BeginTime

The job's earliest start time has not yet been reached.

BlockFreeAction

An IBM BlueGene block is being freed and can not allow more jobs to start.

BlockMaxError

An IBM BlueGene block has too many cnodes in error state to allow more jobs to start.

Cleaning

The job is being requeued and still cleaning up from its previous execution.

Dependency

This job is waiting for a dependent job to complete.

FrontEndDown

No front end node is available to execute this job.

InactiveLimit

The job reached the system InactiveLimit.

InvalidAccount

The job's account is invalid.

InvalidQOS

The job's QOS is invalid.

JobHeldAdmin

The job is held by a system administrator.

JobHeldUser

The job is held by the user.

JobLaunchFailure

The job could not be launched. This may be due to a file system problem, invalid program name, etc.

Licenses

The job is waiting for a license.

NodeDown

A node required by the job is down.

NonZeroExitCode

The job terminated with a non-zero exit code.

PartitionDown

The partition required by this job is in a DOWN state.

PartitionInactive

The partition required by this job is in an Inactive state and not able to start jobs.

PartitionNodeLimit

The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.

PartitionTimeLimit

The job's time limit exceeds its partition's current time limit.

Priority

One or more higher priority jobs exist for this partition or advanced reservation.

Prolog

Its PrologSlurmctld program is still running.

QOSJobLimit

The job's QOS has reached its maximum job count.

QOSResourceLimit

The job's QOS has reached some resource limit.

QOSTimeLimit

The job's QOS has reached its time limit.

ReqNodeNotAvail

Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available.

Reservation

The job is waiting for its advanced reservation to become available.

Resources

The job is waiting for resources to become available.

SystemFailure

Failure of the Slurm system, a file system, the network, etc.

TimeLimit

The job exhausted its time limit.

QOSUsageThreshold

Required QOS threshold has been breached.

WaitingForScheduling

No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason.
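These reason codes appear in the NODELIST(REASON) column of squeue and in the Reason field of scontrol show job. As a sketch (the job ID is just an example), the state and reason of a specific job can be queried directly with an output format string:

test@login01:~$ squeue -j 173 -o "%i %T %r"
JOBID STATE REASON
173 PENDING Resources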