Guies BibTIC: HPC High Performance Computing: 5.8. Monitoring

Tools

Monitoring jobs

We can monitor our jobs with the squeue command. If we call squeue without arguments, it will show the state of the jobs for all users.

The squeue output shows the job ID, the partition in which the job is being processed, the job name, the user, the state (ST), the time the job has been running, the number of nodes on which it is being executed, and the node list or, for a job that is not running, the reason code.

test@login01:~/parallel$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               178     short prova_da  test CG      59:50      1 node001
               173    medium prova_da  test CG    1:02:06      1 node001
               788     short prova_da  test  S       0:00      4 node[005-008]

Fig 1. squeue information

Field	Description	Comment
JOBID	Job ID	Numerical identifier of the job.
PARTITION	Queue	Partition (queue) where the job runs.
NAME	Job name	Name given with the "-J" parameter.
USER	Job owner	Name of the user who owns the job.
ST	Job status	Current state of the job. The available states are listed in Fig 2.
TIME	Execution time	Time elapsed since the job started.
NODES	Number of nodes	Number of nodes where the job is being executed.
NODELIST(REASON)	Nodes	Nodes where the job is being executed or, for a job that is not running, the reason it is waiting.
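As a minimal sketch of how these columns can be used, the following shell function counts the jobs in each state by reading the ST column (the fifth field) of the default squeue output. The function name count_by_state is hypothetical, and the sketch assumes the default column layout and job names without spaces. The sample input is the listing shown above.

```shell
# count_by_state: read default squeue output on stdin and print
# "STATE COUNT" for each state code found in the ST column (field 5).
# Assumes the default squeue layout and job names without spaces.
count_by_state() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the squeue listing above.
count_by_state <<'EOF'
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               178     short prova_da  test CG      59:50      1 node001
               173    medium prova_da  test CG    1:02:06      1 node001
               788     short prova_da  test  S       0:00      4 node[005-008]
EOF
```

On the cluster the same function can be fed directly from the command, for example squeue | count_by_state.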

Fig 2. Description of job states

JOB STATE CODES
BF BOOT_FAIL	Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).
CA CANCELLED	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD COMPLETED	Job has terminated all processes on all nodes with an exit code of zero.
CF CONFIGURING	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
CG COMPLETING	Job is in the process of completing. Some processes on some nodes may still be active.
F FAILED	Job terminated with non-zero exit code or other failure condition.
NF NODE_FAIL	Job terminated due to failure of one or more allocated nodes.
PD PENDING	Job is awaiting resource allocation.
PR PREEMPTED	Job terminated due to preemption.
RV REVOKED	Sibling was removed from cluster due to other cluster starting the job.
R RUNNING	Job currently has an allocation.
SE SPECIAL_EXIT	The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value.
ST STOPPED	Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUs have been retained by this job.
S SUSPENDED	Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
TO TIMEOUT	Job terminated upon reaching its time limit.
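When a script needs to react to these codes, a small helper can group them. The sketch below treats as "ended" the codes that the table describes as terminated or cancelled (BF, CA, CD, F, NF, PR, TO). The function name job_has_ended and this grouping are assumptions made for illustration, not part of Slurm.

```shell
# job_has_ended: succeed (exit status 0) if the given squeue state code
# is one the table describes as terminated or cancelled.
# The grouping is an illustrative assumption, not a Slurm definition.
job_has_ended() {
    case "$1" in
        BF|CA|CD|F|NF|PR|TO) return 0 ;;
        *) return 1 ;;
    esac
}

# Classify a few codes from the table.
for code in CD R PD TO; do
    if job_has_ended "$code"; then
        echo "$code: ended"
    else
        echo "$code: still queued or active"
    fi
done
```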

Fig 3. squeue command options

Option	Argument	Comment
-u	<user>	Show the jobs of the given user.
-j	<job_id>	Show only the job with the given job ID.
-o	%A	Job ID.
-o	%C	Number of CPUs (processors) requested by the job.
-o	%N	List of nodes allocated to the job or job step.
-o	%T	Job state, in extended form.

squeue has many more options, which are described in man squeue. Several -o format fields can be combined in a single call:

test@login01:~/parallel$ squeue -j 846 -o %A.%C.%N.%T
JOBID.CPUS.NODELIST.STATE
846.16.node[005-008].RUNNING
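The -j and -o options can be combined in a script that waits for a job. The following is a minimal sketch, assuming a single job that is not a job array. The function name wait_for_job and the 30-second default interval are hypothetical choices. The -h option suppresses the header line so that only the state is printed.

```shell
# wait_for_job JOBID [INTERVAL]: poll squeue until the job is no longer
# pending, running or otherwise active. INTERVAL is the pause between
# polls in seconds (default 30, an arbitrary choice for this sketch).
wait_for_job() {
    while :; do
        # -h omits the header, %T prints the extended state name.
        # Errors (e.g. an unknown job ID) are discarded and leave state empty.
        state=$(squeue -h -j "$1" -o %T 2>/dev/null)
        case "$state" in
            PENDING|CONFIGURING|RUNNING|COMPLETING|SUSPENDED|STOPPED)
                echo "job $1: $state"
                sleep "${2:-30}" ;;
            *)
                break ;;
        esac
    done
    echo "job $1 is no longer active"
}

# Usage on the cluster (846 is the job ID from the example above):
#   wait_for_job 846 60
```

Polling at a long interval keeps the load on the scheduler low, which matters when many users run similar loops.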

If we want to check the general state of nodes we can use the sinfo tool.

For example, sinfo -Nel -n node009 shows the state of node009 only:

test@login01:~/parallel$ sinfo -Nel -n node009
Mon Dec  4 20:20:19 2017
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node009        1    short*        idle   48   2:12:2  96000        0      1 intel,bd none
node009        1    medium        idle   48   2:12:2  96000        0      1 intel,bd none
node009        1      high        idle   48   2:12:2  96000        0      1 intel,bd none
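Because sinfo -N prints one line per node and partition, a node that belongs to several partitions appears several times, as node009 does above. The sketch below lists each idle node once. The function name idle_nodes is hypothetical, and the node008 line in the sample input is invented for illustration. The function expects "NODE STATE" pairs, which sinfo can produce with the %N and %T format fields.

```shell
# idle_nodes: read "NODE STATE" pairs on stdin and print each node whose
# state is exactly "idle" once, even if it is listed for several partitions.
idle_nodes() {
    awk '$2 == "idle" { print $1 }' | sort -u
}

# Illustrative input; the node008 line is made up for this example.
# On the cluster: sinfo -N -h -o "%N %T" | idle_nodes
idle_nodes <<'EOF'
node008 allocated
node009 idle
node009 idle
node009 idle
EOF
```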

There’s another interesting command to monitor our jobs: smap

....AAAAAAAA....####.

Tue Dec 05 14:55:43 2017
ID JOBID              PARTITION USER     NAME      ST      TIME NODES NODELIST
A  854                short     test     prova_dan R   00:00:01     8 node[005-012]