HPC High Performance Computing: 5.8. Monitoring

Monitoring jobs

We can monitor our jobs with the squeue command. If we call squeue without arguments, it will show the state of the jobs for all users.

The squeue output shows the job ID, the partition where the job is being processed, the job name, the user, the state (ST), the time the job has been running, the number of nodes on which it is executing, and the node list (or, for a job that is not running, the reason code).

test@login01:~/parallel$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               178     short prova_da  test CG      59:50      1 node001
               173    medium prova_da  test CG    1:02:06      1 node001
               788     short prova_da  test  S       0:00      4 node[005-008]

Fig 1. squeue information

Field              Description       Comment
JOBID              Job ID            Numerical identifier of the job.
PARTITION          Queue             Queue (partition) where the job runs.
NAME               Job name          Name given to the job with the "-J" parameter.
USER               Job owner         Name of the user who owns the job.
ST                 Job status        Job status. Available states are listed in Fig 2.
TIME               Execution time    Time elapsed since the job started.
NODES              Number of nodes   Number of nodes where the job is being executed.
NODELIST(REASON)   Nodes             Nodes where the job is being executed or, for a job that is not running, the reason why.
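
The columns described above can also be read by a script. The following is a minimal Python sketch, not part of the cluster tooling: `parse_squeue` is a hypothetical helper name, and the sample text is copied from the squeue listing shown earlier. It assumes that no field except the last one contains spaces, which may not hold for every job name.

```python
# Sketch: turn default `squeue` output into one dictionary per job.
# SAMPLE reproduces the listing from this section; parse_squeue is a
# hypothetical helper, not a Slurm API.

SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               178     short prova_da  test CG      59:50      1 node001
               173    medium prova_da  test CG    1:02:06      1 node001
               788     short prova_da  test  S       0:00      4 node[005-008]
"""


def parse_squeue(text):
    """Return a list of dicts keyed by the squeue column headers."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # The last column may contain spaces (for example a reason in
        # parentheses), so split at most len(header) - 1 times.
        fields = line.strip().split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


jobs = parse_squeue(SAMPLE)
for job in jobs:
    print(job["JOBID"], job["ST"], job["NODELIST(REASON)"])
```

On a login node the same function could be fed with the live output, for example `subprocess.run(["squeue"], capture_output=True, text=True, check=True).stdout`.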

Fig 2. Description of job states

 

JOB STATE CODES

BF BOOT_FAIL

Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).

CA CANCELLED

Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.

CD COMPLETED

Job has terminated all processes on all nodes with an exit code of zero.

CF CONFIGURING

Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).

CG COMPLETING

Job is in the process of completing. Some processes on some nodes may still be active.

F FAILED

Job terminated with non-zero exit code or other failure condition.

NF NODE_FAIL

Job terminated due to failure of one or more allocated nodes.

PD PENDING

Job is awaiting resource allocation.

PR PREEMPTED

Job terminated due to preemption.

RV REVOKED

Sibling was removed from cluster due to other cluster starting the job.

R RUNNING

Job currently has an allocation.

SE SPECIAL_EXIT

The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value.

ST STOPPED

Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job.

S SUSPENDED

Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.

TO TIMEOUT

Job terminated upon reaching its time limit.
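
When the ST column is processed by a script, the short codes usually need to be expanded. The sketch below is one possible way to do it in Python. The mapping follows the list above, using RV for REVOKED as in the squeue manual; the `FINISHED` grouping and the `describe_state` name are assumptions of this example, not Slurm definitions.

```python
# Sketch: expand squeue state codes and flag states in which the job has ended.
STATE_NAMES = {
    "BF": "BOOT_FAIL", "CA": "CANCELLED", "CD": "COMPLETED",
    "CF": "CONFIGURING", "CG": "COMPLETING", "F": "FAILED",
    "NF": "NODE_FAIL", "PD": "PENDING", "PR": "PREEMPTED",
    "RV": "REVOKED", "R": "RUNNING", "SE": "SPECIAL_EXIT",
    "ST": "STOPPED", "S": "SUSPENDED", "TO": "TIMEOUT",
}

# Assumption of this sketch: states whose description says "terminated"
# or "cancelled" are treated as finished.
FINISHED = {
    "BOOT_FAIL", "CANCELLED", "COMPLETED", "FAILED",
    "NODE_FAIL", "PREEMPTED", "TIMEOUT",
}


def describe_state(code):
    """Return (long_name, is_finished) for a short state code."""
    name = STATE_NAMES.get(code, "UNKNOWN")
    return name, name in FINISHED


for code in ("CG", "S", "TO"):
    print(code, describe_state(code))
```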

Fig 3. squeue command options

Option   Argument    Comment
-u       <user>      Show the jobs of a given user.
-j       <job_id>    Show information only for the given job ID.
-o       %A          Job ID.
-o       %C          Number of CPUs (processors) requested by the job.
-o       %N          List of nodes allocated to the job or job step.
-o       %T          Job state.

  • squeue has many more options, which you can check with man squeue. For example, to print the job ID, CPU count, node list and state of job 846:
test@login01:~/parallel$ squeue -j 846 -o %A.%C.%N.%T
JOBID.CPUS.NODELIST.STATE
846.16.node[005-008].RUNNING
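
A custom format like the one above is convenient for scripts because each line has a fixed layout. The following Python sketch parses one such line and expands the node list. Both helper names are hypothetical, and `expand_nodelist` only handles a single bracket group such as node[005-008]; for anything more complex, `scontrol show hostnames` is the safer route.

```python
import re


def expand_nodelist(nodelist):
    """Expand a simple expression like 'node[005-008]' into node names.

    Only one bracket group with comma-separated numbers and ranges is
    supported; a plain name such as 'node001' is returned unchanged.
    """
    match = re.fullmatch(r"([^\[]*)\[([^\]]+)\]", nodelist)
    if match is None:
        return [nodelist]
    prefix, body = match.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)  # keep the zero padding, e.g. 005
            for number in range(int(start), int(end) + 1):
                names.append(f"{prefix}{number:0{width}d}")
        else:
            names.append(f"{prefix}{part}")
    return names


def parse_job_line(line):
    """Parse one line produced with the format %A.%C.%N.%T."""
    # Split the two leading numeric fields from the left and the state
    # from the right, so the node list in the middle stays intact.
    jobid, cpus, rest = line.strip().split(".", 2)
    nodelist, state = rest.rsplit(".", 1)
    return {
        "jobid": int(jobid),
        "cpus": int(cpus),
        "nodes": expand_nodelist(nodelist),
        "state": state,
    }


job = parse_job_line("846.16.node[005-008].RUNNING")
print(job)
```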

To check the general state of the nodes we can use the sinfo tool.

For example, to check only the state of node009 we execute sinfo -Nel -n node009:

test@login01:~/parallel$ sinfo -Nel -n node009
Mon Dec  4 20:20:19 2017
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node009        1    short*        idle   48   2:12:2  96000        0      1 intel,bd none
node009        1    medium        idle   48   2:12:2  96000        0      1 intel,bd none
node009        1      high        idle   48   2:12:2  96000        0      1 intel,bd none
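
The sinfo listing can be processed in the same way as the squeue one. The Python sketch below is illustrative only: `parse_sinfo`, `idle_partitions` and `cpus_from_topology` are hypothetical helpers, and the sample text reproduces the output above. It relies on two facts visible in that output: the asterisk after a partition name marks the default partition, and S:C:T gives sockets, cores per socket and threads per core, whose product matches the CPUS column.

```python
SINFO_SAMPLE = """\
Mon Dec  4 20:20:19 2017
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node009        1    short*        idle   48   2:12:2  96000        0      1 intel,bd none
node009        1    medium        idle   48   2:12:2  96000        0      1 intel,bd none
node009        1      high        idle   48   2:12:2  96000        0      1 intel,bd none
"""


def parse_sinfo(text):
    """Return one dict per node/partition row of `sinfo -Nel` output."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Skip the date line: the table starts at the row beginning with NODELIST.
    start = next(i for i, line in enumerate(lines)
                 if line.split()[0] == "NODELIST")
    header = lines[start].split()
    rows = []
    for line in lines[start + 1:]:
        fields = line.strip().split(None, len(header) - 1)
        row = dict(zip(header, fields))
        # A trailing asterisk marks the default partition.
        row["DEFAULT"] = row["PARTITION"].endswith("*")
        row["PARTITION"] = row["PARTITION"].rstrip("*")
        rows.append(row)
    return rows


def idle_partitions(rows, node):
    """List the partitions in which the given node is reported as idle."""
    return [row["PARTITION"] for row in rows
            if row["NODELIST"] == node and row["STATE"] == "idle"]


def cpus_from_topology(row):
    """Multiply sockets, cores per socket and threads per core."""
    sockets, cores, threads = (int(x) for x in row["S:C:T"].split(":"))
    return sockets * cores * threads


rows = parse_sinfo(SINFO_SAMPLE)
print(idle_partitions(rows, "node009"))
print(cpus_from_topology(rows[0]))
```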

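Another useful command for monitoring our jobs is smap. It draws a map of the cluster in which each character of the top row is one node, and the letter shown for a node matches the ID column of the job list below it: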

....AAAAAAAA....####.

Tue Dec 05 14:55:43 2017
ID JOBID              PARTITION USER     NAME      ST      TIME NODES NODELIST
A  854                short     test  prova_dan R   00:00:01     8 node[005-012]