We can monitor our jobs with the squeue command. Called without arguments, squeue shows the state of the jobs of all users.
For each job, the output lists the job ID, the partition where the job is being processed, the job name, the user, the state (ST), the time the job has been running, the number of nodes it uses, and the node list (or a reason code when the job is not running).
    test@login01:~/parallel$ squeue
    JOBID PARTITION     NAME USER ST    TIME NODES NODELIST(REASON)
      178     short prova_da test CG   59:50     1 node001
      173    medium prova_da test CG 1:02:06     1 node001
      788     short prova_da test  S    0:00     4 node[005-008]
Fig 1. squeue information
Field | Description | Comment |
---|---|---|
JOBID | Numerical ID of the job | Numerical identifier of the job |
PARTITION | Queue | Queue (partition) where the job runs |
NAME | Job name | The name set with the "-J" parameter |
USER | Name of job owner | The name of the user who owns the job |
ST | Job status | Job status. The available states are listed in Fig 2. |
TIME | Execution time | Time elapsed since the job started |
NODES | Number of nodes | Number of nodes where the job is being executed |
NODELIST(REASON) | Nodes | Nodes where the job is being executed, or the reason why the job is not running |
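The ST column described above can be summarised with a few lines of shell. The following is a minimal sketch, not part of the original guide: `count_states` is a hypothetical helper name, and it assumes the default squeue column layout shown earlier, with job names that contain no spaces (otherwise the fifth column would not be ST).

```shell
# count_states (hypothetical helper): read default squeue output on stdin
# and print one "STATE COUNT" line per job state.
# The awk program skips the header line and counts column 5 (ST).
count_states() {
    awk 'NR > 1 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample input copied from the squeue output shown earlier.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
178 short prova_da test CG 59:50 1 node001
173 medium prova_da test CG 1:02:06 1 node001
788 short prova_da test S 0:00 4 node[005-008]'

printf '%s\n' "$sample" | count_states
```

On a login node with Slurm available, the same helper could be fed live data, for example with `squeue -u "$USER" | count_states`.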
Fig 2. Description of job states
Code | State | Description |
---|---|---|
BF | BOOT_FAIL | Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job cannot be requeued). |
CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
CF | CONFIGURING | Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting). |
CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
F | FAILED | Job terminated with non-zero exit code or other failure condition. |
NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes. |
PD | PENDING | Job is awaiting resource allocation. |
PR | PREEMPTED | Job terminated due to preemption. |
RV | REVOKED | Sibling was removed from cluster due to other cluster starting the job. |
R | RUNNING | Job currently has an allocation. |
SE | SPECIAL_EXIT | The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value. |
ST | STOPPED | Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUs have been retained by this job. |
S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
TO | TIMEOUT | Job terminated upon reaching its time limit. |
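When post-processing squeue output in a script, it can help to expand the short codes into the full state names. The lookup below is an illustrative sketch only: `state_name` is a hypothetical helper, and it covers just the codes listed in the table above (with RV for REVOKED, as in the Slurm documentation).

```shell
# state_name (hypothetical helper): map a short squeue state code
# to its full state name; unknown codes print UNKNOWN and return 1.
state_name() {
    case "$1" in
        BF) echo BOOT_FAIL ;;
        CA) echo CANCELLED ;;
        CD) echo COMPLETED ;;
        CF) echo CONFIGURING ;;
        CG) echo COMPLETING ;;
        F)  echo FAILED ;;
        NF) echo NODE_FAIL ;;
        PD) echo PENDING ;;
        PR) echo PREEMPTED ;;
        RV) echo REVOKED ;;
        R)  echo RUNNING ;;
        SE) echo SPECIAL_EXIT ;;
        ST) echo STOPPED ;;
        S)  echo SUSPENDED ;;
        TO) echo TIMEOUT ;;
        *)  echo UNKNOWN; return 1 ;;
    esac
}

state_name CG
state_name PD
```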
Fig 3. squeue command options
Option | Argument | Comment |
---|---|---|
-u | <user> | Shows the jobs of the given user |
-j | <job_id> | Shows only the job with the given job ID |
-o | %A | Job ID |
-o | %C | Number of CPUs (processors) requested by the job |
-o | %N | List of nodes allocated to the job or job step |
-o | %T | Job state |
    test@login01:~/parallel$ squeue -j 846 -o %A.%C.%N.%T
    JOBID.CPUS.NODELIST.STATE
    846.16.node[005-008].RUNNING
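Custom -o formats like the one above are convenient for scripts. The sketch below is an assumption-laden illustration rather than part of the original guide: it uses `|` instead of `.` as the separator (host names may themselves contain dots), assumes the header is suppressed with squeue's `-h` (`--noheader`) option, and `parse_job_line` is a hypothetical helper name.

```shell
# parse_job_line (hypothetical helper): split one "jobid|cpus|nodelist|state"
# line, as produced by a format such as '%A|%C|%N|%T', into labelled fields.
parse_job_line() {
    printf '%s\n' "$1" |
        awk -F'|' '{ printf "job=%s cpus=%s nodes=%s state=%s\n", $1, $2, $3, $4 }'
}

# Sample line built from the values in the output shown above.
parse_job_line '846|16|node[005-008]|RUNNING'
```

On a system with Slurm installed, the input line could come from something like `squeue -h -j 846 -o '%A|%C|%N|%T'`.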
To check the general state of the nodes we can use the sinfo tool.
For example, sinfo -Nel -n node009 reports only the state of node009:
    test@login01:~/parallel$ sinfo -Nel -n node009
    Mon Dec 4 20:20:19 2017
    NODELIST  NODES PARTITION STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
    node009       1 short*    idle    48 2:12:2  96000        0      1 intel,bd none
    node009       1 medium    idle    48 2:12:2  96000        0      1 intel,bd none
    node009       1 high      idle    48 2:12:2  96000        0      1 intel,bd none
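Because a node appears once per partition in this listing, a script that wants the set of idle nodes has to deduplicate. The following sketch assumes the `sinfo -Nel` layout shown above (a date line, a header line, then STATE in the fourth column) and matches the state `idle` exactly, so variants with a suffix are not counted; `idle_nodes` is a hypothetical helper name.

```shell
# idle_nodes (hypothetical helper): read `sinfo -Nel` output on stdin and
# print each node whose STATE column is exactly "idle", once per node.
idle_nodes() {
    # Skip the date line and the header line; column 4 is STATE.
    awk 'NR > 2 && $4 == "idle" { print $1 }' | sort -u
}

# Sample input copied from the sinfo output shown above.
sample='Mon Dec 4 20:20:19 2017
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node009 1 short* idle 48 2:12:2 96000 0 1 intel,bd none
node009 1 medium idle 48 2:12:2 96000 0 1 intel,bd none
node009 1 high idle 48 2:12:2 96000 0 1 intel,bd none'

printf '%s\n' "$sample" | idle_nodes
```

With Slurm available, `sinfo -Nel | idle_nodes` would apply the same filter to the whole cluster.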
Another useful command for monitoring our jobs is smap:
    ....AAAAAAAA....####.
    Tue Dec 05 14:55:43 2017
    ID JOBID PARTITION USER NAME      ST TIME     NODES NODELIST
    A  854   short     test prova_dan R  00:00:01 8     node[005-012]