HPC High Performance Computing: 5.3. Submitting array jobs

Submitting an array of jobs

Sometimes you'll need to submit a large number of similar jobs. In that case you may have to handle thousands of independent jobs, each with its own input, output and error files. As a general rule, it's not a good idea to generate thousands of separate job scripts and submit them one by one to the cluster.

Slurm allows users to submit a special kind of job which executes the same script with N different input files. This is called an 'array job': it is submitted to the cluster only once and can be managed as a single job.

Creating an array job is slightly different from creating a single job, since we have to manage the N different inputs. In this example, we'll simply copy the contents of the N input files to the corresponding output files:

First, we create the input files that hold the input data. As a general rule, the easiest way to name them is to append an integer to the end of each file name and use the environment variable SLURM_ARRAY_TASK_ID to refer to them from the script. With the following 'for' command we create ten small input files:

for i in {1..10}; do echo "File $i" > data-$i.log; done

Let's verify the input files:

test@login01:~/array_jobs$ ls -l
total 6
-rwxr-xr-x 1 test info_users 441 Nov 29 17:40 array.sh
-rw-r--r-- 1 test info_users   8 Nov 29 17:26 data-10.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-1.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-2.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-3.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-4.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-5.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-6.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-7.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-8.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-9.log

When creating an array job, the --array flag tells the scheduler which index values (and therefore which input files) to use.

-a, --array=<indexes>

Submit a job array, multiple jobs to be executed with identical parameters. The indexes specification identifies what array index values should be used. Multiple values may be specified using a comma separated list and/or a range of values with a "-" separator. For example, "--array=0-15" or "--array=0,6,16-32". A step function can also be specified with a suffix containing a colon and number. For example, "--array=0-15:4" is equivalent to "--array=0,4,8,12". A maximum number of simultaneously running tasks from the job array may be specified using a "%" separator. For example "--array=0-15%4" will limit the number of simultaneously running tasks from this job array to 4. The minimum index value is 0. The maximum value is one less than the configuration parameter MaxArraySize.
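
To make the syntax above more concrete, here are a few illustrative index specifications (they are generic examples of the options described in the manual excerpt, not settings needed for this tutorial):

#SBATCH --array=1-10          # tasks 1, 2, ..., 10
#SBATCH --array=0,6,16-32     # an explicit list combined with a range
#SBATCH --array=0-15:4        # step of 4: tasks 0, 4, 8, 12
#SBATCH --array=1-100%10      # 100 tasks, but at most 10 running at the same time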

In this example, we want to process ten files, starting at 1 and ending at 10, taking them one after another. So our --array option has to be --array=1-10:1 (start at 1, end at 10, increment by 1 each time):

#!/bin/bash
#SBATCH -J test_array10
#SBATCH -p short
#SBATCH -N 1
#SBATCH -n 2 #number of tasks
#SBATCH --array=1-10:1
#SBATCH -o slurm.%N.%J.%u.%a.out # STDOUT
#SBATCH -e slurm.%N.%J.%u.%a.err # STDERR
 
 
cat data-${SLURM_ARRAY_TASK_ID}.log > output-${SLURM_ARRAY_TASK_ID}.log
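
Besides SLURM_ARRAY_TASK_ID, Slurm defines a few other array-related environment variables inside each task. The following lines are an optional sketch that could be added to the script for debugging purposes; they are not part of the example above:

echo "Array job ID:  ${SLURM_ARRAY_JOB_ID}"      # ID shared by all tasks of the array
echo "Task ID:       ${SLURM_ARRAY_TASK_ID}"     # index of this particular task (1..10 here)
echo "Task count:    ${SLURM_ARRAY_TASK_COUNT}"  # total number of tasks in the array
echo "Index range:   ${SLURM_ARRAY_TASK_MIN}-${SLURM_ARRAY_TASK_MAX}"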

Submit the array job using the sbatch command:

test@login01:~/array_jobs$ sbatch array.sh 
Submitted batch job 720

Monitor the job status using the squeue command: the 10 tasks are running on the short partition. You can see the array job ID repeated ten times, possibly spread across different nodes of the cluster. This time, squeue shows an extension (the task index) in the JOBID column:

                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          118254_1     short test_arr  test  R       0:00      1 node028
          118254_2     short test_arr  test  R       0:00      1 node028
          118254_3     short test_arr  test  R       0:00      1 node028
          118254_4     short test_arr  test  R       0:00      1 node028
          118254_5     short test_arr  test  R       0:00      1 node028
          118254_6     short test_arr  test  R       0:00      1 node028
          118254_7     short test_arr  test  R       0:00      1 node028
          118254_8     short test_arr  test  R       0:00      1 node028
          118254_9     short test_arr  test  R       0:00      1 node028
         118254_10     short test_arr  test  R       0:00      1 node030
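
Note that while array tasks are still pending, squeue may collapse them into a single line such as 720_[1-10]. On most Slurm versions, the -r (--array) option forces one line per task:

test@login01:~/array_jobs$ squeue -u test -r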


The output files are created with the names defined in the script (output-*), in the directory from which the job was submitted:

test@login01:~/array_jobs$ ls -l output-*
-rw-r--r-- 1 test info_users 8 Nov 29 19:02 output-10.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-1.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-2.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-3.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-4.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-5.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-6.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-7.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-8.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-9.log

The content of each output file is the same as its corresponding input file, as specified in our script:

test@login01:~/array_jobs$ cat output-1.log
File 1
test@login01:~/array_jobs$ cat output-6.log
File 6
test@login01:~/array_jobs$ 

To cancel all the tasks of an array, run the scancel command with the job ID of the whole array job. It will mark all ten independent tasks for cancellation:

test@login01:~/array_jobs$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               178     short prova_da  test CG      59:50      1 node001
               173    medium prova_da  test CG    1:02:06      1 node001
               184     short prova_da  test CG       1:06      1 node001
                92     short prova_da  test CG       4:18      1 node004
                91     short prova_da  test CG       4:25      1 node004
             730_2     short prova_da  test  R       0:01      4 node[006-008]
             730_3     short prova_da  test  R       0:01      4 node[006-008]
             730_4     short prova_da  test  R       0:01      4 node[006-008]
             730_5     short prova_da  test  R       0:01      4 node[006-008]
             730_6     short prova_da  test  R       0:01      4 node[006-008]
             730_7     short prova_da  test  R       0:01      4 node[006-008]
             730_8     short prova_da  test  R       0:01      4 node[006-008]
             730_9     short prova_da  test  R       0:01      4 node[006-008]
            730_10     short prova_da  test  R       0:01      4 node[006-008]
             730_1     short prova_da  test  S       0:00      4 node[006-008]
test@login01:~/array_jobs$ scancel 730

To cancel a single task of an array, append the task ID to the job ID with an underscore. The syntax is: scancel <array-job-id>_<task-id>:

test@login01:~/array_jobs$ scancel 750_8
test@login01:~/array_jobs$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               178     short prova_da  test CG      59:50      1 node001
               173    medium prova_da  test CG    1:02:06      1 node001
             750_8     short prova_da  test CG       0:07      1 node006
               184     short prova_da  test CG       1:06      1 node001
                92     short prova_da  test CG       4:18      1 node004
                91     short prova_da  test CG       4:25      1 node004
             750_2     short prova_da  test  R       0:08      4 node[006-008]
             750_3     short prova_da  test  R       0:08      4 node[006-008]
             750_4     short prova_da  test  R       0:08      4 node[006-008]
             750_5     short prova_da  test  R       0:08      4 node[006-008]
             750_6     short prova_da  test  R       0:08      4 node[006-008]
             750_7     short prova_da  test  R       0:08      4 node[006-008]
             750_9     short prova_da  test  R       0:08      4 node[006-008]
            750_10     short prova_da  test  R       0:08      4 node[006-008]
             750_1     short prova_da  test  S       0:00      4 node[006-008]
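
On recent Slurm versions, scancel also accepts a range of task indexes, so a subset of the array can be cancelled with a single command. A hedged sketch, reusing the array job ID from this example:

test@login01:~/array_jobs$ scancel 750_[2-5]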

How it works (detailed)

We will submit an array job with 10 tasks in order to observe how Slurm allocates resources for it.

Example 1: The job reserves 2 nodes and 4 CPUs for each task.

#!/bin/bash
#SBATCH -J dani_sof
#SBATCH -p high
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --array=1-10:1
#SBATCH --chdir=/homedtic/test/Proves
#SBATCH --mem=4G

# echo "-n 4"

For each task of the array job, Slurm will allocate 2 nodes and 4 CPUs in total. However, this does not mean that each task actually uses 4 CPUs: in this case we run a Matlab process that does not use more than one core. Another important point is that Slurm allocates 2 nodes per task, but only 1 node is actually used, because the Matlab process cannot run in multi-node mode (OpenMPI is not supported). For this reason, node002 will never run anything.

Therefore, the “N” parameter must be “1”. 

root@node003:~# squeue -u test
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         1699284_1      high dani_sof  test  R       0:01      2 node[001-002]
         1699284_2      high dani_sof  test  R       0:01      2 node[001-002]
         1699284_3      high dani_sof  test  R       0:01      2 node[001-002]
         1699284_4      high dani_sof  test  R       0:01      2 node[001-002]
         1699284_5      high dani_sof  test  R       0:01      2 node[001-002]
         1699284_6      high dani_sof  test  R       0:01      2 node[001-002]
         1699284_7      high dani_sof  test  R       0:01      2 node[001-002]
         1699284_8      high dani_sof  test  R       0:01      2 node[001-002]
         1699284_9      high dani_sof  test  R       0:01      2 node[001-002]
        1699284_10      high dani_sof  test  R       0:01      2 node[001-002]

Example 2: In this case, we request only 1 node and 4 CPUs for each array task. Pay attention here: we request only 1 node per task, and Slurm will not allocate additional nodes to a task even if it needs more resources.

#!/bin/bash
#SBATCH -J dani_soft
#SBATCH -p high
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --array=1-10:1
#SBATCH --chdir=/homedtic/test/Proves
#SBATCH --mem=4G

# echo "-n 4"

Since we have requested 4 CPUs and 1 node for each task, and there are 10 tasks, the array needs 10 tasks * 4 CPUs = 40 CPUs in total. The whole array runs on node001 because that node is empty and has enough free CPUs. The per-task allocation reported by Slurm is:

TRES=cpu=4,mem=4G,node=1
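
If needed, the allocation of an individual task can also be inspected with scontrol. A minimal sketch, reusing a task ID from this example (the exact TRES field names may vary between Slurm versions):

root@node003:~# scontrol show job 1699304_3 | grep -i tres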

root@node003:~# squeue -u test
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         1699304_1      high dani_sof  test  R       0:54      1 node001
         1699304_2      high dani_sof  test  R       0:54      1 node001
         1699304_3      high dani_sof  test  R       0:54      1 node001
         1699304_4      high dani_sof  test  R       0:54      1 node001
         1699304_5      high dani_sof  test  R       0:54      1 node001
         1699304_6      high dani_sof  test  R       0:54      1 node001
         1699304_7      high dani_sof  test  R       0:54      1 node001
         1699304_8      high dani_sof  test  R       0:54      1 node001
         1699304_9      high dani_sof  test  R       0:54      1 node001
        1699304_10      high dani_sof  test  R       0:54      1 node001

Example 3: We request only 1 node per task, but this time 8 CPUs per task. In total, that is 10 tasks * 8 CPUs = 80 CPUs.

#!/bin/bash
#SBATCH -J dani_soft
#SBATCH -p high
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --array=1-10:1
#SBATCH --chdir=/homedtic/test/Proves
#SBATCH --mem=4G

We are requesting 8 CPUs for each array task, and Slurm will use more than one node even though we only ask for 1 node per task in our script. Why? In example 2, Slurm allocated a single node because the 10 tasks * 4 CPUs = 40 CPUs of the array fit on node001. In this case, 8 CPUs per task * 10 tasks = 80 CPUs in total, so Slurm allocates 3 nodes. In theory two nodes could be enough, but probably there were no two nodes with 40 free CPUs each at that moment.
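
One way to understand why the tasks spread over several nodes is to check how many CPUs each node in the partition has and how many of them are free. A hedged sketch using sinfo format specifiers (%c prints the CPUs per node, %C prints allocated/idle/other/total CPUs):

root@node003:~# sinfo -p high -N -o "%N %c %C"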

root@node003:~# squeue -u test
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         1699316_1      high dani_sof  test  R       0:01      1 node002
         1699316_2      high dani_sof  test  R       0:01      1 node001
         1699316_3      high dani_sof  test  R       0:01      1 node001
         1699316_4      high dani_sof  test  R       0:01      1 node001
         1699316_5      high dani_sof  test  R       0:01      1 node001
         1699316_6      high dani_sof  test  R       0:01      1 node001
         1699316_7      high dani_sof  test  R       0:01      1 node001
         1699316_8      high dani_sof  test  R       0:01      1 node003
         1699316_9      high dani_sof  test  R       0:01      1 node003
        1699316_10      high dani_sof  test  R       0:01      1 node003

Important Conclusions

  • The “-n” parameter is the number of cores requested by each array job task.
  • The “-N” parameter is the number of nodes requested by each array job task. When a node runs out of free CPUs for the array, Slurm allocates another node with CPUs available, and so on.
  • If an array job task cannot use more than one CPU, the “-n” parameter must be set to 1; otherwise we are asking for more resources than we need.
  • If we force the array job to run on a single node, the number of tasks that can run at the same time is the number of CPUs in that node divided by the number of cores requested per task.
  • In the script (and this is probably the most important conclusion) the requested resources (memory, number of nodes, number of cores, etc.) apply to each task of the array, not to the entire array job.

In other words, if we ask for 4 nodes, we are not reserving them once for the whole array but once for each task. If a task needs only one node, we will be wasting resources; and if we ask for more cores than a task can use, we will again be wasting resources.
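
Putting these conclusions together, the following is a minimal sketch of a header for an array of single-core tasks. The values (partition, memory, array range, throttle) are illustrative, and my_single_core_program is a hypothetical placeholder; adjust everything to your own case:

#!/bin/bash
#SBATCH -J single_core_array
#SBATCH -p short
#SBATCH -N 1                      # each task fits on a single node
#SBATCH -n 1                      # each task uses a single core
#SBATCH --mem=1G                  # memory per task, not for the whole array
#SBATCH --array=1-100%10          # 100 tasks, at most 10 running simultaneously
#SBATCH -o slurm.%N.%J.%u.%a.out  # STDOUT, one file per task
#SBATCH -e slurm.%N.%J.%u.%a.err  # STDERR, one file per task

# my_single_core_program is a hypothetical placeholder for your real program
./my_single_core_program data-${SLURM_ARRAY_TASK_ID}.log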