Sometimes you'll need to submit a large number of jobs from a single script: thousands of independent jobs, each with its own input, output and error files. As a general rule, it's not a good idea to generate thousands of separate job scripts and submit them to the cluster one by one.
Slurm allows users to submit a special kind of job which executes a single script with N different input files. This is called an 'array job': it is submitted to the cluster only once and can be managed as a single job.
Creating an array job is slightly different from creating a single job, since we have to manage the N different inputs. In this example, we'll simply copy the contents of the N different input files to the corresponding output files.
First, we create the input files. As a general rule, the easiest way to name them is to append an integer to the end of the file name and use the environment variable SLURM_ARRAY_TASK_ID to select them from the script. With the following 'for' command we create ten small input files:
for i in {1..10}; do echo "File $i" > data-$i.log; done
Let's verify the input files:
test@login01:~/array_jobs$ ls -l
total 6
-rwxr-xr-x 1 test info_users 441 Nov 29 17:40 array.sh
-rw-r--r-- 1 test info_users   8 Nov 29 17:26 data-10.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-1.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-2.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-3.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-4.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-5.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-6.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-7.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-8.log
-rw-r--r-- 1 test info_users   7 Nov 29 17:26 data-9.log
When creating an array job, the --array flag tells the scheduler which index values to run, and therefore which input files each task will process.
-a, --array=<indexes>
Submit a job array, multiple jobs to be executed with identical parameters. The indexes specification identifies what array index values should be used. Multiple values may be specified using a comma separated list and/or a range of values with a "-" separator. For example, "--array=0-15" or "--array=0,6,16-32". A step function can also be specified with a suffix containing a colon and number. For example, "--array=0-15:4" is equivalent to "--array=0,4,8,12". A maximum number of simultaneously running tasks from the job array may be specified using a "%" separator. For example "--array=0-15%4" will limit the number of simultaneously running tasks from this job array to 4. The minimum index value is 0. The maximum value is one less than the configuration parameter MaxArraySize.
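For illustration, here are a few ways these index specifications could be combined with sbatch (generic sketches using the array.sh script from this example; none of them are required for the walkthrough below):

sbatch --array=0,6,16-32 array.sh   # explicit list plus a range
sbatch --array=0-15:4 array.sh      # range with a step of 4: runs 0,4,8,12
sbatch --array=0-15%4 array.sh      # at most 4 tasks running at the same time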
In the example, we want to process the ten files, starting at 1, ending at 10, taking them one after another. So our option has to be --array=1-10:1 (start at 1, end at 10, increment by 1 each time):
#!/bin/bash
#SBATCH -J test_array10
#SBATCH -p short
#SBATCH -N 1
#SBATCH -n 2                       # number of tasks
#SBATCH --array=1-10:1
#SBATCH -o slurm.%N.%J.%u.%a.out   # STDOUT
#SBATCH -e slurm.%N.%J.%u.%a.err   # STDERR

cat data-${SLURM_ARRAY_TASK_ID}.log > output-${SLURM_ARRAY_TASK_ID}.log
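Note the %a in the -o and -e patterns: it expands to the array task index, so each task writes to its own pair of files. If you prefer to group the files under the array's master job ID, Slurm's filename patterns also provide %A (master array job ID); a possible alternative (just a naming suggestion, not required for this example) would be:

#SBATCH -o slurm.%A_%a.out
#SBATCH -e slurm.%A_%a.err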
Submit the array job using the sbatch command:
test@login01:~/array_jobs$ sbatch array.sh
Submitted batch job 720
Monitor the job status using the squeue command: the 10 tasks are running on the short partition. You can see the same job ID repeated 10 times, with the tasks spread over the nodes of the cluster. Note that this time the squeue command shows a suffix with the task index in the JOBID column:
JOBID      PARTITION NAME     USER ST TIME NODES NODELIST(REASON)
118254_1   short     test_arr test R  0:00 1     node028
118254_2   short     test_arr test R  0:00 1     node028
118254_3   short     test_arr test R  0:00 1     node028
118254_4   short     test_arr test R  0:00 1     node028
118254_5   short     test_arr test R  0:00 1     node028
118254_6   short     test_arr test R  0:00 1     node028
118254_7   short     test_arr test R  0:00 1     node028
118254_8   short     test_arr test R  0:00 1     node028
118254_9   short     test_arr test R  0:00 1     node028
118254_10  short     test_arr test R  0:00 1     node030
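If other jobs are also queued, you can restrict the listing to this array or to your own jobs (a quick sketch using the job ID shown above):

squeue -j 118254            # only the tasks of array job 118254
squeue -u test -t RUNNING   # only the running jobs of user test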
The output files are created following the output-* names defined in the script:
test@login01:~/array_jobs$ ls -l output-*
-rw-r--r-- 1 test info_users 8 Nov 29 19:02 output-10.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-1.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-2.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-3.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-4.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-5.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-6.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-7.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-8.log
-rw-r--r-- 1 test info_users 7 Nov 29 19:02 output-9.log
The content of every output file is the same as its parent input file, as we specified in our script:
test@login01:~/array_jobs$ cat output-1.log
File 1
test@login01:~/array_jobs$ cat output-6.log
File 6
test@login01:~/array_jobs$
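To check all ten results at once instead of one by one, a small shell loop such as this (a convenience sketch run from the submission directory, not part of the job script) compares each output file with its input:

for i in {1..10}; do
    diff -q data-$i.log output-$i.log && echo "task $i OK"
done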
To cancel all the tasks of an array, run scancel with the job ID of the whole array job. It will mark all ten independent tasks for deletion:
test@login01:~/array_jobs$ squeue
JOBID   PARTITION NAME     USER ST TIME    NODES NODELIST(REASON)
178     short     prova_da test CG 59:50   1     node001
173     medium    prova_da test CG 1:02:06 1     node001
184     short     prova_da test CG 1:06    1     node001
92      short     prova_da test CG 4:18    1     node004
91      short     prova_da test CG 4:25    1     node004
730_2   short     prova_da test R  0:01    4     node[006-008]
730_3   short     prova_da test R  0:01    4     node[006-008]
730_4   short     prova_da test R  0:01    4     node[006-008]
730_5   short     prova_da test R  0:01    4     node[006-008]
730_6   short     prova_da test R  0:01    4     node[006-008]
730_7   short     prova_da test R  0:01    4     node[006-008]
730_8   short     prova_da test R  0:01    4     node[006-008]
730_9   short     prova_da test R  0:01    4     node[006-008]
730_10  short     prova_da test R  0:01    4     node[006-008]
730_1   short     prova_da test S  0:00    4     node[006-008]
test@login01:~/array_jobs$ scancel 730
To cancel a single task of an array, append the task ID to the array's job ID with an underscore. The syntax is: scancel <array-job-id>_<task-id>:
test@login01:~/array_jobs$ scancel 750_8
test@login01:~/array_jobs$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
178 short prova_da test CG 59:50 1 node001
173 medium prova_da test CG 1:02:06 1 node001
750_8 short prova_da test CG 0:07 1 node006
184 short prova_da test CG 1:06 1 node001
92 short prova_da test CG 4:18 1 node004
91 short prova_da test CG 4:25 1 node004
750_2 short prova_da test R 0:08 4 node[006-008]
750_3 short prova_da test R 0:08 4 node[006-008]
750_4 short prova_da test R 0:08 4 node[006-008]
750_5 short prova_da test R 0:08 4 node[006-008]
750_6 short prova_da test R 0:08 4 node[006-008]
750_7 short prova_da test R 0:08 4 node[006-008]
750_9 short prova_da test R 0:08 4 node[006-008]
750_10 short prova_da test R 0:08 4 node[006-008]
750_1 short prova_da test S 0:00 4 node[006-008]
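scancel can also take a range of task IDs in brackets, which is handy when you want to remove several elements of the array but not all of them; for example (a sketch reusing the array job ID above):

scancel 750_[2-5]   # cancel tasks 2 to 5 of array job 750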
We will now submit a 10-task array job in order to see how Slurm allocates resources for arrays.
Example 1: The job reserves 2 nodes and 4 CPUs for each task.
#!/bin/bash
#SBATCH -J dani_sof
#SBATCH -p high
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --array=1-10:1
#SBATCH --chdir=/home/test/Proves
#SBATCH --mem=4G
#
echo "-n 4"
For each task of the array job, Slurm will allocate 2 nodes and 4 CPUs in total. However, that does not mean the task actually uses 4 CPUs: in this case we have a Matlab process that does not use multiple cores per thread. Another important point is that Slurm allocates 2 nodes for each task, but only 1 node is used, because the Matlab process cannot run in multi-node mode (OpenMPI is not supported). For this reason, node002 will never run anything.
Therefore, the "-N" parameter should be "1".
root@node003:~# squeue -u test
JOBID       PARTITION NAME     USER ST TIME NODES NODELIST(REASON)
1699284_1   high      dani_sof test R  0:01 2     node[001-002]
1699284_2   high      dani_sof test R  0:01 2     node[001-002]
1699284_3   high      dani_sof test R  0:01 2     node[001-002]
1699284_4   high      dani_sof test R  0:01 2     node[001-002]
1699284_5   high      dani_sof test R  0:01 2     node[001-002]
1699284_6   high      dani_sof test R  0:01 2     node[001-002]
1699284_7   high      dani_sof test R  0:01 2     node[001-002]
1699284_8   high      dani_sof test R  0:01 2     node[001-002]
1699284_9   high      dani_sof test R  0:01 2     node[001-002]
1699284_10  high      dani_sof test R  0:01 2     node[001-002]
Example 2: In this case, we reserve only 1 node and 4 CPUs for each array job task. Note that we reserve only 1 node per task; this does not mean Slurm will allocate additional nodes to a task if it needs more resources.
#!/bin/bash
#SBATCH -J dani_soft
#SBATCH -p high
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --array=1-10:1
#SBATCH --chdir=/home/test/Proves
#SBATCH --mem=4G
#
echo "-n 4"
We have reserved 4 CPUs and 1 node for each task, and there are 10 tasks (10 tasks * 4 CPUs = 40 CPUs). The whole array job runs on node001 because it has enough free CPUs. Each task's allocation is:
TRES=cpu=4,mem=4G,node=1
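This per-task allocation can be checked while the array is running with scontrol, for instance (a sketch using one task of the array above):

scontrol show job 1699304_1 | grep TRES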
root@node003:~# squeue -u test
JOBID       PARTITION NAME     USER ST TIME NODES NODELIST(REASON)
1699304_1   high      dani_sof test R  0:54 1     node001
1699304_2   high      dani_sof test R  0:54 1     node001
1699304_3   high      dani_sof test R  0:54 1     node001
1699304_4   high      dani_sof test R  0:54 1     node001
1699304_5   high      dani_sof test R  0:54 1     node001
1699304_6   high      dani_sof test R  0:54 1     node001
1699304_7   high      dani_sof test R  0:54 1     node001
1699304_8   high      dani_sof test R  0:54 1     node001
1699304_9   high      dani_sof test R  0:54 1     node001
1699304_10  high      dani_sof test R  0:54 1     node001
Example 3: We reserve only 1 node per task, but now we reserve 8 CPUs per task. In total that is 10 tasks * 8 CPUs = 80 CPUs.
#!/bin/bash
#SBATCH -J dani_soft
#SBATCH -p high
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --array=1-10:1
#SBATCH --chdir=/home/test/Proves
#SBATCH --mem=4G
#
echo "-n 4"
We are reserving 8 CPUs for each array job task, and now Slurm will use more nodes even though we only ask for 1 node per task in our script. Why? In example 2, Slurm allocated a single node because the 10 tasks * 4 CPUs = 40 CPUs all fit on node001. In this case, 8 CPUs per task * 10 tasks = 80 CPUs in total, so Slurm allocates 3 nodes. In theory it could be possible to fit the array on only 2 nodes, but probably there were no two nodes with 40 free CPUs each at that moment.
root@node003:~# squeue -u test
JOBID       PARTITION NAME     USER ST TIME NODES NODELIST(REASON)
1699316_1   high      dani_sof test R  0:01 1     node002
1699316_2   high      dani_sof test R  0:01 1     node001
1699316_3   high      dani_sof test R  0:01 1     node001
1699316_4   high      dani_sof test R  0:01 1     node001
1699316_5   high      dani_sof test R  0:01 1     node001
1699316_6   high      dani_sof test R  0:01 1     node001
1699316_7   high      dani_sof test R  0:01 1     node001
1699316_8   high      dani_sof test R  0:01 1     node003
1699316_9   high      dani_sof test R  0:01 1     node003
1699316_10  high      dani_sof test R  0:01 1     node003
In other words, if we ask for 4 nodes we are not reserving them for the whole array, but for each task of the array. If a task needs only one node, we will be wasting resources; and if we ask for more cores than a task can actually use, again we will be wasting resources.
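As a rule of thumb, size the request to what a single task really uses. A minimal template for a single-node, multi-threaded task might look like this (my_array and my_program are placeholders; adjust the partition, CPU count and memory to your application):

#!/bin/bash
#SBATCH -J my_array
#SBATCH -p short
#SBATCH -N 1                  # each task fits in one node
#SBATCH --cpus-per-task=4     # only the cores one task really uses
#SBATCH --mem=4G
#SBATCH --array=1-10

./my_program data-${SLURM_ARRAY_TASK_ID}.log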