HPC High Performance Computing: 4.7. Submitting multi-node/multi-gpu jobs


A multi-node/multi-GPU job uses one or more GPUs from different nodes. This configuration is only available on the Broadwell nodes (Intel processors), which are connected to the InfiniBand network. Some of the software/libraries compatible with this technology are:

  • NCCL (NVIDIA Collective Communications Library)
  • MPI (Message Passing Interface)
  • Tensorflow
  • PyTorch
  • Torchvision
  • Horovod

Submitting multi-node/multi-gpu jobs

Before writing the script, it is essential to highlight that:

  • We have to specify the number of nodes that we want to use: #SBATCH --nodes=X
  • We have to specify the number of GPUs per node (with a limit of 5 GPUs per user): #SBATCH --gres=gpu:Y
  • The total number of tasks is the product of the number of nodes and the number of GPUs per node: #SBATCH --ntasks=X·Y
  • We have to distribute the tasks across the nodes, i.e. (X·Y)/X tasks per node: #SBATCH --tasks-per-node=Y
  • This configuration is only available on nodes connected to the InfiniBand network, so it only works on Intel nodes: #SBATCH --constraint=intel
  • We have to discard the Ethernet interfaces using the following command, where "-np" is the total number of processes: mpirun -np X·Y --mca oob_tcp_if_exclude docker0,lo,eth0,eth1,eth2,eth3 -bind-to none -mca btl "openib,self,vader" --mca pml ob1 -x NCCL_DEBUG=INFO -x NCCL_P2P_DISABLE=1
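The task bookkeeping in the points above reduces to simple arithmetic; a minimal sketch in shell (the variable names are illustrative, not cluster-specific):

```shell
# Derive the Slurm task counts from X nodes and Y GPUs per node
NODES=2                                   # X: number of nodes
GPUS_PER_NODE=2                           # Y: GPUs per node
NTASKS=$((NODES * GPUS_PER_NODE))         # --ntasks = X*Y
TASKS_PER_NODE=$((NTASKS / NODES))        # --tasks-per-node = (X*Y)/X = Y
echo "--ntasks=$NTASKS --tasks-per-node=$TASKS_PER_NODE"
```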

In this example, we have configured the Horovod framework:

First of all, we have opened an interactive session on a node with a GPU:

test@login01:~$ salloc --gres=gpu:1 

Then, we have downloaded the framework to our home directory:

test@node032:~$ git clone https://github.com/uber/horovod.git

After that, we have loaded the libraries needed to install and configure the packages. Since Horovod is a distributed training framework for TensorFlow, Keras and PyTorch, we have to load one of these modules to use it:

test@node032:~$ source /etc/profile.d/lmod.sh
test@node032:~$ source /etc/profile.d/zz_hpcnow-arch.sh
test@node032:~$ ml load torchvision/0.2.1-foss-2017a-Python-3.6.4-PyTorch-1.1.0
test@node032:~$ ml load Tensorflow-gpu/1.12.0-foss-2017a-Python-3.6.4
test@node032:~$ ml load NCCL/2.3.7-CUDA-9.0.176

We have installed the packages in local-user mode using "pip":

test@node032:~$ cd horovod/examples
test@node032:~/horovod/examples$ export HOROVOD_CUDA_HOME=${CUDA_HOME}
test@node032:~/horovod/examples$ export HOROVOD_NCCL_HOME=${EBROOTNCCL}
test@node032:~/horovod/examples$ HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir --user horovod

We have configured a multi-node task with the following parameters:

  • Number of nodes: 2
  • Tasks on each node: 1
  • Processor architecture: Intel
  • RAM per node: 5G
  • GPUs on each node: 1

#!/bin/bash
# SCRIPT NAME: multiGPU.sh

# Partition type
#SBATCH --partition=high

# Number of nodes
#SBATCH --nodes=2

# Number of tasks. 2 (1 per node)
#SBATCH --ntasks=2

# Number of tasks per node
#SBATCH --tasks-per-node=1

# Memory per node. 5 GB (In total, 10 GB)
#SBATCH --mem=5g

# Number of GPUs per node
#SBATCH --gres=gpu:1

# Select Intel nodes (with Infiniband)
#SBATCH --constraint=intel

# Modules
source /etc/profile.d/lmod.sh
source /etc/profile.d/easybuild.sh 
ml load torchvision/0.2.1-foss-2017a-Python-3.6.4-PyTorch-1.1.0
ml load Tensorflow-gpu/1.12.0-foss-2017a-Python-3.6.4
ml load NCCL/2.3.7-CUDA-9.0.176

# Discard Ethernet networks. We have 2 processes
mpirun -np 2 --mca oob_tcp_if_exclude docker0,lo,eth0,eth1,eth2,eth3 -bind-to none -mca btl "openib,self,vader" --mca pml ob1 -x NCCL_DEBUG=INFO -x NCCL_P2P_DISABLE=1 python -u horovod/examples/tensorflow_synthetic_benchmark.py
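For reference, the individual options in the mpirun line can be read as follows (annotations assume Open MPI; the flags themselves are unchanged from the script):

```shell
# -np 2                           total number of MPI processes (nodes x GPUs per node)
# --mca oob_tcp_if_exclude ...    keep out-of-band TCP traffic off the Docker, loopback and Ethernet interfaces
# -bind-to none                   do not pin processes to cores; let the framework manage its own threads
# -mca btl "openib,self,vader"    point-to-point transports: InfiniBand verbs, self, and shared memory
# --mca pml ob1                   select the ob1 point-to-point messaging layer
# -x NCCL_DEBUG=INFO              export this variable to the launched processes to enable NCCL logging
# -x NCCL_P2P_DISABLE=1           disable direct GPU peer-to-peer copies within a node
```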

Finally, we have submitted the job and verified that it has reserved all the resources:

test@node032:~$ sbatch multiGPU.sh
Submitted batch job 824290
test@node032:~$ squeue | grep "test"
            824290      high multiGPU   test  R       1:01      2 node[031-032]
test@node032:~$ scontrol show job 824290
JobId=824290 JobName=multiGPU.sh
   UserId=test(1243) GroupId=info_users(10376) MCS_label=N/A
   Priority=6000 Nice=0 Account=info QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:03:57 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-10-01T09:24:07 EligibleTime=2019-10-01T09:24:07
   StartTime=2019-10-01T09:24:08 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=high AllocNode:Sid=node032:1247
   ReqNodeList=(null) ExcNodeList=(null)
   NumNodes=2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=5G MinTmpDiskNode=0
   Features=intel DelayBoot=00:00:00
   Gres=gpu:1 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)

In this case, it is running on two Intel nodes with GPU (node031 and node032) and 10 GB (5 GB + 5 GB) of memory.

Another way to monitor the execution is to connect to the allocated nodes and run the "htop" and "nvidia-smi" commands. With "htop", we can see the resources being consumed in real time, while "nvidia-smi" shows our processes running on the GPUs.

However, reading the output file is the best way to check the results:

node032:11589:11639 [0] NCCL INFO comm 0x2aff2c7f44a0 rank 1 nranks 2 - COMPLETE
node031:36278:36329 [0] NCCL INFO comm 0x2b46fc7b2fc0 rank 0 nranks 2 - COMPLETE
node031:36278:36329 [0] NCCL INFO Launch mode Parallel
Running benchmark...
Iter #0: 136.4 img/sec per GPU
Iter #1: 137.9 img/sec per GPU
Iter #2: 138.1 img/sec per GPU
Iter #3: 137.9 img/sec per GPU
Iter #4: 133.8 img/sec per GPU
Iter #5: 133.5 img/sec per GPU
Iter #6: 137.2 img/sec per GPU
Iter #7: 135.2 img/sec per GPU
Iter #8: 138.8 img/sec per GPU
Iter #9: 136.0 img/sec per GPU
Img/sec per GPU: 136.5 +-3.4
Total img/sec on 2 GPU(s): 273.0 +-6.8
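The summary lines are simply the per-GPU mean and spread scaled by the number of GPUs; a quick check in Python (the figures are copied from the log above):

```python
# Recompute the benchmark summary from the per-GPU figures in the log
per_gpu = 136.5      # mean img/sec per GPU over the iterations
spread = 3.4         # reported +- spread per GPU
n_gpus = 2

total = per_gpu * n_gpus
total_spread = spread * n_gpus
print(f"Total img/sec on {n_gpus} GPU(s): {total:.1f} +-{total_spread:.1f}")  # 273.0 +-6.8
```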

GPU performance evaluation

In this section, you can see a comparison of benchmark tests with TensorFlow (using the Horovod framework) that measures intra-node and multi-node performance with different numbers of GPUs.

We can see that when we use a single node, performance scales as expected as we increase the number of GPUs.

On the one hand, the most interesting observation is that with 2 nodes and 2 GPUs per node we achieve better performance than with 4 GPUs on the same node. On the other hand, if we compare 2 nodes with one GPU per node against 2 GPUs on a single node, we get approximately the same result.

  • 2 nodes and 2 GPUs per node: 335.4 img/s
  • 1 node and 4 GPUs: 279 img/s
  • 1 node and 3 GPUs: 240 img/s
  • 1 node and 2 GPUs: 180 img/s
  • 2 nodes and 1 GPU per node: 180 img/s
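To make the comparison concrete, the throughput figures can be normalised against the 1 node / 2 GPUs configuration (a small illustrative script; the numbers are copied from the results above):

```python
# Benchmark throughput (img/s) per configuration, from the results above
results = {
    "2 nodes x 2 GPUs per node": 335.4,
    "1 node x 4 GPUs": 279.0,
    "1 node x 3 GPUs": 240.0,
    "1 node x 2 GPUs": 180.0,
    "2 nodes x 1 GPU per node": 180.0,
}

baseline = results["1 node x 2 GPUs"]
for config, imgs_per_sec in results.items():
    # Show each configuration's speed-up relative to the baseline
    print(f"{config}: {imgs_per_sec:6.1f} img/s ({imgs_per_sec / baseline:.2f}x baseline)")
```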



Note that running multi-node/multi-GPU jobs requires a large number of available GPUs, and unfortunately that is not our case.

Sometimes configuring these kinds of features in your jobs in search of perfect performance can have the opposite effect: a long waiting time in the process queue.

As the cluster grows in number of GPUs, this feature should work better.