A multi-node/multi-GPU job uses one or more GPUs from different nodes. This configuration is only available on the Broadwell nodes (Intel processors), which are connected to the InfiniBand network. Some of the software and libraries compatible with this technology, all of which appear in the example below, are NCCL, Open MPI and Horovod.
Before writing the script, it is essential to highlight that:
In this example, we have configured the Horovod framework:
First of all, we have connected to a node with a GPU through an interactive session:
test@login01:~$ salloc --gres=gpu:1
Then, we have downloaded the framework to our home directory:
test@node032:~$ git clone https://github.com/uber/horovod.git
After that, we have loaded the libraries needed to install and configure the packages. Since Horovod is a framework for TensorFlow, Keras or PyTorch, we have to load one of these modules to use it:
test@node032:~$ source /etc/profile.d/lmod.sh
test@node032:~$ source /etc/profile.d/zz_hpcnow-arch.sh
test@node032:~$ ml load torchvision/0.2.1-foss-2017a-Python-3.6.4-PyTorch-1.1.0
test@node032:~$ ml load Tensorflow-gpu/1.12.0-foss-2017a-Python-3.6.4
test@node032:~$ ml load NCCL/2.3.7-CUDA-9.0.176
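Optionally, before building Horovod, we can check that the loaded PyTorch module actually sees the GPU inside the interactive allocation. The following is a minimal sketch; the file name check_gpu.py is only illustrative and not part of the original walkthrough:

# check_gpu.py - optional sanity check, run inside the interactive GPU allocation
import torch

print(torch.__version__)          # should match the loaded module (1.1.0)
print(torch.cuda.is_available())  # True if the GPU granted by --gres is visible
print(torch.cuda.device_count())  # number of GPUs in this allocation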
We have installed the package in user mode with "pip", exporting the Horovod build variables first so that they reach the pip build:
test@node032:~$ cd horovod/examples
test@node032:~/horovod/examples$ export HOROVOD_CUDA_HOME=${CUDA_HOME}
test@node032:~/horovod/examples$ export HOROVOD_NCCL_HOME=${EBROOTNCCL}
test@node032:~/horovod/examples$ export HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1
test@node032:~/horovod/examples$ HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir --user horovod
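Once the installation finishes, a quick way to confirm that the user-local Horovod build imports and initializes correctly is a short test script (the file name check_horovod.py is only illustrative). Run standalone it reports a single rank; launched with mpirun it reports one rank per process:

# check_horovod.py - optional sanity check of the Horovod installation
import horovod.torch as hvd

hvd.init()
print("rank", hvd.rank(), "of", hvd.size(), "- local rank", hvd.local_rank())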
We have configured a multi-node task with the following parameters:
#!/bin/bash
# SCRIPT NAME: multiGPU.sh

# Partition type
#SBATCH --partition=high
# Number of nodes
#SBATCH --nodes=2
# Number of tasks: 2 (1 per node)
#SBATCH --ntasks=2
# Number of tasks per node
#SBATCH --tasks-per-node=1
# Memory per node: 5 GB (10 GB in total)
#SBATCH --mem=5g
# Number of GPUs per node
#SBATCH --gres=gpu:1
# Select Intel nodes (with InfiniBand)
#SBATCH --constraint=intel

# Modules
source /etc/profile.d/lmod.sh
source /etc/profile.d/easybuild.sh
ml load torchvision/0.2.1-foss-2017a-Python-3.6.4-PyTorch-1.1.0
ml load Tensorflow-gpu/1.12.0-foss-2017a-Python-3.6.4
ml load NCCL/2.3.7-CUDA-9.0.176

# Discard Ethernet networks. We have 2 processes
mpirun -np 2 --mca oob_tcp_if_exclude docker0,lo,eth0,eth1,eth2,eth3 \
    -bind-to none -mca btl "openib,self,vader" --mca pml ob1 \
    -x NCCL_DEBUG=INFO -x NCCL_P2P_DISABLE=1 \
    python -u horovod/examples/tensorflow_synthetic_benchmark.py
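The job script above runs the official tensorflow_synthetic_benchmark.py example shipped with Horovod. If you want to adapt your own training code instead, the sketch below shows the usual Horovod pattern with PyTorch (the file name, toy model and synthetic data are purely illustrative): initialize Horovod, pin one GPU per local rank, scale the learning rate, wrap the optimizer, and broadcast the initial state from rank 0.

# my_horovod_train.py - illustrative sketch of the standard Horovod pattern
import torch
import torch.nn as nn
import torch.optim as optim
import horovod.torch as hvd

hvd.init()                                   # one process per GPU, launched by mpirun
torch.cuda.set_device(hvd.local_rank())      # pin each rank to its own GPU

model = nn.Linear(100, 10).cuda()            # toy model for illustration
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale lr by world size

# Wrap the optimizer so gradients are averaged across ranks (NCCL allreduce)
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Make sure every rank starts from the same weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.MSELoss()
for step in range(10):
    x = torch.randn(32, 100).cuda()          # synthetic batch, one per rank
    y = torch.randn(32, 10).cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if hvd.rank() == 0:
        print("step", step, "loss", loss.item())

A script like this would simply replace the python -u horovod/examples/tensorflow_synthetic_benchmark.py part of the mpirun command, keeping the rest of the job script unchanged.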
Finally, we have submitted the job and verified that it has reserved all the requested resources:
test@node032:~$ sbatch multiGPU.sh
Submitted batch job 824290
test@node032:~$ squeue | grep "test"
824290      high  multiGPU     test  R       1:01      2  node[031-032]
test@node032:~$ scontrol show job 824290
JobId=824290 JobName=multiGPU.sh
   UserId=test(1243) GroupId=info_users(10376) MCS_label=N/A
   Priority=6000 Nice=0 Account=info QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:03:57 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-10-01T09:24:07 EligibleTime=2019-10-01T09:24:07
   StartTime=2019-10-01T09:24:08 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=high AllocNode:Sid=node032:1247
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[031-032] BatchHost=node031
   NumNodes=2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=10G,node=2,gres/gpu=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=5G MinTmpDiskNode=0
   Features=intel DelayBoot=00:00:00
   Gres=gpu:1 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/homedtic/test/multiGPU.sh
   WorkDir=/homedtic/test
   StdErr=/homedtic/test/slurm-824290.out
   StdIn=/dev/null
   StdOut=/homedtic/test/slurm-824290.out
   Power=
In this case, the job is running on two Intel nodes with one GPU each (node031 and node032) and 10 GB of memory in total (5 GB per node).
Another way to monitor the execution is to connect to the allocated nodes and run the "htop" and "nvidia-smi" commands: "htop" shows the resources being consumed in real time, while "nvidia-smi" shows our processes running on the GPUs.
However, reading the output file is the best way to check the results:
[...]
node032:11589:11639 [0] NCCL INFO comm 0x2aff2c7f44a0 rank 1 nranks 2 - COMPLETE
node031:36278:36329 [0] NCCL INFO comm 0x2b46fc7b2fc0 rank 0 nranks 2 - COMPLETE
node031:36278:36329 [0] NCCL INFO Launch mode Parallel
Running benchmark...
Iter #0: 136.4 img/sec per GPU
Iter #1: 137.9 img/sec per GPU
Iter #2: 138.1 img/sec per GPU
Iter #3: 137.9 img/sec per GPU
Iter #4: 133.8 img/sec per GPU
Iter #5: 133.5 img/sec per GPU
Iter #6: 137.2 img/sec per GPU
Iter #7: 135.2 img/sec per GPU
Iter #8: 138.8 img/sec per GPU
Iter #9: 136.0 img/sec per GPU
Img/sec per GPU: 136.5 +-3.4
Total img/sec on 2 GPU(s): 273.0 +-6.8
In this section, you can see a comparison of TensorFlow benchmark tests (using the Horovod framework) showing the intra-node and multi-node performance with different numbers of GPUs.
We can see that, on a single node, performance scales as expected as we increase the number of GPUs.
On the one hand, the most interesting observation is that, with 2 nodes and 2 GPUs per node, we achieve better performance than with 4 GPUs on the same node. On the other hand, if we compare 2 nodes with one GPU per node against 2 GPUs on a single node, we get approximately the same result.
Configuration               | Throughput
2 nodes and 2 GPUs per node | 335.4 img/s
1 node and 4 GPUs           | 279 img/s
1 node and 3 GPUs           | 240 img/s
1 node and 2 GPUs           | 180 img/s
2 nodes and 1 GPU per node  | 180 img/s
Note that running multi-node/multi-GPU jobs requires a large number of available GPUs, which unfortunately is not our case.
Sometimes, configuring these features in your jobs in pursuit of the best possible performance can have the opposite effect: a long waiting time in the job queue.
As the number of GPUs in the cluster grows, this feature should work better.