HPC High Performance Computing: 10.4. Running Singularity Containers Using GPUs

Interactive use

If we want to run Singularity containers that use GPUs, two conditions must be met:

1. Request the GPU resources needed for our container from the queue manager.

2. Use an image that is built with GPU support.

To do that in an interactive session:

First, we run an interactive allocation from the login node so that we are placed on a compute node. For example, if we need two GPUs, we specify the number of GPUs with the "-g" parameter:

test@ohpc:~$ salloc -g 2
salloc: Pending job allocation 459683
salloc: job 459683 queued and waiting for resources
salloc: job 459683 has been allocated resources
salloc: Granted job allocation 459683
test@node031:~$ module load CUDA/11.4.3
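
The "-g" shortcut may be specific to this cluster's Slurm configuration. On a stock Slurm installation the equivalent request uses the generic resources option; a minimal sketch, assuming the compute nodes expose their GPUs as a "gpu" GRES:

# Standard Slurm syntax: request one node with two GPUs (equivalent of "salloc -g 2" above)
salloc -N 1 --gres=gpu:2
# Once placed on the compute node, load the CUDA module as before
module load CUDA/11.4.3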

Then, we can run commands using the image:

test@node021:~$ singularity run --nv /soft/singularity/tensorflow_20.08-tf2-py3.sif python -c "import tensorflow as tf; print('Num GPUs Available: ',len(tf.config.experimental.list_physical_devices('GPU'))); print('Tensorflow version: ',tf.__version__)"

================
== TensorFlow ==
================

NVIDIA Release 20.08-tf2 (build 15413358)
TensorFlow Version 2.2.0

Container image Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2020 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

Detected MOFED .

2022-02-22 17:05:02.237591: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2022-02-22 17:05:04.511546: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2022-02-22 17:05:04.547831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:db:00.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2022-02-22 17:05:04.547887: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2022-02-22 17:05:04.709196: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2022-02-22 17:05:04.770208: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2022-02-22 17:05:04.840130: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2022-02-22 17:05:04.974522: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2022-02-22 17:05:05.027757: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2022-02-22 17:05:05.029479: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2022-02-22 17:05:05.031163: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
Num GPUs Available:  1
Tensorflow version:  2.2.0
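
For anything longer than a one-liner, the Python code can live in its own file and be executed inside the container with singularity exec. A minimal sketch, assuming the same image path and a hypothetical script name gpu_check.py in the current directory:

# Hypothetical helper script: write a small GPU check to gpu_check.py
cat > gpu_check.py << 'EOF'
import tensorflow as tf
# List the GPUs that TensorFlow can see inside the container
gpus = tf.config.experimental.list_physical_devices('GPU')
print('Num GPUs Available: ', len(gpus))
print('Tensorflow version: ', tf.__version__)
EOF

# Run the script inside the container; --nv exposes the host NVIDIA driver and GPUs
singularity exec --nv /soft/singularity/tensorflow_20.08-tf2-py3.sif python gpu_check.py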

Sbatch use

In sbatch mode we have to prepare a script that requests the GPU resources (the same as in interactive mode, but in this case specified in the script).

#!/bin/bash
#SBATCH -J tensorflow_sim          # job name
#SBATCH -p high                    # partition
#SBATCH -N 1                       # one node
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --chdir=/homedtic/user/    # working directory
#SBATCH -o slurm.%N.%J.%u.out      # STDOUT
#SBATCH -e slurm.%N.%J.%u.err      # STDERR

module load CUDA

singularity run --nv /soft/singularity/tensorflow_20.08-tf2-py3.sif python -c "import tensorflow as tf; print('Num GPUs Available: ',len(tf.config.experimental.list_physical_devices('GPU'))); print('Tensorflow version: ',tf.__version__)"

We submit the script with sbatch:

test@ohpc:$ sbatch slurm_singularity_tensorflow.sh
Submitted batch job 9144

and we will find the same results as in interactive mode:

$ cat slurm.node022.9144.user.out
================
== TensorFlow ==
================
NVIDIA Release 20.08-tf2 (build 15413358)
TensorFlow Version 2.2.0
 
Container image Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2020 The TensorFlow Authors.  All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
Detected MOFED .
Num GPUs Available:  1
Tensorflow version:  2.2.0
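
If the job needs data that lives outside the home directory, the directory can be mounted inside the container with Singularity's --bind option, and more GPUs are requested by raising the --gres count. A sketch of such a variant, assuming a hypothetical /scratch/user/data directory on the compute nodes:

#!/bin/bash
#SBATCH -J tensorflow_sim
#SBATCH -p high
#SBATCH -N 1
#SBATCH --gres=gpu:2               # two GPUs instead of one
#SBATCH --chdir=/homedtic/user/
#SBATCH -o slurm.%N.%J.%u.out      # STDOUT
#SBATCH -e slurm.%N.%J.%u.err      # STDERR

module load CUDA

# --bind makes the (hypothetical) host directory visible inside the container as /data
singularity run --nv --bind /scratch/user/data:/data \
    /soft/singularity/tensorflow_20.08-tf2-py3.sif \
    python -c "import tensorflow as tf; print('Num GPUs Available: ', len(tf.config.experimental.list_physical_devices('GPU')))"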