CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing architecture developed by NVIDIA. It consists of the CUDA Instruction Set Architecture (ISA) and the parallel compute engine in the NVIDIA GPU (Graphics Processing Unit). The GPU has hundreds of cores that can collectively run thousands of computing threads. This capability complements a conventional CPU: the CPU runs the serial portions of an application, hands off the parallel subtasks to the GPU, and manages the complete set of tasks that make up the overall algorithm. In this model of computing, the best results are generally obtained by minimizing the communication between the CPU (host) and the GPU (device).
In this section, we submit a basic job using the “gres” parameter, which tells Slurm that we want to reserve a GPU resource.
First, we create a job script that uses the gres parameter to reserve one GPU from the cluster:
#!/bin/bash
#SBATCH -J prova_dani_uname10
#SBATCH -p short
#SBATCH -N 1
#SBATCH --chdir=/home/test/gpu_maxwell
#SBATCH --gres=gpu:1
#SBATCH --time=2:00
#SBATCH -o slurm.%N.%J.%u.out # STDOUT
#SBATCH -e slurm.%N.%J.%u.err # STDERR
module load CUDA/11.4.3
./gpu_burn
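A job script like this one can also check that a GPU was really granted before it starts the workload. The following is a minimal sketch, assuming that Slurm exports CUDA_VISIBLE_DEVICES as a comma-separated list of device indices for jobs that request a GPU through gres (the usual behaviour, but worth confirming on your cluster). The function name count_visible_gpus is hypothetical and not part of Slurm or CUDA.

```shell
#!/bin/bash
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
# Assumes a comma-separated list such as "0" or "0,1".
# An optional first argument overrides the environment variable.
# Prints 0 when the list is unset or empty.
count_visible_gpus() {
    devices="${1-${CUDA_VISIBLE_DEVICES-}}"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # The number of comma-separated fields is the number of devices.
    printf '%s\n' "$devices" | awk -F',' '{ print NF }'
}

# Possible use inside the job script, before ./gpu_burn:
#   if [ "$(count_visible_gpus)" -eq 0 ]; then
#       echo "no GPU was allocated to this job" >&2
#       exit 1
#   fi
```

Failing early in this way avoids spending the job's time limit on a run that cannot see any device.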
If we run “scontrol show job” for our job, we can see which node it is running on and whether it is using a GPU resource:
test@login01:/home/test/gpu_maxwell# scontrol show job 945
JobId=945 JobName=prova_dani_uname10
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4670 Nice=0 Account=root QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:07 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2017-12-11T21:04:37 EligibleTime=2017-12-11T21:04:37
   StartTime=2017-12-11T21:04:38 EndTime=2017-12-11T21:06:38 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=short AllocNode:Sid=node020:10105
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node020
   BatchHost=node020
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:2:2
   TRES=cpu=4,mem=4G,node=1,gres/gpu=1
   Socks/Node=1 NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
   Features=(null) Gres=gpu:1 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/test/gpu_maxwell/gpu_maxwell.sh
   WorkDir=/home/test/gpu_maxwell
   StdErr=/home/test/gpu_maxwell/slurm.%N.%J.root.err
   StdIn=/dev/null
   StdOut=/home/test/gpu_maxwell/slurm.%N.%J.root.out
   Power=
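The two fields of interest here are NodeList (where the job runs) and Gres or TRES (whether a GPU is part of the allocation). As a rough sketch, they can be pulled out of the scontrol output with a small filter. The function name scontrol_field is hypothetical, and the sketch assumes that the requested value contains no spaces, which holds for the fields above but not necessarily for every field.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one KEY=value field from
# "scontrol show job" output read on standard input.
# Assumption: the value itself contains no spaces.
scontrol_field() {
    key="$1"
    # Put every KEY=value token on its own line, then print the value of
    # the first token whose key matches exactly.
    tr ' ' '\n' | awk -F'=' -v k="$key" '
        $1 == k { sub(/^[^=]*=/, ""); print; exit }
    '
}

# Possible use on the login node:
#   scontrol show job 945 | scontrol_field NodeList
#   scontrol show job 945 | scontrol_field Gres
```

Matching the key exactly keeps NodeList from being confused with ReqNodeList or ExcNodeList.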
If we log in to node020, we can run nvidia-smi to check that the GPU is running our process:
test@node020:/home/test/gpu_maxwell# nvidia-smi
Mon Dec 11 21:07:23 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  On   | 00000000:41:00.0 Off |                  N/A |
| 22%   47C    P2   219W / 250W |  11001MiB / 12207MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11078      C   ./gpu_burn                                 10988MiB |
+-----------------------------------------------------------------------------+
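The same check can be scripted instead of read off the table. The sketch below assumes the CSV query mode of nvidia-smi (--query-compute-apps with --format=csv,noheader), which prints one line per compute process with fields separated by a comma and a space. The function name gpu_has_process is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "pid, process_name, used_memory" CSV lines on
# standard input and succeed if any process name contains the given text.
gpu_has_process() {
    awk -F', ' -v name="$1" '
        index($2, name) > 0 { found = 1 }
        END { exit !found }
    '
}

# Possible use on the compute node:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#              --format=csv,noheader | gpu_has_process gpu_burn \
#       && echo "gpu_burn is running on a GPU"
```

A substring match is used because the process name may be reported as a relative or an absolute path.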
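In the previous example we ran a GPU stress test (gpu_burn). Remember to load the CUDA module in the job script (module load CUDA/11.4.3 in the script above) so that the CUDA software is available. The output file written by the job shows the result of the test: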
test@login01:/home/test/gpu_maxwell# cat slurm.node020.947.root.out
GPU 0: GeForce GTX TITAN X (UUID: GPU-61bef67d-703c-f4da-60ad-04430a92f69e)
Run length not specified in the command line. Burning for 10 secs
20.0% proc'd: 2711 errors: 0 temps: --
30.0% proc'd: 5422 errors: 0 temps: --
50.0% proc'd: 8133 errors: 0 temps: --
60.0% proc'd: 10844 errors: 0 temps: --
90.0% proc'd: 16266 errors: 0 temps: --
100.0% proc'd: 21688 errors: 0 temps: --
Killing processes.. done
Tested 1 GPUs:
GPU 0: OK
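When many such jobs are submitted, the summary at the end of each output file can be checked automatically. This is only a sketch based on the output format shown above: it looks at the per-GPU lines that follow "Tested N GPUs:" and treats any status other than OK as a failure. The function name gpu_burn_passed is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the gpu_burn output file given as the
# first argument has a "Tested N GPUs:" summary in which every GPU line
# ends in OK.
gpu_burn_passed() {
    awk '
        /^Tested [0-9]+ GPUs:/ { in_summary = 1; next }
        # Only lines after the summary header are results; the "GPU 0: ..."
        # line at the top of the file describes the device instead.
        in_summary && /GPU [0-9]+:/ {
            total++
            if ($NF == "OK") ok++
        }
        END { exit !(total > 0 && ok == total) }
    ' "$1"
}

# Possible use on the login node:
#   gpu_burn_passed slurm.node020.947.root.out && echo "all GPUs OK"
```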
We can be more specific when requesting a GPU resource by indicating the GPU type as well as the count (here, one GPU of type maxwell):
#!/bin/bash
#SBATCH -J prova_dani_uname10
#SBATCH -p short
#SBATCH -N 1
#SBATCH --chdir=/home/test/gpu_maxwell
#SBATCH --gres=gpu:maxwell:1
#SBATCH --time=2:00
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=2
#SBATCH --threads-per-core=2
#SBATCH -o slurm.%N.%J.%u.out # STDOUT
#SBATCH -e slurm.%N.%J.%u.err # STDERR
module load CUDA/11.4.3
./gpu_burn
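The type names that can be requested (such as maxwell) are defined by the cluster administrators, so it helps to list what each node offers before writing the --gres line. As a sketch, the output of sinfo with a node name and GRES format can be filtered for one GPU type. The function name nodes_with_gpu_type is hypothetical, and the filter assumes GRES strings of the form gpu:type:count.

```shell
#!/bin/bash
# Hypothetical helper: read "node gres" pairs on standard input (one per
# line) and print the nodes whose GRES string offers the given GPU type.
# Assumption: GRES strings look like "gpu:maxwell:1".
nodes_with_gpu_type() {
    awk -v t="$1" 'index($2, "gpu:" t ":") > 0 { print $1 }'
}

# Possible use on the login node (%N is the node name, %G its GRES):
#   sinfo -h -N -o "%N %G" | nodes_with_gpu_type maxwell | sort -u
```

Nodes without GPUs, or with a different type, are simply left out of the result.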