High Performance Computing (HPC): Systems Overview

Resources overview

Summary

The cluster comprises the following main resources:

  • Compute Nodes: 28
  • CPU Architectures: 1
  • CPU Cores: ~1,400
  • Memory: ~4.5 TB
  • GPUs: 40
  • CUDA Cores: 129,024
  • Tensor Cores: 9,856
  • Filesystem Size: ~400 TB

Cluster diagram

Hardware Resources

  • The system has one main CPU architecture: Intel.

  • BeeGFS parallel filesystem storage: a parallel cluster file system with a strong focus on performance, designed for I/O-intensive workloads.
  • A set of NVIDIA Graphics Processing Units (GPUs).

Storage 

  • Manufacturer: Lenovo S3200
  • Expansion shelves: E1024 SFF and E1012 LFF
  • Connectivity: Infiniband
  • Data:
    • 57 x 3.5" 10TB 7.2k NL-SAS HDD
    • 11 x 2.5" 400GB SSD
  • Metadata:
    • 46 x 2.5" 800GB SSD
  • Controllers (2 servers):
    • CPU: 2x Intel Xeon Processor E5-2630 v4
    • RAM: 128 GB
    • Disk: 900GB 10K 12Gbps SAS 2.5” G3HS HDD
    • Connectivity: 
      • 2 x 10G (Broadcom NetXtreme Dual Port 10GbE SFP)
      • 1 x 40Gbps (Mellanox ConnectX-3 Pro ML2 2x40GbE/FDR VPI)
    • Filesystem type: BeeGFS
    • NAS Services: NFS, Samba
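
As a sanity check, the raw capacity of the data disks listed above can be totalled; the usable filesystem size (~400 TB in the Summary) is lower than this raw figure because of RAID and filesystem overhead:

```python
# Rough raw-capacity check for the data disks listed above:
# 57 x 10 TB NL-SAS HDDs plus 11 x 400 GB SSDs. Usable BeeGFS
# capacity (~400 TB) is lower after RAID and filesystem overhead.
hdd_gb = 57 * 10_000      # 570,000 GB of NL-SAS HDD
ssd_gb = 11 * 400         # 4,400 GB of SSD
total_gb = hdd_gb + ssd_gb
print(f"Raw data capacity: {total_gb / 1000:.1f} TB")  # 574.4 TB
```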

Computational Servers

Nodes 001 to 018

  • Manufacturer: Lenovo nx360 M5 (1/2U)
  • CPU: 2x Intel Xeon E5-2650 v4, 12 cores, 2.2GHz 30MB Cache 2400MHz 105W
  • RAM: 96GB
  • Disk: 240GB SSD
  • Connectivity: 56 Gbps (Infiniband Mellanox ConnectX-3 Pro ML2 2x40GbE/FDR VPI Adapter)

Nodes 019 to 022

  • Manufacturer: SuperMicro (2U)
  • CPU: 2 x Intel Xeon Cascade Lake-SP 4214 at 2.2 GHz (3.2 GHz in turbo mode), 12 cores
  • RAM: 192GB DDR4
  • Disk: 240GB SSD - 6Gb/s
  • Connectivity: 56 Gbps (Infiniband Mellanox ConnectX-3 Pro ML2 2x40GbE/FDR VPI Adapter)

Node 023

  • Manufacturer: SuperMicro (2U)
  • CPU: 2 x Intel Xeon Silver 4210R at 2.4 GHz (3.2 GHz in turbo mode), 10 cores
  • RAM: 223GB DDR4
  • Disk: 240GB SSD
  • Connectivity: 56 Gbps (Infiniband Mellanox ConnectX-3 Pro ML2 2x40GbE/FDR VPI Adapter)

Nodes 024 to 026

  • Manufacturer: DELL (2U)
  • CPU: 2 x Intel Xeon Silver 4216 at 2.1 GHz (3.2 GHz in turbo mode), 16 cores
  • RAM: 770 / 642 GB DDR4
  • Disk: 446GB SSD
  • Connectivity: 56 Gbps (Infiniband Mellanox ConnectX-3)

Nodes 031 to 032

  • Manufacturer: GigaByte (2U)
  • CPU: 2 x Intel Xeon Broadwell E5-2690 v4 at 2.6 GHz, 14 cores
  • RAM: 128 / 256 GB DDR4/2666 Mhz ECC Reg.
  • Disk: 512 GB SSD Micron M1100, 92,000 / 83,000 IOPS R/W, 6 Gb/s SATA interface
  • Connectivity: 56 Gbps (Infiniband Mellanox ConnectX-3)
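
The node and core totals in the Summary can be cross-checked from the per-node specs above. The ~1,400 "CPU Cores" figure appears to count hardware threads (two per physical core with Hyper-Threading) rather than physical cores; this is an assumption, since the Summary does not say which it counts:

```python
# Cross-check of the Summary totals from the per-node specs above.
# Each entry: node range -> (node count, sockets, cores per socket).
nodes = {
    "001-018": (18, 2, 12),   # Xeon E5-2650 v4
    "019-022": (4, 2, 12),    # Xeon 4214
    "023":     (1, 2, 10),    # Xeon Silver 4210R
    "024-026": (3, 2, 16),    # Xeon Silver 4216
    "031-032": (2, 2, 14),    # Xeon E5-2690 v4
}
node_total = sum(n for n, _, _ in nodes.values())
core_total = sum(n * s * c for n, s, c in nodes.values())
print(node_total)       # 28 nodes, matching the Summary
print(core_total)       # 700 physical cores
print(core_total * 2)   # 1400 hardware threads, i.e. the ~1,400 figure
```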

Networking

Low latency network

  • Manufacturer: Mellanox SX6036 (1U)
  • Ports: 36 x 56 Gbps (FDR ports)

Management network

  • Manufacturer: Lenovo RackSwitch G8052 (1U)
  • Ports: 48 x 1 Gbps (RJ-45 ports)

GPU Accelerated Computing

Graphics Processor Unit

  • Manufacturer: NVIDIA
  • Models:
    • 6 x Quadro RTX 6000 (Turing), 24 GB
    • 4 x Zotac GeForce GTX 1080 Ti (Pascal), 12.80 GB
    • 10 x Gigabyte GeForce GTX 1080 Ti TURBO (Pascal), 11.72 GB
    • 20 x Tesla T4 (Turing), 15.84 GB
  • Architectures:
    • Pascal / Turing
  • Total GPUs: 40
  • Total CUDA Cores: 129,024
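
The CUDA and Tensor core totals in the Summary follow from NVIDIA's published per-model core counts and the GPU counts above:

```python
# Cross-check of the GPU totals using NVIDIA's published per-model
# counts. Each entry: model -> (units, CUDA cores, Tensor cores).
gpus = {
    "Quadro RTX 6000":     (6, 4608, 576),
    "GeForce GTX 1080 Ti": (14, 3584, 0),   # 4 Zotac + 10 Gigabyte; Pascal has no Tensor cores
    "Tesla T4":            (20, 2560, 320),
}
gpu_total = sum(n for n, _, _ in gpus.values())
cuda_total = sum(n * c for n, c, _ in gpus.values())
tensor_total = sum(n * t for n, _, t in gpus.values())
print(gpu_total, cuda_total, tensor_total)  # 40 129024 9856
```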

Limits

Fair share of Resources

The goal is to ensure that access to the resources is shared fairly among all users, so that no single user can monopolize the cluster and keep other users' jobs waiting. To achieve this, the scheduler calculates quotas and applies limits based on several factors:
  • Per-user job limit: a single user can run only a given number of jobs simultaneously.
  • Quota limit: every job uses resources, and resources consume quota. Once a user's quota reaches zero, that user cannot run any more jobs. Users with more remaining quota have higher priority when running jobs, since a lower quota implies heavier prior resource usage. Quotas are recalculated every two weeks, taking prior cluster usage into account. See section 'Queues' for more information.
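
The quota mechanism described above can be sketched as follows. This is illustrative only: the cost model, resource weights, and user names are assumptions, not the scheduler's actual formula.

```python
# Illustrative sketch of quota-based fair share (NOT the scheduler's
# real formula): each job consumes quota in proportion to the
# resources it allocates, and users with more remaining quota get
# higher priority for their next job.
def job_cost(cpus: int, ram_gb: int, gpus: int, hours: float) -> float:
    """Hypothetical cost model: weight each resource, scale by runtime."""
    return (cpus + ram_gb / 4 + 10 * gpus) * hours

# Hypothetical users, each starting the period with the same quota.
quota = {"alice": 10_000.0, "bob": 10_000.0}
quota["alice"] -= job_cost(cpus=48, ram_gb=192, gpus=2, hours=24)  # heavy job
quota["bob"]   -= job_cost(cpus=8,  ram_gb=32,  gpus=0, hours=2)   # light job

# The user with the most remaining quota is scheduled first.
priority_order = sorted(quota, key=quota.get, reverse=True)
print(priority_order)  # ['bob', 'alice']: bob used less, so he goes first
```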

A user can allocate simultaneously up to:

  • Number of CPUs: 300
  • Amount of RAM: 512 GB
  • Number of GPUs: 5
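
A request can be checked against these per-user allocation limits before submission. This is a minimal sketch; the function and dictionary names are hypothetical, not part of any cluster tooling.

```python
# Simple pre-submission check against the per-user limits above.
# Names here (LIMITS, fits_limits) are hypothetical, for illustration.
LIMITS = {"cpus": 300, "ram_gb": 512, "gpus": 5}

def fits_limits(request: dict) -> bool:
    """Return True if every requested resource is within its limit."""
    return all(request.get(key, 0) <= cap for key, cap in LIMITS.items())

print(fits_limits({"cpus": 128, "ram_gb": 256, "gpus": 4}))  # True
print(fits_limits({"cpus": 350, "ram_gb": 256, "gpus": 2}))  # False: over 300 CPUs
```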