Rocket Cluster

About the Cluster

The main part of the Rocket cluster consists of:

  • 135 compute nodes (called stage1 – stage135)
  • 6 compute nodes with GPUs (called falcon1 – falcon6)
  • 20 high-density AMD CPU nodes (called ares1 – ares20)
  • 4 high-memory machines (called bfr1 – bfr4)
  • 12 CPU nodes (called sfr1 – sfr12)
  • a headnode (rocket.hpc.ut.ee).

In addition to these nodes, there are a few GPFS filesystem servers which provide fast storage for the entire cluster.

All the machines mentioned above are connected to a fast Infiniband fabric built on Mellanox switches.

In addition to Infiniband, all of these machines are also connected to a regular Ethernet network for easier access. Depending on need, machines have 1/10/25/40 Gbit/s Ethernet links that provide fast access from the cluster to the University's central network and beyond.

All nodes in the Rocket cluster run an up-to-date CentOS 7.

You can submit your computations to the cluster using SLURM.

Read more about using SLURM

Read more about different limits in place on the Rocket cluster
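
As a rough sketch of what a submission looks like (the partition name, resource requests, and script name below are illustrative placeholders, not taken from the cluster's actual configuration), a computation is normally described in a small batch script and handed to SLURM with sbatch:

    #!/bin/bash
    #SBATCH --job-name=example         # name shown in the queue
    #SBATCH --partition=main           # placeholder partition name
    #SBATCH --ntasks=1                 # a single task
    #SBATCH --cpus-per-task=4          # CPU cores for that task
    #SBATCH --mem=8G                   # memory for the whole job
    #SBATCH --time=02:00:00            # wall-time limit (hh:mm:ss)

    # everything below runs on the allocated compute node
    srun python3 my_analysis.py        # my_analysis.py is a hypothetical user script

The script is submitted from the headnode with "sbatch job.sh", and "squeue -u $USER" shows its place in the queue.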

Hardware

135 nodes, stage1 – stage135 (HP ProLiant SL230s Gen8)

  • 2 x Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz (20 cores total)
  • 64GB RAM
  • 1TB HDD (~860GB usable)
  • 4x QDR Infiniband

4 big memory nodes, BFR1-4 (Lenovo ThinkSystem SR630)

  • 2x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz (40 cores total)
  • 1 TB RAM
  • 8TB of fast SSD temporary space
  • FDR Infiniband, clocked down to 4x QDR for cluster cohesion

20 high-density nodes, Ares 1-20

  • 2x AMD EPYC 7702 64-Core Processor (128 cores total)
  • 1 TB RAM
  • 8TB of fast SSD temporary space
  • HDR Infiniband @ 100 Gbps

12 CPU nodes, SFR1-12 (Lenovo ThinkSystem SR630)

  • 2x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz (40 cores total)
  • 256 GB RAM
  • 8TB of fast SSD temporary space
  • FDR Infiniband, clocked down to 4x QDR for cluster cohesion

6 GPU nodes, falcon1-6, purchased with funding from the Institute of Computer Science:

  • 2 x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (24 cores / 48 threads total)
  • 512 GB RAM
  • 5TB of local SSD storage
  • Infiniband:
    • Falcon 1-3 – 2x 40 Gbps each
    • Falcon 4-6 – 5x 100 Gbps each
  • 24x NVIDIA Tesla V100 GPUs in total:
    • The V100s in falcon1-3 have 16 GB of VRAM each.
    • The V100s in falcon4-6 have 32 GB of VRAM each.

2 GPU nodes with NVIDIA Tesla A100 GPUs:

pegasus.hpc.ut.ee

  • 2 x AMD EPYC 7642 48-Core Processors (96 cores / 192 threads total)
  • 512 GB RAM
  • 1.6TB of local SSD storage
  • 7 x Tesla A100 with 40 GB of VRAM each
  • Infiniband:
    • 1x 200 Gbps connection

pegasus2.hpc.ut.ee

  • 2x AMD EPYC 7713 64-Core Processors (128 cores / 256 threads total)
  • 2 TB RAM
  • 15TB of local SSD storage
  • 8 x Tesla A100 with 80 GB of VRAM each
  • Infiniband:
    • 9x 100 Gbps connections
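
To give a feel for how GPUs on these nodes are requested through SLURM (the partition name and GPU count below are illustrative placeholders, not the cluster's verified configuration), a job asks for GPUs with the --gres option:

    #!/bin/bash
    #SBATCH --job-name=gpu-example
    #SBATCH --partition=gpu            # placeholder partition name
    #SBATCH --gres=gpu:1               # request one GPU on the node
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=64G
    #SBATCH --time=12:00:00

    nvidia-smi                         # check which GPU(s) the job can see
    srun python3 train.py              # train.py is a hypothetical user script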

For details on the large memory machine, please visit the Atlas page.

Storage

The following storage branches are mounted on all machines in the Rocket cluster:

  • /gpfs/rocket – 1.4 PB – Specialised disk-based storage for data, with automatic tape-library migration.
  • /gpfs/hpc – 1.6 PB – Declustered-RAID-based, GPFS-specific high-performance disk storage with a transparent flash tier.
  • /gpfs/space – 5.6 PB – Declustered-RAID-based, GPFS-specific very high-performance disk storage with a transparent flash tier.
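
These branches are ordinary POSIX paths on every node, so jobs read and write data under them directly. A small illustration (the project directory below is hypothetical; real per-user or per-project paths depend on how your access was set up):

    df -h /gpfs/space                          # capacity and current usage of one branch
    mkdir -p /gpfs/space/projects/myproject    # hypothetical project directory
    cd /gpfs/space/projects/myproject
    sbatch job.sh                              # job output then lands on the shared filesystem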