Linux HPC provides different kinds of HPC resources to its users. These mainly consist of different compute queues (i.e. SLURM partitions), some of which provide different hardware capabilities. In addition, there are two separate parallel filesystems. These are meant as scratch space for application I/O when loading or generating data.
There are four partitions available in Linux HPC: a short and a long partition for each of the two node types, batch and inf.
The batch partitions are older nodes with an iWARP-based low-latency Ethernet interconnect, while inf partitions have InfiniBand.
Short partitions are meant for shorter, more interactive runs. We recommend that you use short partitions mostly for trying out your application and for basic performance or scalability testing. Long partitions are meant for heavier, longer-running jobs, once you are confident that your application will work on a larger number of nodes or will be stable enough to run for extended periods of time. However, that is just a guideline, and you may of course run on any partition. In particular, if one partition is full and the other has free resources, you may submit to the free one, as long as your job fits its time limit.
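The guideline above can be sketched as a small helper that picks a partition from the requested run time and builds an sbatch command line. This is only an illustration: the partition names of the form `<family>-short` and `<family>-long` and the two-hour cutoff are assumptions, not values taken from this page. Check `sinfo` for the actual partition names and time limits.

```python
from datetime import timedelta

# Hypothetical cutoff: the real short-partition time limit is not stated here.
ASSUMED_SHORT_LIMIT = timedelta(hours=2)


def pick_partition(walltime, family="inf", short_limit=ASSUMED_SHORT_LIMIT):
    """Return '<family>-short' if the job fits the short limit, else '<family>-long'.

    The '<family>-short' / '<family>-long' naming scheme is an assumption.
    """
    if walltime <= timedelta(0):
        raise ValueError("walltime must be positive")
    suffix = "short" if walltime <= short_limit else "long"
    return f"{family}-{suffix}"


def format_walltime(walltime):
    """Format a timedelta as D-HH:MM:SS, a format accepted by sbatch --time."""
    total = int(walltime.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


def sbatch_command(script, walltime, nodes=1, family="inf"):
    """Build the sbatch argument list without submitting anything."""
    return [
        "sbatch",
        f"--partition={pick_partition(walltime, family)}",
        f"--time={format_walltime(walltime)}",
        f"--nodes={nodes}",
        script,
    ]
```

The returned list can be passed to `subprocess.run` on a submit node. Keeping the cutoff as a parameter makes it easy to adjust once the real limits are known.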
Note that nodes within each partition are homogeneous.
Batch partition nodes
- CPU: 2x Intel(R) Xeon(R) CPU E5-2650 v2 (16 physical cores, 32 hyperthreaded)
- Memory: 128GB DDR3 1600MHz (8x 16GiB M393B2G70QH0-YK0 DIMMs)
- Network: Chelsio T520-LL-CR (low-latency 10Gbit Ethernet; iWARP)
- Hyperconverged CephFS for /hpcscratch (over 10GbE)
- 2TB HGST HUS724020AL for local scratch
BE partition nodes (inf partitions)
- CPU: 2x Intel(R) Xeon(R) CPU E5-2630 v3 (20 physical cores, 40 hyperthreaded)
- Memory: 128GB DDR4 2400MHz (8x 16GiB 18ASF2G72PDZ-2G3B1 DIMMs)
- InfiniBand interconnect, Mellanox MT27500 ConnectX-3
- Integrated Intel 10Gbit Ethernet for storage interconnect, system services
- Hyperconverged CephFS for /hpcscratch (over 10GbE)
- 960GB Intel S3520 SATA3 for local scratch
I/O Scratch spaces
I/O scratch spaces for project data are all based on CephFS. The main scratch space is /hpcscratch, which is where user home directories and project directories are located. Parallel programs are expected to perform I/O on this space as no tokens are required.
The home and scratch space area /hpcscratch is a hyperconverged CephFS cluster. This means that the compute nodes of the inf partitions also act as the storage nodes for /hpcscratch data. This has two immediate consequences. First, I/O access is faster because the data is closer (especially compared to the "old" /hpcscratch), and the service is more resilient to datacenter network incidents. Second, if you are measuring the performance of highly CPU-optimized codes running at 100% utilization, you may observe some noise or performance variability, because another user's I/O-intensive job may cause a CephFS process co-located with your job to compete for CPU. The same applies to I/O performance. For most users this will not be noticeable, but it may become visible in sensitive measurements.
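When benchmarking under these conditions, one way to account for such noise is to repeat the measurement and report the median and spread rather than a single run. The following is a minimal sketch, not a prescribed method. The function names and the choice of five repeats are illustrative assumptions.

```python
import statistics
import time


def time_repeats(func, repeats=5):
    """Run func() several times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return median, min, max and relative spread, (max - min) / median."""
    median = statistics.median(durations)
    spread = (max(durations) - min(durations)) / median if median > 0 else 0.0
    return {
        "median": median,
        "min": min(durations),
        "max": max(durations),
        "spread": spread,
    }
```

A large relative spread between otherwise identical runs suggests interference from co-located activity rather than a property of the code being measured.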
Note that while running applications installed on AFS or EOS is fully supported, writing program outputs directly to AFS or EOS is not supported. Users are expected to transfer result files from the scratch space to EOS.
If your application does local caching or I/O on each worker node, it is recommended to use a local disk such as /tmp for that I/O, and to copy only results and snapshots back to the shared scratch space.
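The pattern of working on node-local disk and copying only the results back can be sketched as follows. This is an illustration under assumptions: `work` is a hypothetical callable that writes into the directory it is given and returns the paths worth keeping, and `results_dir` stands for a directory you own on the shared scratch space.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_local_scratch(work, results_dir, local_root="/tmp"):
    """Run work(scratch_dir) on node-local disk, then copy its results out.

    work is a hypothetical callable: it writes files into scratch_dir and
    returns an iterable of the paths that should be kept.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # The temporary directory is deleted on exit, so caches and other
    # intermediate files never reach the shared filesystem.
    with tempfile.TemporaryDirectory(dir=local_root) as scratch:
        outputs = work(Path(scratch))
        for path in outputs:
            target = results_dir / Path(path).name
            shutil.copy2(path, target)
            copied.append(target)
    return copied
```

Only the files returned by `work` are copied. Everything else written under the local directory is discarded when the function returns.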
IMPORTANT: Please note that while the scratch space provides redundancy to prevent data loss, there are NO BACKUPS.