Why HPC and Cloud-Native Don't Live in Isolation Anymore
Modern AI training workloads sit at an uncomfortable intersection. On one hand, HPC clusters with Slurm offer tightly coupled bare-metal scheduling, NUMA-aware placement, MPI job management, and decades of operational maturity. On the other hand, the Kubernetes ecosystem delivers container portability, GPU operator automation, operator-based distributed training (Kubeflow, PyTorchJob), and a vibrant MLOps toolchain.
The trap organizations fall into is treating these as competing philosophies. The real opportunity is to use them together — Slurm owns the outer job lifecycle, resource reservation, and policy enforcement; Kubernetes owns the inner workload containerization and AI framework orchestration. The result? You get the scheduling determinism of HPC with the flexibility of cloud-native AI stacks.
The Stack at a Glance
| Layer | Technology | Role | Why It's Here |
|---|---|---|---|
| Job Scheduler | Slurm | Outer orchestration | HPC Native |
| Container Orchestration | Kubernetes | Workload runtime | Cloud-Native |
| AI Training Framework | PyTorchJob / Kubeflow | Distributed training CRD | Ecosystem |
| Network Fabric | RoCEv2 (RDMA over Converged Ethernet) | GPU-to-GPU data path | Ultra-low latency |
| Parallel Filesystem | IBM Spectrum Scale (GPFS) | Shared model & dataset storage | POSIX Compliant |
| GPU Plugin | NVIDIA GPU Operator | Device management in K8s | Automation |
Cluster Topology & Hardware Configuration
Before touching a single YAML file, the physical and logical topology must be right. Here's a representative production cluster design:
Node Roles
Login / Head Nodes (2×)
Slurm control daemon (slurmctld) + Kubernetes API server co-located. Users submit jobs here. These nodes see no GPU workloads directly.
Compute Nodes — GPU Workers (N×)
Slurm worker (slurmd) + Kubernetes worker (kubelet) running simultaneously. Each node exposes 8× NVIDIA H100 or AMD MI300X GPUs. Connected via RoCEv2 NICs (100GbE / 400GbE).
Storage Nodes — IBM Spectrum Scale
Dedicated NSD (Network Shared Disk) servers running GPFS daemons. Exposed over InfiniBand or high-speed Ethernet to all compute nodes. Mounted at /gpfs/projects.
Network Infrastructure
Spine-leaf Ethernet topology with DCQCN (Data Center Quantized Congestion Notification) enabled for RoCEv2 lossless fabric. Priority Flow Control (PFC) on GPU traffic class.
Building the Ultra-Low Latency GPU Interconnect Fabric
RoCEv2 (RDMA over Converged Ethernet v2) is the linchpin of multi-node GPU training performance. It enables GPU memory to be read/written directly across the network, bypassing the CPU — cutting latency to microseconds and enabling NCCL collective operations (AllReduce, AllGather) to approach InfiniBand speeds over standard Ethernet infrastructure.
Switch Configuration — Enabling Lossless Ethernet
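The exact commands depend on the switch NOS; the sketch below uses NVIDIA Onyx-style syntax and assumes RoCE traffic rides switch priority 3. Treat it as a checklist of what must be enabled, not copy-paste configuration.

```shell
# Illustrative lossless-Ethernet settings for a RoCEv2 leaf switch
# (NVIDIA Onyx-style syntax; adapt to your NOS). Assumes RoCE traffic
# is mapped to priority / traffic class 3.

# Trust L3 (DSCP) markings from hosts so RoCE traffic lands in its class
interface ethernet 1/1 qos trust L3

# Enable PFC globally, then per-port, for the RoCE priority
dcb priority-flow-control enable force
interface ethernet 1/1 dcb priority-flow-control mode on force

# Enable ECN marking on the RoCE queue -- this is what triggers DCQCN
# rate control on the NICs before PFC ever has to fire
interface ethernet 1/1 traffic-class 3 congestion-control ecn \
    minimum-absolute 150K maximum-absolute 1500K
```

The design intent: ECN/DCQCN does the day-to-day congestion control, and PFC is the last-resort backstop that guarantees zero loss.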
Host-Side NIC Configuration (Mellanox / ConnectX)
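On the host side, a ConnectX NIC can be aligned with that fabric class using the Mellanox userspace tools. Interface and device names (eth0, mlx5_0) are placeholders for your environment.

```shell
# Trust DSCP and enable PFC on priority 3 to match the switch config
mlnx_qos -i eth0 --trust dscp --pfc 0,0,0,1,0,0,0,0

# Force RoCEv2 (not v1) for RDMA-CM connections on this device
cma_roce_mode -d mlx5_0 -p 1 -m 2

# Mark RoCE traffic with TOS 106 (DSCP 26) so switches classify it
cma_roce_tos -d mlx5_0 -t 106
echo 106 > /sys/class/infiniband/mlx5_0/tc/1/traffic_class

# Sanity check with the perftest package before any training run:
#   server: ib_write_bw -d mlx5_0 --report_gbits
#   client: ib_write_bw -d mlx5_0 --report_gbits <server-ip>
```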
NCCL Environment Variables for RoCEv2
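A minimal environment for steering NCCL onto the RoCEv2 path might look like the following. The HCA names and GID index are site-specific; verify the RoCEv2 GID index on your NICs with `show_gids` before trusting index 3.

```shell
# Point NCCL at the RoCE HCAs and the RoCEv2 GID index
# (v2 GIDs are typically index 3 on ConnectX NICs; check show_gids)
export NCCL_IB_HCA=mlx5_0,mlx5_1
export NCCL_IB_GID_INDEX=3

# Match the TOS/DSCP used by the fabric's lossless traffic class
export NCCL_IB_TC=106

# Keep bootstrap/control traffic on the management NIC, off the fabric
export NCCL_SOCKET_IFNAME=eth0

# Allow GPUDirect RDMA when GPU and NIC share a PCIe switch
export NCCL_NET_GDR_LEVEL=PXB

# Verbose transport/ring logging for the first runs
export NCCL_DEBUG=INFO
```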
Always validate the fabric with perftest microbenchmarks (ib_write_bw, ib_read_lat) before running any training job. A misconfigured PFC will cause head-of-line blocking that silently degrades AllReduce throughput by 40–60%.
Mounting GPFS as the Shared Filesystem Across Slurm & Kubernetes
IBM Spectrum Scale (formerly GPFS) provides a POSIX-compliant parallel filesystem capable of serving thousands of clients at multi-terabyte-per-second aggregate throughput. For AI training, this means checkpoint writes, dataset reads, and model artifact storage all flow through a single coherent namespace shared between Slurm compute nodes and Kubernetes pods.
Spectrum Scale on Compute Nodes
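Bringing a GPU worker into the Spectrum Scale cluster as a client typically follows this sequence. The filesystem name (gpfs0) and node name are placeholders, and the mmaddnode/mmchlicense steps run from an existing quorum node.

```shell
# On the new compute node: build the GPFS portability layer
# against the running kernel
/usr/lpp/mmfs/bin/mmbuildgpl

# From a quorum node: add the node, accept the client license,
# and start the GPFS daemon on it
mmaddnode -N gpu-node-01
mmchlicense client --accept -N gpu-node-01
mmstartup -N gpu-node-01

# Mount the filesystem and verify the mount across the cluster
mmmount gpfs0 -N gpu-node-01
mmlsmount gpfs0 -L
```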
Exposing Spectrum Scale to Kubernetes Pods via PVC
The cleanest integration path is the IBM Spectrum Scale CSI driver, which provisions PersistentVolumes backed directly by GPFS filesets. Pods see a standard PVC mount at any path, but the underlying I/O hits the parallel filesystem natively.
Use the ReadWriteMany access mode for all training PVCs. This is Spectrum Scale's strength: multiple pods on different nodes can write checkpoints and read sharded datasets simultaneously without coordination overhead. InfiniBand or high-speed Ethernet between NSD servers and compute nodes is mandatory at this scale.
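Assuming the IBM Spectrum Scale CSI driver is installed, a fileset-backed ReadWriteMany claim could be declared like this. The filesystem name, cluster ID, and namespace are placeholders for your environment.

```yaml
# Hypothetical StorageClass/PVC pair backed by a Spectrum Scale fileset.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: spectrum-scale-fileset
provisioner: spectrumscale.csi.ibm.com
parameters:
  volBackendFs: gpfs0          # GPFS filesystem backing the volumes
  clusterId: "1234567890"      # from mmlscluster on the Scale side
reclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
  namespace: ai-training
spec:
  accessModes:
    - ReadWriteMany            # concurrent writers on different nodes
  storageClassName: spectrum-scale-fileset
  resources:
    requests:
      storage: 10Ti
```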
The Bridge: Making Slurm Launch Kubernetes Workloads
This is the core architectural challenge. The solution relies on Slurm prolog/epilog scripts that orchestrate the Kubernetes API — specifically creating and destroying PyTorchJob CRDs (via Kubeflow Training Operator) when Slurm allocates and releases nodes.
The Prolog/Epilog Pattern
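A minimal sketch of the pair, assuming a dedicated bridge kubeconfig at /etc/slurm/bridge-kubeconfig and PyTorchJobs named train-&lt;jobid&gt; (both names are illustrative):

```shell
# prolog.sh -- runs on each allocated node before the job starts:
# tag the node so the PyTorchJob's nodeSelector can target exactly
# the nodes Slurm reserved for this job.
kubectl --kubeconfig /etc/slurm/bridge-kubeconfig \
  label node "$(hostname -s)" slurm-job="${SLURM_JOB_ID}" --overwrite

# epilog.sh -- runs when the job ends: tear the PyTorchJob down and
# release the node back to the shared pool.
kubectl --kubeconfig /etc/slurm/bridge-kubeconfig -n ai-training \
  delete pytorchjob "train-${SLURM_JOB_ID}" --ignore-not-found
kubectl --kubeconfig /etc/slurm/bridge-kubeconfig \
  label node "$(hostname -s)" slurm-job-
```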
The Slurm Job Script — Launching a PyTorchJob
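The job script itself can stay small: reserve nodes and GPUs with Slurm, render a PyTorchJob manifest from the allocation, and block until the CRD completes so Slurm accounting reflects real wall time. The template file name, partition, and namespace below are illustrative.

```shell
#!/bin/bash
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=8
#SBATCH --gres=gpu:8
#SBATCH --partition=gpu-h100
#SBATCH --time=72:00:00

# Template the manifest from Slurm's allocation so the Kubernetes
# request matches the Slurm reservation exactly.
export JOB_NAME="train-${SLURM_JOB_ID}"
export NUM_WORKERS="${SLURM_JOB_NUM_NODES}"
export GPUS_PER_NODE=8

envsubst < pytorchjob.tmpl.yaml | kubectl -n ai-training apply -f -

# Block until the PyTorchJob reports success, then propagate status
kubectl -n ai-training wait --for=condition=Succeeded \
  "pytorchjob/${JOB_NAME}" --timeout=72h
```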
GPU Operator, Device Plugins & RDMA-Aware Pod Scheduling
The Kubernetes GPU Operator automates the entire GPU software stack — drivers, container toolkit, device plugin, and DCGM exporter. But for RoCEv2-backed NCCL training, we also need RDMA device exposure inside pods via the RDMA Shared Device Plugin.
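Inside the PyTorchJob worker template, each replica then requests both resource types. The RDMA resource name below is whatever your RDMA Shared Device Plugin ConfigMap defines (rdma/rdma_shared_device_a is a common example, not a fixed name); nvidia.com/gpu comes from the GPU Operator's device plugin.

```yaml
# Per-replica container fragment in the PyTorchJob worker spec
resources:
  limits:
    nvidia.com/gpu: 8
    rdma/rdma_shared_device_a: 1
securityContext:
  capabilities:
    add: ["IPC_LOCK"]   # required to pin memory for RDMA registration
```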
Topology-Aware Pod Placement
For best GPU-to-GPU RDMA performance, pods must land on nodes where the GPU and the RoCEv2 NIC share the same PCIe root complex (same NUMA socket). Use Topology Manager and CPU Manager in Kubernetes to enforce this.
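A kubelet configuration fragment enforcing this alignment might look like the following (the CPU reservation is an example):

```yaml
# With single-numa-node, a pod requesting GPUs + RDMA devices is only
# admitted onto a node if all requested devices and CPUs can be
# satisfied from one NUMA node.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
reservedSystemCPUs: "0-3"   # keep system daemons off training cores
```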
What This Architecture Delivers in Practice
After running this stack in production across several large-scale LLM pre-training runs on an 8-node × 8× H100 SXM5 cluster, the operational lessons below matter as much as the raw throughput numbers.
What to Watch Out For in Production
Checkpoint Strategy on Spectrum Scale
Configure PyTorch DCP (Distributed Checkpoint) to write to /gpfs/projects/checkpoints/{job_id}/. Spectrum Scale handles concurrent multi-writer checkpoints efficiently via its metadata caching architecture. Raise mmchconfig maxFilesToCache (e.g. to 8192000) on the GPFS client nodes for large training jobs that open thousands of files concurrently.
Slurm Resource Binding vs Kubernetes Requests
A common pitfall: Slurm's --gres=gpu:8 reserves the GPUs at the OS level via cgroups, but the Kubernetes GPU request must match that count exactly. If Kubernetes allocates 7 GPUs on a node that Slurm reserved 8 for, the 8th GPU is orphaned. Enforce matching by templating the PyTorchJob manifest from Slurm environment variables as shown in Section 05.
RoCEv2 Packet Loss Detection
NCCL will hang silently on even 0.01% packet loss if ECN/PFC is misconfigured. Install DCGM and monitor DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL alongside switch counters. Set up alerting on cnp_sent (Congestion Notification Packets) — a spike here means your fabric is congested and training throughput will degrade.
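A lightweight way to watch the NIC-side counters directly (mlx5 hw_counters sysfs paths shown; the device name is a placeholder):

```shell
# Poll RoCE health counters on the ConnectX NIC.
# Rising np_cnp_sent  => receivers are seeing ECN marks (congestion).
# Rising out_of_sequence => likely PFC misconfiguration / packet loss.
CTRS=/sys/class/infiniband/mlx5_0/ports/1/hw_counters
for c in np_cnp_sent rp_cnp_handled np_ecn_marked_roce_packets out_of_sequence; do
  printf '%-30s %s\n' "$c" "$(cat "$CTRS/$c")"
done
```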
IBM Spectrum Scale Quota Management
Use GPFS filesets and quotas per project namespace: mmcrfileset gpfs0 project-nlp --inode-space new then link and set quota. This prevents any single training job from consuming the entire filesystem and creates natural accounting boundaries aligned with your Slurm partitions and Kubernetes namespaces.
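The full sequence, with example quota sizes and junction path:

```shell
# Create an isolated fileset for the project, link it into the
# namespace, and cap its block/inode usage (soft:hard limits)
mmcrfileset gpfs0 project-nlp --inode-space new
mmlinkfileset gpfs0 project-nlp -J /gpfs/projects/nlp
mmsetquota gpfs0:project-nlp --block 50T:60T --files 10M:12M

# Verify the quota took effect
mmlsquota -j project-nlp gpfs0
```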
Least-Privilege RBAC for the Slurm Bridge
Grant the service account used by the prolog/epilog scripts only what the bridge needs: get/list/update on nodes, create/delete/get on pytorchjobs, and delete on pods in the ai-training namespace only. Never use an admin kubeconfig in Slurm scripts.
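Expressed as Kubernetes RBAC, that scoping splits into a namespaced Role and a small ClusterRole, since node objects are cluster-scoped. Object names here are illustrative.

```yaml
# Namespaced permissions: PyTorchJobs and pod cleanup in ai-training only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: slurm-bridge
  namespace: ai-training
rules:
  - apiGroups: ["kubeflow.org"]
    resources: ["pytorchjobs"]
    verbs: ["create", "delete", "get"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete"]
---
# Cluster-scoped permissions: node labeling by the prolog/epilog scripts
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: slurm-bridge-nodes
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "update"]
```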
The Future is Hybrid, Not Either/Or
The instinct to standardize on one orchestrator — either Slurm or Kubernetes — is understandable but increasingly counterproductive. The HPC world has decades of scheduler maturity, accounting rigor, and bare-metal performance that cloud-native tooling is only beginning to replicate. Conversely, the AI ecosystem — PyTorch, Kubeflow, Ray, Volcano, MLflow — is built natively for Kubernetes.
The architecture described here is not a hack or a workaround. It is an intentional composition of two scheduling layers, each operating at the level of abstraction it's best suited for. Slurm thinks in nodes and partitions. Kubernetes thinks in pods and operators. RoCEv2 speaks in microseconds. Spectrum Scale speaks in petabytes. Put them together with care, and you get an AI training platform that matches or exceeds what the large cloud providers offer — on your own hardware, under your own SLAs.
The teams building foundation models at national labs, automotive OEMs, and financial institutions are already running this stack. The tooling has matured. The patterns are proven. The only remaining barrier is organizational — and that's a different blog post entirely.
Found this useful?
If you're working on HPC + Kubernetes integration, GPU cluster optimization, or distributed AI training infrastructure, let's connect and exchange notes.
Connect on LinkedIn →