As someone who worked for a while and still works in HPC, my impression from thi...

pama · 2026-04-17T11:07:25 1776424045

HPC centers never loved the inefficiencies of anything virtualized (VMs or any containers really), so the shell hacks of module enabled a (limited, but workable) level of reproducibility that was sufficiently composable and usable by researchers who understood the shell. I am not going to defend this Tcl hack any further, but I can see how it was the path of least resistance when people tried to stay close to the bare metal of their large clusters while keeping some level of sanity. Slurm is a more defensible choice, but I agree that these tools are from a different era of compute. I grew to love and hate these tools, but they definitely represent an acquired taste, like a durian fruit; not like an apple.

Your centos6 references made me chuckle :-)

pphysch · 2026-04-17T15:22:16 1776439336

I promise you that the main reason HPC is behind on virtualization is not the little bit of overhead. There are a dozen other inefficiencies in the average HPC workload that are more significant.

Most centers don't even have good real-time observability systems to diagnose systemic inefficiencies, leaving application/workload profiling purely up to user-space.

The HP in HPC has really been watered down over the last couple decades, and "IT for computational research" would be a more accurate name. You can do genuinely high-performance computing there, but you'll be an outlier.

saltcured · 2026-04-17T22:41:18 1776465678

It's a mixture of legacy and reality.

For one, the assumption has been that you had dedicated use of all the nodes and communication network. It would kill your performance if your local node CPU scheduler was interfering with having your actual HPC program active when the messages were coming in from its peer tasks on the other nodes, since parallel jobs are limited in the end by the critical path latency of the cross-node communications.

It's only on the most "embarrassingly parallel" end of the spectrum where you can tolerate a bunch of virtualization and non-determinism, because the tasks communicate so infrequently or via such asynchronous mechanisms that they don't really impact the throughput of the whole job if they are asleep at random times.
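To make the critical-path point concrete, here is a minimal toy model, not taken from any real cluster. It assumes every node does the same amount of work per step and occasionally gets descheduled for a fixed extra delay; the function names and all parameters (`p_jitter`, `jitter`, `work`) are made up for illustration.

```python
import random

def bsp_time(nodes, steps, work=1.0, p_jitter=0.01, jitter=0.5, seed=0):
    """Tightly coupled job: every step ends at a barrier, so each step
    costs as much as the slowest node in that step."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        total += max(
            work + (jitter if rng.random() < p_jitter else 0.0)
            for _ in range(nodes)
        )
    return total

def independent_time(nodes, steps, work=1.0, p_jitter=0.01, jitter=0.5, seed=0):
    """Embarrassingly parallel job: nodes never wait for each other, so the
    job ends when the slowest node finishes its own sequence of steps."""
    rng = random.Random(seed)
    return max(
        sum(work + (jitter if rng.random() < p_jitter else 0.0)
            for _ in range(steps))
        for _ in range(nodes)
    )

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, bsp_time(n, 200), independent_time(n, 200))
```

In this model a 1% chance of a hiccup per node per step barely matters for the independent job, but with a barrier after every step and enough nodes, almost every step pays the full delay because some node is nearly always the straggler.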

But HPC systems also were very "unique". It wasn't all Linux; there were a dozen different vendors' Unix variants with very different personalities. And for the bleeding-edge systems, each deployment was practically its own dialect of that vendor OS. Running a job was like cross-compiling to a one-of-a-kind target. There was no generic platform where you could expect to build an app once and ship it around to whichever supercomputer was available.

pphysch · 2026-04-17T23:27:50 1776468470

Agreed on all points and this captures the history well.

zozbot234 · 2026-04-17T13:07:28 1776431248

Containers are an OS sandboxing/namespacing primitive; they don't involve any overhead on their own. The overhead depends on what's inside the container besides a single deployed binary.

pama · 2026-04-18T14:39:59 1776523199

What you say is true after the container starts. Typical HPC codes are tuned to raw hardware so they assume full ownership of the hardware anyways. When HPC was developing 30 years ago we didn't have clean ways to avoid overheads in the regime of 10k nodes. Instead we got parallel filesystems, caching, and shell, with module, which technically did the job for reproducible runs at a huge human cost.

anewhnaccount2 · 2026-04-17T10:29:47 1776421787

How should it be better? Most environments offer Apptainer, which can import Docker containers. Plus a lot of these languages like Julia and Chapel are pretty self-contained and programmed against e.g. ancient libc for these very reasons.

sliken · 2026-04-17T16:21:12 1776442872

As you dig deeper I think you'll find a method behind the madness.

Sure, modules just play with env variables. But it's easy to inspect (module show), easy to document ("use module load ..."), allows admins to change the default when things improve or bugs get fixed, and also allows users to pin the version. It's very transparent, very discoverable, and very "stable". Research needs dictate that you can reproduce research from years past. It's much easier to look at your output file and see the exact version of compiler, MPI stack, libraries, and application than trying to dig into a container build file or similar. Not to mention it's crazy more efficient to look at a few lines of output than to keep the container around.
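For anyone who hasn't looked under the hood, this is roughly what "just play with env variables" means. It's a simplified sketch in Python, not the actual Tcl implementation; the `load_module` helper, the dict layout, and the gcc paths are all hypothetical, and the only thing borrowed from the real tool is the idea of a `LOADEDMODULES`-style list.

```python
def load_module(env, module):
    """Return a copy of env with the module's changes applied.

    `module` is a plain dict standing in for a modulefile (an assumption
    of this sketch): 'setenv' overwrites variables, 'prepend_path' puts a
    directory in front of a colon-separated search path.
    """
    new_env = dict(env)  # never mutate the caller's environment
    for var, value in module.get("setenv", {}).items():
        new_env[var] = value
    for var, path in module.get("prepend_path", {}).items():
        old = new_env.get(var, "")
        new_env[var] = path + (":" + old if old else "")
    # Record what was loaded so the exact version can be printed later.
    loaded = new_env.get("LOADEDMODULES", "")
    new_env["LOADEDMODULES"] = (loaded + ":" if loaded else "") + module["name"]
    return new_env

# Hypothetical module definition; names and paths are illustrative only.
GCC = {
    "name": "gcc/12.2.0",
    "setenv": {"CC": "gcc"},
    "prepend_path": {
        "PATH": "/opt/gcc/12.2.0/bin",
        "LD_LIBRARY_PATH": "/opt/gcc/12.2.0/lib64",
    },
}

if __name__ == "__main__":
    env = load_module({"PATH": "/usr/bin"}, GCC)
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

Because the whole effect is a handful of variables, dumping them at the top of a job's output is enough to record which compiler and libraries a run used.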

As for slurm, I find it quite useful. Your main complaint is no default systemd service files? It's not like it's hard to set up systemd and dependencies. Slurm's job is scheduling, which involves matching job requests for resources, deciding who to run, and where to run it. It does that well and runs jobs efficiently. Cgroup v2, pinning tasks to the CPUs they need, placing jobs on the CPU closest to the GPU they're using, etc. When combined with PMIX2 it allows impressive launch speeds across large clusters. I guess if your biggest complaint is the systemd service files that's actually high praise. You did mention logging. I find it pretty good: you can increase the verbosity, focus on the server (slurmctld) or client side (slurmd), and turn on just what you are interested in, like say +backfill. I've gotten pretty deep into the weeds and basically everything slurm does can be logged, if you ask for it.
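To illustrate the "who runs, and where" part, here's a deliberately tiny sketch. It is not how Slurm works internally; it's a greedy first-fit pass in priority order, with made-up job and node names, and it ignores time limits, GPUs, topology, and fairshare. Real backfill also checks that starting a small job won't delay a bigger one, which this toy skips.

```python
def schedule(jobs, free_cpus):
    """Greedy placement: highest priority first, first node that fits.

    jobs:      list of dicts with 'name', 'priority', 'cpus' (hypothetical shape)
    free_cpus: dict mapping node name -> number of idle CPUs
    Returns (placed, pending): job name -> node, and names left waiting.
    """
    free = dict(free_cpus)  # work on a copy
    placed, pending = {}, []
    for job in sorted(jobs, key=lambda j: -j["priority"]):
        node = next((n for n, c in free.items() if c >= job["cpus"]), None)
        if node is None:
            pending.append(job["name"])  # nothing fits right now
        else:
            free[node] -= job["cpus"]
            placed[job["name"]] = node
    return placed, pending

if __name__ == "__main__":
    jobs = [
        {"name": "big", "priority": 5, "cpus": 16},
        {"name": "urgent", "priority": 10, "cpus": 8},
        {"name": "small", "priority": 1, "cpus": 4},
    ]
    print(schedule(jobs, {"n1": 8, "n2": 4}))
```

Here the low-priority job still starts because it fits in the leftover CPUs while the 16-CPU job waits, which is the basic intuition behind filling gaps with small jobs.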

Sounds like you've used some poorly run clusters. I don't doubt it, but I wouldn't assume that's HPC in general. I've built HPC clusters and did not use the university's AD, specifically because it wasn't reliable enough. IMO a cluster should continue to schedule and run jobs, even if the uplink is down. Running a past-EoL OS on an HPC cluster is definitely a sign that it's not run well, and it seems common when a heroic student ends up managing a cluster and then graduates, leaving the cluster unmanaged. Sadly it's pretty common for IT to run an HPC cluster poorly; it's really a different set of constraints, thus the need for an HPC group.

Plenty of HPC clusters out there are happy to support the tools that help their users get the most research done.

rirze · 2026-04-17T19:53:10 1776455590

It's been years since I last used `slurm`. Thanks for the blast from the past.

Mikhail_K · 2026-04-18T10:58:26 1776509906

> you realize they've been written for pre-systemd Linux

So still retaining some kind of sanity and good engineering practices?