Tales from the HPC Support Trenches: Why is Gromacs Slower Now?
Greetings from the HPC Support trenches! Here are your standard-issue SSH credentials to various clusters, your How To Interact With Users Without Strangling Them Handbook, and your radio should you need to call for backup from the admin mages. Don’t worry, they’ll answer you. Sometimes. Till then, buck up, and start debugging!
So we had an old cluster that underwent some major upgrades. It was major enough that it warranted a full rebuild and update of all the modules we had installed. One of those modules was Gromacs, which was upgraded from 2020.6 to 2024.1.
Gromacs is a molecular dynamics simulation software package. For the purposes of this blog, all we need to know is that Gromacs is given an input file with initial conditions and, after running for the designated number of steps in the simulation, will report a value in ns/day, which is the indicator of performance. It indicates how many nanoseconds of the MD simulation can be performed when run for a whole day (here’s a basic explanation). Obviously, the higher the better.
A user opened a ticket saying that a Gromacs job that they ran before the big upgrade is now significantly slower. Like 10x slower going from approximately 30 ns/day to approximately 3 ns/day. They were using the Gromacs module that we had installed on the cluster in both cases. This was, obviously, a problem. But why though? By all accounts, we went to a newer version after the upgrade and surely the way the newer Gromacs was compiled on this system shouldn’t be so different that it would be 10x slower?
The reason, as I would find out after much time spent cursing life and computers in general, was a combination of things.
The first, most obvious problem was the job script. The compute nodes in the cluster were 32-core Intel Xeons. The cluster uses Slurm as its job scheduler. The old job script (the one used before the upgrade) set the Slurm parameters `-N 4 --ntasks-per-node 4 -c 8` and also set `OMP_NUM_THREADS=8`. This meant that Gromacs was started with 4 MPI tasks per node across 4 nodes (so 16 tasks total), and each task had 8 cores to use for OpenMP multithreading. So all 32 cores in each node (4 tasks x 8 cores per task) were being used for the simulation. The new job script set the parameters `-N 4 -n 4 -c 8` and set `OMP_NUM_THREADS=8`. Notice `-n` instead of `--ntasks-per-node`. `-n` is unfortunately not an alias for `--ntasks-per-node`; it is an alias for `--ntasks`, i.e. the total number of MPI tasks across all the nodes, not tasks per node. This meant that there were only 4 MPI tasks running instead of 16, with 1 task on each node and each task using 8 cores. So we were only using 1/4 of the hardware resources available to us, and of course it’s not surprising that it ran slower. This should’ve been blindingly obvious from the beginning and should’ve been the first thing I tried fixing before rerunning. But of course, I’m blindly stupid (or stupidly blind) and didn’t notice this, and decided to spend the next few days trying different builds of Gromacs to see if that would make it faster (it would, and we’ll talk about different Gromacs builds later). But the most obvious performance improvement was staring me in the face. So anyway, that’s three days of my life I won’t be getting back.
So once I fixed this in the job script, how did it perform?
Before (with the installed Gromacs module):

                   (ns/day)    (hour/ns)
    Performance:      3.055        7.855

After (changing `-n` to `--ntasks-per-node`, with the installed Gromacs module):

                   (ns/day)    (hour/ns)
    Performance:      7.276        3.299
Wow, really? It’s an improvement, but not that much of an improvement. We’re now using all 32 cores on each of the 4 nodes instead of just 8 cores per node. So naively, you’d expect a 4x improvement over the initial bad performance. But that’s not what we see. We need to look further to see if there is something missing in the Gromacs build itself.
Which brings us to the second problem. To make a long story of many trials and tribulations short, the answer was SIMD (or the lack thereof). See, the Gromacs module we had installed in our software stack was compiled with SIMD disabled (which, when you are configuring Gromacs with CMake, you can do by passing the `-DGMX_SIMD=None` flag). If this flag is not set, CMake by default will detect the SIMD instruction set supported by the CPU architecture and pick the best option (see more on Gromacs SIMD support here). So I wanted to try building Gromacs with SIMD not disabled (and let CMake autodetect the best SIMD option). I decided to build several variants of Gromacs 2024.5 to test with, figuring that was close enough to the module’s Gromacs 2024.1 version.
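For reference, the difference at configure time is just that one flag. A rough sketch (the install prefixes are placeholders, and the SIMD level CMake picks will depend on the actual Xeon generation):

```bash
# How the old module was effectively configured: SIMD explicitly disabled
cmake .. -DGMX_SIMD=None -DCMAKE_INSTALL_PREFIX=/opt/gromacs/no-simd

# What I wanted to test: leave GMX_SIMD unset and let CMake autodetect;
# on a reasonably modern Xeon it should land on something like AVX2_256 or AVX_512
cmake .. -DCMAKE_INSTALL_PREFIX=/opt/gromacs/simd
```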
Besides SIMD, there were other build options that could be varied and that might have an impact on performance. I decided to look at which library to use for FFT operations, controlled by passing `-DGMX_FFT_LIBRARY=fftw3` for FFTW or `-DGMX_FFT_LIBRARY=mkl` for Intel MKL as the chosen FFT library. If choosing fftw3, Gromacs will build FFTW as part of its own build process when you set `-DGMX_BUILD_OWN_FFTW=ON`, which is what I did. If choosing mkl, you need to have the MKL library already installed. I also wanted to see what the performance would be like when compiling these variations with the GCC 12.2.0 compiler and the Intel oneAPI 2025.0.1 compiler (these were the compilers that were available); rough configure lines are sketched after the list below.
So, in summary, I wanted to build and test four variations of Gromacs 2024.5 (each built with SIMD enabled, and with my job script making sure all the cores are used when running):

- GCC 12.2.0 + MKL 2025.0.1
- GCC 12.2.0 + FFTW
- Intel 2025.0.1 + MKL 2025.0.1
- Intel 2025.0.1 + FFTW
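Here is a hedged sketch of what those configure lines look like. The compiler names and the MPI option are my assumptions about how a multi-node build like this is typically set up, and MKL is assumed to be discoverable from whatever the oneAPI module puts in the environment:

```bash
# GCC 12.2.0 + FFTW: Gromacs downloads and builds its own FFTW
CC=gcc CXX=g++ cmake .. \
    -DGMX_MPI=ON \
    -DGMX_FFT_LIBRARY=fftw3 -DGMX_BUILD_OWN_FFTW=ON

# Intel oneAPI 2025.0.1 + MKL: MKL must already be installed and discoverable
CC=icx CXX=icpx cmake .. \
    -DGMX_MPI=ON \
    -DGMX_FFT_LIBRARY=mkl
```

The other two variants are just the remaining compiler/FFT combinations of the same flags.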
Here’s the ns/day for each combination when I ran the test job. These were singular runs but still illustrative. Trying to get multiple runs takes FOREEEEEEVVVER when you have to submit a job to the cluster’s job scheduler and wait.
| ns/day         | MKL 2025.0.1 | FFTW   |
|----------------|--------------|--------|
| GCC 12.2.0     | 18.284       | 21.454 |
| Intel 2025.0.1 | 26.909       | 38.862 |
You can see things are significantly better. And Intel+FFTW is the best overall, even better than the reported best performance from before the cluster upgrade.
But why does Intel+FFTW perform so well? My best guess is that the Intel compiler is well optimized for the Intel Xeon CPUs in the system. From what I could tell from the make output, both builds use `-O3` when building. It’s possible there are other optimization flags that could be applied to make the GCC build go faster, but nothing stood out to me in the compilation output as being wildly different between them. The Gromacs documentation also mentions that FFTW is faster than MKL based on their tests, which matches the performance difference between MKL and FFTW that we see here.
There are other avenues that I could still investigate: for example, even if I specify I want to use FFTW for FFT operations, Gromacs will still use MKL for LAPACK and BLAS operations if it is able to discover it at CMake configuration time (for both the Intel and GCC builds). It might be interesting to see if explicitly hiding MKL at configuration time (forcing Gromacs to use its own internal LAPACK and BLAS implementations) makes a huge difference in performance. I did try it, but the performance doesn’t seem to be very different from the versions that did use MKL for LAPACK and BLAS (at least, not for this particular user’s input file). So I’m fine with leaving things at that for now. There are other users’ tickets to get to. This was enough information to pass on to the folks who build the modules on the cluster, and to the user so they could build Gromacs themselves if they didn’t want to wait for a new Gromacs module.
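If you want to repeat that experiment, my understanding (an assumption worth checking against the install guide for your Gromacs version) is that it comes down to the external BLAS/LAPACK switches at configure time:

```bash
# Force Gromacs to build and use its bundled BLAS/LAPACK,
# even if MKL is discoverable in the environment
cmake .. -DGMX_EXTERNAL_BLAS=OFF -DGMX_EXTERNAL_LAPACK=OFF
```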
P.S. I want to give credit to this website, which was a useful resource when I was testing all this out. I would still be in the middle of this if I hadn’t been taught about the `-nsteps` flag to force the simulation to only run 10000 steps instead of the many hundreds of thousands the original input file specified.
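In practice that just means tacking the flag onto the mdrun command in the job script (the .tpr filename here is a placeholder):

```bash
# Override the step count from the input file: run only 10000 steps
srun gmx_mpi mdrun -s topol.tpr -nsteps 10000
```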