Test failures: constrained NVE on AMD EPYC (GCC 13.2 + OpenMPI 5.0.2)

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.


Moderators: Global Moderator, Moderator

lipsky
Newbie
Posts: 1
Joined: Fri Feb 28, 2025 12:10 pm

Test failures: constrained NVE on AMD EPYC (GCC 13.2 + OpenMPI 5.0.2)

#1 Post by lipsky » Mon Feb 16, 2026 9:29 am

Dear VASP community,

I have recently compiled VASP 6.5.1 and would like to report some testsuite failures and ask whether they represent genuine issues with our build or known reference incompatibilities. Ultimately, I would like to know whether this compilation is reliable despite the test errors.

HARDWARE

The cluster has two node types:
- 180× Lenovo SR645 (sr nodes): 2× AMD EPYC 7H12 (Zen 2, 64c/socket, 128c/node), 512GB RAM, HDR100 InfiniBand
- 34× Lenovo SR645 v3 (bc nodes): 2× AMD EPYC 9754 (Zen 4c Bergamo, 128c/socket, 256c/node), 768GB RAM, 2× HDR200 InfiniBand
OS: OpenSUSE 15.4

---

TOOLCHAIN

We use GCC 13.2 + OpenMPI 5.0.2 with the following key libraries:
- OpenBLAS 0.3.26 (runtime)
- MKL 2021.4 (linked as BLAS/LAPACK backend for ELPA compatibility)
- ScaLAPACK 2.2.0
- ELPA 2021.11.002 (OpenMP variant: -lelpa_openmp)
- FFTW 3.3.10 (with _threads variant)
- HDF5 1.14.4
- Wannier90 3.1.0

Key compilation flags:
-march=znver4 -mtune=znver4 -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fopenmp

Two binaries were produced:
- znver4: 323,248 AVX-512 zmm instructions, confirmed via objdump (see the command sketch below)
- znver2: ~520 zmm instructions (all from MKL, none in the VASP code itself)
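
For reference, the zmm instruction counts above can be reproduced with a one-liner along these lines (the binary name vasp_std is just an example; run it for each build):

Code: Select all

objdump -d vasp_std | grep -c 'zmm'

This counts every disassembled instruction line that references a zmm register, which is a reasonable proxy for AVX-512 usage.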

Note: ELPA was compiled against MKL (LP64 interface). Using OpenBLAS directly alongside ELPA caused DGETRF integer interface errors, so MKL is used as the BLAS/LAPACK backend while OpenBLAS is loaded for runtime symbol resolution only.

---

RUNTIME CONFIGURATION

Hybrid MPI+OpenMP: 32 ranks × 8 OpenMP threads = 256 cores per bc node.
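
For reference, a minimal sketch of how such a hybrid launch can be expressed with OpenMPI (the mapping/binding options are an assumption; adapt them to your scheduler and node layout):

Code: Select all

export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=close
mpirun -np 32 --map-by ppr:32:node:PE=8 --bind-to core ./vasp_std

This pins 32 ranks per node with 8 cores each, i.e., 256 cores on a bc node.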
ELPA warning at runtime: "MPI threading level MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE required but your implementation does not support this. The number of OpenMP threads within ELPA will be limited to 1." We understand this is a compile-time limitation of OpenMPI on our cluster and accept it.

---

TESTSUITE RESULTS

The testsuite was run with LSCALAPACK=.FALSE. using 4 MPI ranks, OMP_NUM_THREADS=1.
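
For reference, a minimal sketch of how such a testsuite run is typically set up (the VASP_TESTSUITE_* variable names follow the testsuite documentation; paths are placeholders):

Code: Select all

export OMP_NUM_THREADS=1
export VASP_TESTSUITE_EXE_STD="mpirun -np 4 $PWD/bin/vasp_std"
# analogous VASP_TESTSUITE_EXE_GAM / VASP_TESTSUITE_EXE_NCL variables exist for the other executables
export VASP_TESTSUITE_INCAR_PREPEND="LSCALAPACK = .FALSE."
make test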

Failed tests:

1. andersen_nve_constrain_fixed and andersen_nve_constrain_fixed_RPR
- andersen_nve PASSES
- andersen_nve_constrain_fixed FAILS
- Energy differences appear from step 3 onward, suggesting trajectory divergence
- RANDOM_SEED is explicitly set in INCAR (245175543 3381 0)
- Failure is identical on znver2 and znver4 binaries
- Failure persists after removing -ffast-math and recompiling

My question: Is this a known GCC-vs-Intel reference incompatibility for constrained NVE MD?

2. HEG_333_LW
- We suspect this fails due to LSCALAPACK=.FALSE. in our testsuite run since it is categorized as RPA/GW and likely requires ScaLAPACK's distributed eigensolver. Is this correct?

3. SiC8_GW0R
- This test appears to hang indefinitely on our AMD EPYC nodes. We have seen reports of this being an AMD-specific known issue. Can you confirm?

---

All other tests pass. Production calculations (LDA+U, relaxations, static SCF) produce physically reasonable results. We would greatly appreciate confirmation on whether the three failure categories above are known issues with GCC-compiled VASP on AMD hardware, or whether they indicate a genuine problem with our build.

Thank you,
Felipe


andreas.singraber
Global Moderator
Posts: 371
Joined: Mon Apr 26, 2021 7:40 am

Re: Test failures: constrained NVE on AMD EPYC (GCC 13.2 + OpenMPI 5.0.2)

#2 Post by andreas.singraber » Mon Feb 23, 2026 11:57 am

Hello Felipe,

sorry for the late reply! It took some time to set up a similar toolchain on my side first and then perform various tests. Let me first clarify that while there are rare occasions where testsuite failures are understood and accepted, this should not be considered the typical case. I would refrain from considering a build "reliable" if it does not pass the testsuite!

Regarding the toolchain itself: unfortunately, I do not fully understand your mixture of OpenBLAS and MKL. Now, I am aware that one can probably build ELPA in such a way that it always calls into MKL, even if VASP itself is linked with OpenBLAS. However, you mention that

OpenBLAS is loaded for runtime symbol resolution only

... could you elaborate on what you mean here and how you do this? If ELPA+MKL are linked to VASP+OpenBLAS, I would expect that, e.g., a DGEMM call directly in VASP would resolve to OpenBLAS while one in ELPA would call into MKL, right? Is this what you mean? Did you verify that this is actually happening at runtime, e.g., with ldd and LD_DEBUG=bindings?
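
A minimal sketch of such a check could look like this (the binary name, the -x export flags, and the grep pattern are just examples):

Code: Select all

# which BLAS/LAPACK libraries does the binary pull in?
ldd vasp_std | grep -iE 'openblas|mkl'

# trace symbol bindings at runtime and look for DGEMM resolutions
LD_DEBUG=bindings LD_DEBUG_OUTPUT=bindings.log \
    mpirun -np 1 -x LD_DEBUG -x LD_DEBUG_OUTPUT ./vasp_std
grep -i dgemm bindings.log.*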

Please attach the makefile.include you used for your build and in addition the testsuite.log files you obtained from running the testsuite!

So far I tried to get close to your suggested toolchain but omitted the "ELPA link to MKL" part. Hence I have set up various toolchains based on this selection (no MKL involved here):

Code: Select all

GCC 13.2.0
OpenMPI 5.0.2
OpenBLAS 0.3.26
ScaLAPACK 2.2.0
ELPA various versions  (OpenMP variant: -lelpa_openmp):
 - 2021.11.001
 - 2022.02.001
 - 2023.05.001
 - 2024.03.001
FFTW 3.3.10 (OpenMP variant: -lfftw3_omp)
HDF5 1.14.4
Wannier90 3.1.0

The machine I am using has an AMD EPYC 7643 48-core processor, so unfortunately neither 2nd- nor 4th-generation Zen architecture like on your side. However, I tried to stay close to your compiler flags and used

Code: Select all

-march=znver3 -mtune=znver3 -O3 -ffast-math -funroll-loops -fomit-frame-pointer

Before discussing the individual results: I do not think you should ever set LSCALAPACK=.FALSE. during the testsuite run (e.g., via VASP_TESTSUITE_INCAR_PREPEND), because this will always result in both the HEG_333_LW and the SiC8_GW0R tests failing, as they both require ScaLAPACK. As far as I know, LSCALAPACK does not switch between ScaLAPACK and ELPA but rather enables/disables both of them! There is no INCAR tag to switch between the two; ELPA is automatically used if it is compiled in.

The baseline is a build without ELPA: here, on my side, all tests you mentioned pass.

Moreover, with any of the listed ELPA versions the tests pass, except for HEG_333_LW, which fails for two different reasons:

  • ELPA 2021.11.001 and 2022.02.001: the job hangs indefinitely

  • ELPA 2023.05.001 and 2024.03.001: the job crashes with

    Code: Select all

    free(): invalid pointer
    
    Program received signal SIGABRT: Process abort signal.
    

At the moment I do not have an explanation for the hang-ups or the crashes; this requires further investigation. You mentioned reports of hang-ups on AMD hardware... I am only aware of reports regarding Intel MPI on AMD hardware, not OpenMPI. Can you give me links to the reports you mean?

What worries me most are the failing andersen tests on your side. I do not think there are any differences in terms of ScaLAPACK/ELPA calls between the andersen_nve and andersen_nve_constrain_fixed tests, so the reason for the failure must lie somewhere else (maybe the MKL/OpenBLAS mixture). It would be really helpful to look into the testsuite logs for this.

All the best,
Andreas Singraber


andreas.singraber
Global Moderator
Posts: 371
Joined: Mon Apr 26, 2021 7:40 am

Re: Test failures: constrained NVE on AMD EPYC (GCC 13.2 + OpenMPI 5.0.2)

#3 Post by andreas.singraber » Mon Feb 23, 2026 8:25 pm

Hi again,

just an interesting update: I switched from the OpenMP version of ELPA to a non-threaded build of the library, i.e., now using

Code: Select all

ELPA various versions  (non-threaded variant: -lelpa):
 - 2021.11.001
 - 2024.03.001

and I had no more issues with the HEG_333_LW test! All tests you mentioned pass now! The ELPA manual suggests that threading within ELPA itself would not happen anyway if this warning occurs:

Code: Select all

ELPA warning at runtime: "MPI threading level MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE required but your implementation does not support this. The number of OpenMP threads within ELPA will be limited to 1." 

However (also according to the ELPA manual), if the linked MKL in your case supports threading, then this can still be controlled via OMP_NUM_THREADS. So there is actually no reason to stick to the threaded version of ELPA if your OpenMPI does not support it. I would suggest trying again with a non-threaded ELPA build; maybe then the issues with the other tests vanish as well?
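
For reference, a minimal sketch of what switching to the non-threaded ELPA build entails (the configure options, paths, and the ELPA_ROOT variable are assumptions; check the INSTALL notes of your ELPA version):

Code: Select all

# build ELPA without OpenMP (the default), which installs libelpa instead of libelpa_openmp
./configure FC=mpif90 CC=mpicc --prefix=$HOME/opt/elpa-2024.03.001
make -j && make install

# in makefile.include, link against the non-threaded library, e.g.
#   LLIBS += -L$(ELPA_ROOT)/lib -lelpa      # instead of -lelpa_openmp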

All the best,
Andreas Singraber

