KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups
Environment
- VASP version: 6.5.1
- Build: AOCC 5.0.0, OpenMPI 5.0.8, AOCL (BLIS/libFLAME/ScaLAPACK/FFTW)
- Hardware: 2x AMD EPYC 9224, 32 MPI ranks, OMP_NUM_THREADS=2, KPAR=2
Problem
VASP hangs randomly mid-SCF with KPAR=2. All output stops, but all 32 MPI ranks stay alive at ~100% CPU (spin-wait). The hang is not reproducible on demand; it occurs at random iterations across different calculations.
GDB backtrace analysis
I attached GDB to all 32 ranks during a hang. The backtraces show a clean 16/16 split, with each KPAR group blocked at a different point in the code:
Group A (16 ranks): stuck in PMPI_Reduce at kpar_sync_celtot -> eddiag -> elmin_all (electron_all.f90:679)
Group B (16 ranks): stuck in PMPI_Allreduce at soft_charge -> set_charge -> elmin_all (electron_all.f90:743)
Group A is waiting for Group B to participate in the inter-KPAR Reduce inside eddiag. Group B has already moved past eddiag and is waiting for all 32 ranks in a global Allreduce inside soft_charge. Neither can proceed — circular deadlock.
This suggests that one KPAR group can exit eddiag early, e.g. if its eigensolver converges in fewer iterations, and reach the global collective in soft_charge while the other group is still waiting in the inter-KPAR sync.
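To make the suspected failure mode concrete, here is a minimal MPI sketch of that collective-ordering mismatch. The communicator layout is my own stand-in for the KPAR split, not VASP's actual code, and the routine names in the comments just point back at the backtrace. Run on an even number of ranks, it hangs with the same 50/50 signature: half the ranks in MPI_Reduce, half in MPI_Allreduce, all spinning.

program kpar_ordering_sketch
  ! Hypothetical stand-in for the suspected deadlock, not VASP code.
  use mpi
  implicit none
  integer :: ierr, rank, nranks, group, comm_inter
  real(8) :: x, y

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)  ! assumes an even nranks

  ! Two "KPAR groups"; comm_inter joins corresponding ranks across groups.
  group = rank/(nranks/2)
  call MPI_Comm_split(MPI_COMM_WORLD, mod(rank, nranks/2), group, comm_inter, ierr)

  x = real(rank, 8)

  ! Divergent control flow: only group 0 enters the inter-KPAR Reduce
  ! (stand-in for the Reduce in kpar_sync_celtot). It blocks forever,
  ! because its partner rank in group 1 never makes the matching call.
  if (group == 0) then
     call MPI_Reduce(x, y, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm_inter, ierr)
  end if

  ! Group 1 arrives here first (stand-in for the Allreduce in soft_charge)
  ! and waits for group 0, which is still stuck above: circular deadlock.
  call MPI_Allreduce(x, y, 1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program kpar_ordering_sketch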
Representative backtraces (one rank from each group)
Group A:
PMPI_Reduce -> pmpi_reduce__ -> m_sum_master_d (mpi.f90:5847) -> m_sum_master_z (mpi.f90:5874)
-> kpar_sync_celtot (wave_mpi.f90:1469) -> subrot::eddiag (subrot.f90:829)
-> elmin_all (electron_all.f90:679) -> electronic_optimization (main.f90:5826)
Group B:
PMPI_Allreduce -> pmpi_allreduce__ -> m_sumb_d (mpi.f90:2474)
-> charge::soft_charge (charge.f90:146) -> us::set_charge (us.f90:1211)
-> elmin_all (electron_all.f90:743) -> electronic_optimization (main.f90:5826)
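For anyone who wants to reproduce the analysis: per-rank backtraces like these can be captured non-interactively with a loop of roughly this shape (the vasp_std process name and output file names are placeholders for whatever your build uses):

for pid in $(pgrep vasp_std); do
    # attach, dump the stack, detach (gdb -batch detaches on exit)
    gdb -q -batch -p "$pid" -ex 'bt' > "bt.rank.$pid.txt" 2>&1
done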
Full backtraces of all 32 ranks are attached along with all the calculation files. The attached case is a large system, a 160-atom geometry optimization with ALGO=All, with plenty of memory to spare on the node. The hang has also been seen in much smaller systems; it strikes at random and occurs more frequently at higher KPAR values.
Questions
1. Is the ordering of the inter-KPAR Reduce in kpar_sync_celtot relative to the global Allreduce in soft_charge guaranteed to be consistent across KPAR groups, even when the groups' eigensolvers converge in different numbers of iterations?
2. Is this a known issue, and is KPAR=1 the only expected workaround?
Thank you,
Aditya Putatunda