KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

aditya_putatunda
Newbie
Posts: 5
Joined: Fri Jul 04, 2025 12:32 pm

KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#1 Post by aditya_putatunda » Thu Mar 05, 2026 2:09 pm

Environment
- VASP version: 6.5.1
- Build: AOCC 5.0.0, OpenMPI 5.0.8, AOCL (BLIS/libFLAME/ScaLAPACK/FFTW)
- Hardware: 2x AMD EPYC 9224, 32 MPI ranks, OMP_NUM_THREADS=2, KPAR=2

Problem
VASP hangs randomly mid-SCF with KPAR=2. All output stops, but all 32 MPI ranks stay alive at ~100% CPU (spin-wait). Not reproducible on demand — occurs at random iterations across different calculations.

GDB backtrace analysis
I attached GDB to all 32 ranks during a hang. The result is a clean 16/16 split — one KPAR group at a different point in the code than the other:

Group A (16 ranks): stuck in PMPI_Reduce at kpar_sync_celtot -> eddiag -> elmin_all (electron_all.f90:679)

Group B (16 ranks): stuck in PMPI_Allreduce at soft_charge -> set_charge -> elmin_all (electron_all.f90:743)

Group A is waiting for Group B to participate in the inter-KPAR Reduce inside eddiag. Group B has already moved past eddiag and is waiting for all 32 ranks in a global Allreduce inside soft_charge. Neither can proceed — circular deadlock.

This suggests one KPAR group can exit eddiag and reach the global collective in soft_charge before the other group finishes the inter-KPAR sync, if its eigensolver converges in fewer iterations.
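To illustrate the mechanism, here is a minimal toy program (my own, not VASP code) that ends in the same state: half the ranks enter a global Allreduce while the other half are still in an earlier Reduce on the same communicator. Mismatched collective ordering is formally undefined behavior in MPI; in practice both sides spin-wait forever at ~100% CPU, matching the signature above.

Code: Select all

! Toy reproducer, illustrative only (not VASP source). Run with
! e.g. mpirun -np 4 and watch it hang.
PROGRAM collective_mismatch
  USE mpi
  IMPLICIT NONE
  INTEGER :: rank, ierr, sendval, recvval
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  sendval = rank
  IF (MOD(rank,2) == 0) THEN
     ! "Group B": already past the sync point, in the global Allreduce
     CALL MPI_ALLREDUCE(sendval, recvval, 1, MPI_INTEGER, MPI_SUM, &
                        MPI_COMM_WORLD, ierr)
  ELSE
     ! "Group A": still waiting in the earlier collective
     CALL MPI_REDUCE(sendval, recvval, 1, MPI_INTEGER, MPI_SUM, 0, &
                     MPI_COMM_WORLD, ierr)
  END IF
  ! Never reached once ranks disagree on which collective comes next
  CALL MPI_FINALIZE(ierr)
END PROGRAM collective_mismatch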

Representative backtrace (one rank from each group)

Group A:

Code: Select all

PMPI_Reduce -> pmpi_reduce__ -> m_sum_master_d (mpi.f90:5847) -> m_sum_master_z (mpi.f90:5874)
-> kpar_sync_celtot (wave_mpi.f90:1469) -> subrot::eddiag (subrot.f90:829)
-> elmin_all (electron_all.f90:679) -> electronic_optimization (main.f90:5826)

Group B:

Code: Select all

PMPI_Allreduce -> pmpi_allreduce__ -> m_sumb_d (mpi.f90:2474)
-> charge::soft_charge (charge.f90:146) -> us::set_charge (us.f90:1211)
-> elmin_all (electron_all.f90:743) -> electronic_optimization (main.f90:5826)

Full backtraces of all 32 ranks are attached, along with all calculation files. This is a large system, a 160-atom optimization with ALGO=All, and there is plenty of memory to spare on the node. The hang has also been seen earlier in different, much smaller systems; it happens at random and occurs more frequently at higher KPAR values.

Questions
1. Is the collective ordering between kpar_sync_celtot and soft_charge guaranteed to be consistent across KPAR groups?
2. Is this a known issue, and is KPAR=1 the only expected workaround?

Thank you,
Aditya Putatunda


ahampel
Global Moderator
Posts: 197
Joined: Tue Feb 16, 2016 11:41 am

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#2 Post by ahampel » Fri Mar 06, 2026 11:27 am

Dear Aditya,

thank you for reaching out to us on the official VASP forum. This is the second such report within two weeks (https://vasp.at/forum/viewtopic.php?t=20572), which is quite surprising to us. Release 6.5.1 has been out for quite some time now, and there is no obvious reason why this should come up only now.

Let me explain a little. Your traceback shows that one KPAR group has already signaled that the SCF electronic minimization is finished, i.e. the call at electron_all.F:679:

Code: Select all

        IF (IROT==3 .OR. IROT==4 .OR. IROT==5 .OR. IROT==6 .OR. IROT==7 ) THEN
           IF (IO%IU0>=0) WRITE(IO%IU0,*)'final diagonalization occupied'
           IFLAG=23  ! rotate only in fully occupied many-fold
           CALL EDDIAG(HAMILTONIAN,GRID,LATT_CUR,NONLR_S,NONL_S,W,WDES,SYMM, &
                LMDIM,CDIJ,CQIJ, IFLAG,SV,T_INFO,P,IO%IU0,E%EXHF)

is inside an IF block guarded by INFO%LABORT, the flag that signals the SCF is finished. Hence, in the exact traceback you are showing, one KPAR group decided that the abort criterion was reached, whereas the other went on and updated the charge to continue with the next SCF step. This is exactly what the linked forum topic discusses. We have now fixed this in the code by ensuring that all KPAR groups (in fact, all ranks) agree on the abort criterion. Somehow this problem never occurred before, even though the KPAR feature has been around for a long time; we are not yet sure why it surfaces now.
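Conceptually, the fix is a logical reduction of the abort flag over all ranks before anyone acts on it. In plain MPI terms the idea looks like the following sketch (illustration only; the actual code goes through VASP's own communicator wrappers):

Code: Select all

! Sketch of the idea only -- not the actual VASP code, which uses the
! M_and/M_or wrappers and a derived-type communicator.
SUBROUTINE sync_abort(comm, labort)
  USE mpi
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: comm
  LOGICAL, INTENT(INOUT) :: labort
  INTEGER :: ierr
  ! Logical AND across all ranks: the SCF loop is left only if every
  ! rank agrees, so no KPAR group can run ahead into the next step.
  CALL MPI_ALLREDUCE(MPI_IN_PLACE, labort, 1, MPI_LOGICAL, MPI_LAND, &
                     comm, ierr)
END SUBROUTINE sync_abort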

You mention that you also observed this problem mid-SCF; that would be more troublesome, and our fix would not cover it. Do you have another set of tracebacks that demonstrates that?

We hope that our fix, which will be part of the next release, addresses this issue.

Best,
Alex


aditya_putatunda
Newbie
Posts: 5
Joined: Fri Jul 04, 2025 12:32 pm

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#3 Post by aditya_putatunda » Fri Mar 06, 2026 3:45 pm

Dear Alex,

Thank you for the quick diagnosis and for linking the other report. Good to know this is already fixed for the next release.

One detail that may be relevant: in the run where I captured these backtraces, I had triggered the stop using LSTOP=.TRUE. in STOPCAR. So the LABORT flag was likely set by STOPCAR rather than by natural SCF convergence, but I assume the code path is the same.

Regarding mid-SCF hangs: I have experienced similar-looking hangs in earlier runs (processes alive at ~100% CPU, no output activity), but I did not collect GDB backtraces at the time, so I cannot confirm whether those were also at a convergence/abort boundary or genuinely mid-SCF. If I remember correctly, this may have happened in earlier versions too, but I cannot be sure, as these hangs are quite rare. In any case, I will collect full backtraces if it happens again after your fix is released.

Best regards,
Aditya


leszek_nowakowski
Newbie
Posts: 22
Joined: Fri Mar 15, 2024 10:35 am

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#4 Post by leszek_nowakowski » Sun Mar 15, 2026 7:10 pm

Hello,

I reported a similar problem a year ago, in version 6.4.3 (https://www.vasp.at/forum/viewtopic.php?t=20137).

From the changelog it seems the problem was solved in version 6.6.0. Thanks for the good work!

Cheers,
Leszek


ahampel
Global Moderator
Posts: 197
Joined: Tue Feb 16, 2016 11:41 am

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#5 Post by ahampel » Mon Mar 16, 2026 7:35 am

Dear Leszek, Dear Aditya,

yes, the issue should be resolved; it just made it into the release. If you experience more problems of this sort, let us know. A final note before I close this topic: the fix merely makes sure there is consensus among the ranks on whether the calculation should be aborted. If there is another bug in a specific subroutine that leads to different energy terms in different KPAR groups, we would need to fix that as well. For now, however, we could not find any problem other than pure numerical noise between the groups.

Best,
Alex


leszek_nowakowski
Newbie
Posts: 22
Joined: Fri Mar 15, 2024 10:35 am

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#6 Post by leszek_nowakowski » Thu Mar 19, 2026 4:17 pm

Hello,

With the OpenACC port I get similar hangs, always at the end of the SCF loop. Some of the tasks are stuck in PMPI_Reduce and the rest in cuStreamSynchronize().

Please see the attached GDB stack traces, for both 6.6.0 and 6.5.1. Older versions were not tested, but as far as I remember the problem persists there as well.

Hardware: 4x NVIDIA GH200 in one node
Software: nvhpc-sdk/24.5, CUDA/12.4, OpenMPI/5.0.3

Best regards,
Leszek


ahampel
Global Moderator
Posts: 197
Joined: Tue Feb 16, 2016 11:41 am

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#7 Post by ahampel » Mon Mar 23, 2026 8:56 am

Hi Leszek,

confirmed. I have reopened the issue internally and we are working on a fix. This still affects ALGO=ALL|CONJUGATE|DAMPED; all other algorithms should be fixed. If not, please let me know. Sorry for the inconvenience, and thank you for testing - very much appreciated.

Best,
Alex


ahampel
Global Moderator
Posts: 197
Joined: Tue Feb 16, 2016 11:41 am

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#8 Post by ahampel » Wed Apr 15, 2026 3:32 pm

The fix(es) are now in place on our internal master. If you would like to fix this in your code manually, you just have to add two lines in src/electron_all.F after line 691 and recompile the code:

Code: Select all


      CALL testNumberOfStep(N, NELM, INFO, LABORT_WITHOUT_CONV)

! Ensure all ranks take identical control flow decisions.
! add these two lines
      CALLMPI(M_and(WDES%COMM, INFO%LABORT, 1))
      CALLMPI(M_or(WDES%COMM, LABORT_WITHOUT_CONV, 1))

!      IF (INFO%LABORT.AND. &

This should solve the issue for ALGO=All: M_and ensures the SCF loop is left only if all ranks agree to abort, and M_or ensures that if any rank flags an abort without convergence, all ranks do. There were a few other spots that needed a fix, but this should be sufficient to mitigate your problem.

Best,
Alex


aditya_putatunda
Newbie
Posts: 5
Joined: Fri Jul 04, 2025 12:32 pm

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#9 Post by aditya_putatunda » Fri Apr 24, 2026 8:34 am

Dear Alex,

Following up on the hangs I mentioned earlier: I have now captured a full backtrace of another one. It shows the same 16/16 collective-split signature as before, but the call sites are in electron.F (elmin), not electron_all.F.

Mapped from preprocessed .f90 line numbers back to the .F source:
electron.f90:802 -> electron.F:698
electron.f90:1016 -> electron.F:921

Backtrace (one representative rank per group)

Group A (16 ranks):

Code: Select all

PMPI_Reduce -> m_sum_master_d/z (mpi.F) -> kpar_sync_celtot (wave_mpi.F:1469)
            -> subrot::eddiag -> electron.F:698

Group B (16 ranks):

Code: Select all

PMPI_Allreduce -> m_sumb_d (mpi.F:2474) -> charge::soft_charge (charge.F:146)
               -> us::set_charge (us.F:1211) -> electron.F:921

Full gdb output of all 32 ranks attached.

Why this looks different from the electron_all.F case

Last time, you identified the race as an IROT-based abort wrapper around EDDIAG at electron_all.F:679. In electron.F, there is no such wrapper:

Code: Select all

$ grep -n IROT src/electron.F
160:      INTEGER N,ISP,ICONJU,IROT,ICEL,I,II,IRDMAA, &

Only the variable declaration at line 160 — no IF(IROT==3 .OR. ...) THEN block. So whatever mechanism allows one KPAR group to exit EDDIAG and advance to SOFT_CHARGE before the other must be different here, even though the end-state deadlock (inter-KPAR Reduce inside EDDIAG waiting on a group already in the global Allreduce inside SOFT_CHARGE) is identical.

Context

Build : VASP 6.5.1, AOCC 5.0.0, OpenMPI 5.0.8, AOCL
Hardware : 2x AMD EPYC 9224, 32 MPI ranks, OMP_NUM_THREADS=2
INCAR : KPAR=4, NPAR=4, ISIF=3, IBRION=2, ISPIN=2, ALGO=All
LDAU=.TRUE. (GGA+U on Eu-f and Ti-d), EDIFF=1E-7
System : EuTiO3, cell relaxation under ZBRENT

The hang triggered at DAV:1 of a new ionic step, immediately after ZBRENT interpolation and bond-charge prediction. The d eps on that step was -1.48E-8, sitting just below EDIFF=1E-7, which is plausibly a boundary where different KPAR groups disagree locally on convergence.
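To make this suspicion concrete, here is a trivial standalone snippet; the per-group values are purely illustrative (I cannot read the actual group-local energies), but they show how noise around the threshold can flip the convergence verdict in one group and not the other:

Code: Select all

! Illustrative only: hypothetical per-group energy changes that differ
! by reduction-order noise around the EDIFF threshold.
PROGRAM ediff_boundary
  IMPLICIT NONE
  REAL(8), PARAMETER :: ediff = 1.0E-7_8
  REAL(8) :: de_group1, de_group2
  de_group1 = -9.80E-8_8   ! |dE| < EDIFF -> this group wants to abort
  de_group2 = -1.02E-7_8   ! |dE| > EDIFF -> this group continues
  PRINT *, 'group 1 would abort: ', ABS(de_group1) < ediff
  PRINT *, 'group 2 would abort: ', ABS(de_group2) < ediff
  ! If each group acts on its local verdict, their next MPI collectives
  ! differ (final EDDIAG vs. SOFT_CHARGE) and the run deadlocks.
END PROGRAM ediff_boundary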

Question

Does your 6.6 fix sync the abort/convergence decision across all KPAR groups globally (which would cover both drivers), or is it scoped to the electron_all.F call site only? If the latter, electron.F likely needs the same treatment.

Best,
Aditya


ahampel
Global Moderator
Posts: 197
Joined: Tue Feb 16, 2016 11:41 am

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#10 Post by ahampel » Fri Apr 24, 2026 12:19 pm

Dear Aditya,

This is with VASP 6.5.1, correct? There the break condition, i.e. info%lAbort, is set in the routine testBreakCondition in electron_common.F. In the 6.6.0 release the communication of the logical lAbort across KPAR groups is fixed/implemented. Since you have this very concrete example, can you try to add:

Code: Select all

...
        call testNumberOfStep(numStep, info%nElm, info, abortWithoutConv)

        ! add this
        CALLMPI(M_and(comm, info%lAbort, 1))

        ! and this
        CALLMPI(M_or(comm, abortWithoutConv, 1))
    end subroutine testBreakCondition

I hope the wrappers for logicals work the same in 6.5.1. With this minimal change the hang/mismatch between KPAR groups you see should disappear. Alternatively, you can try 6.6.0, although that binary will likely produce slightly different numbers and the hang should not occur there anyway.

Best,
Alex


aditya_putatunda
Newbie
Posts: 5
Joined: Fri Jul 04, 2025 12:32 pm

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#11 Post by aditya_putatunda » Sat Apr 25, 2026 9:59 am

Dear Alex,

Thank you for your response. Yes, this was indeed with 6.5.1, as I do not have immediate access to the latest version.

I tried both of the edits mentioned above, in electron_common.F and electron_all.F, but unfortunately the build fails in both cases, at different stages, with the following error messages:

Code: Select all

F90-S-0038-Symbol, comm, has not been explicitly declared (electron_common.F)
  0 inform,   0 warnings,   1 severes, 0 fatal for testbreakcondition
make[2]: *** [makefile:195: electron_common.o] Error 1

--------------------

ld.lld: error: undefined symbol: m_or_
>>> referenced by electron_all.f90:664
>>>               electron_all.o:(elmin_all_)
flang: error: linker command failed with exit code 1 (use -v to see invocation)

Do you think there is a quick fix I could apply for now, so that I can finish the optimization I was running?

Best,
Aditya


ahampel
Global Moderator
Posts: 197
Joined: Tue Feb 16, 2016 11:41 am

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#12 Post by ahampel » Sun Apr 26, 2026 8:30 am

Ah, sorry, this is my fault. Of course it was not sufficient to just insert these calls; I did not think this through. One has to make a few more surgical changes to pass comm to testBreakCondition and to make M_or available. Here is a full patch file, which you can apply via

Code: Select all

patch -p1 < kpar_scf_fix_plain.patch

in the root directory of the extracted VASP 6.5.1 tarball. For me it applies cleanly against 6.5.1 and runs fine. Please try it.
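For reference, the shape of those changes, paraphrased rather than quoted from the patch: testBreakCondition gains a communicator argument (removing the undeclared-comm compile error), and mpi.F gains an M_or companion to M_and (resolving the undefined m_or_ symbol at link time). Functionally, M_or is just a logical OR reduction over the communicator, roughly:

Code: Select all

! Rough equivalent of the added M_or wrapper (paraphrased; the patch
! uses VASP's communicator derived type, not a raw MPI handle).
SUBROUTINE m_or_sketch(comm, lvec, n)
  USE mpi
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: comm, n
  LOGICAL, INTENT(INOUT) :: lvec(n)
  INTEGER :: ierr
  ! Any rank voting .TRUE. makes the flag .TRUE. on every rank
  CALL MPI_ALLREDUCE(MPI_IN_PLACE, lvec, n, MPI_LOGICAL, MPI_LOR, &
                     comm, ierr)
END SUBROUTINE m_or_sketch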

Best,
Alex

kpar_scf_fix_plain.tar.gz

aditya_putatunda
Newbie
Posts: 5
Joined: Fri Jul 04, 2025 12:32 pm

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#13 Post by aditya_putatunda » Wed Apr 29, 2026 6:37 am

Dear Alex,

Thank you for providing the patch. There have been no similar hangs in the relaxation since I applied it.

Best,
Aditya


ahampel
Global Moderator
Posts: 197
Joined: Tue Feb 16, 2016 11:41 am

Re: KPAR deadlock in electron_all.f90 — collective ordering mismatch between KPAR groups

#14 Post by ahampel » Wed Apr 29, 2026 9:00 am

Dear Aditya,

very nice - thank you for reporting back. Let me know if you spot another similar problem.

Best,
Alex

