KPAR-related bug in electrostatic_energy()
The global summation of the energy in electrostatic_energy (pot_electrostat.F, line 217) is not quite right. It only sums within each k-point group (COMM_KIN), which means that different groups can end up with _slightly_ different values at the level of roundoff: floating-point addition is not associative, so MPI reductions that combine the terms in a different order need not give bitwise-identical results. The differences are tiny, of course, but this energy is eventually passed back to ELMIN as E%DENC and contributes to TOTEN (elmin(), electron.F line 640). This value is then used to compute DESUM, which in turn is used to detect convergence in testBreakCondition(). If the value happens to be just above the threshold in some of the k-point groups and just below it in the others, the MPI tasks don't all agree on whether the loop should be exited, and they take different paths through the IF in electron.F line 777. As a result, the whole code hangs because not all the MPI tasks make the same collective communication calls (some get stuck at the call to KPAR_SYNC_CELTOT, electron.F line 1016; I'm not sure exactly where the others get stuck).
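To illustrate the mechanism, here is a toy Python sketch (not VASP code; the numbers and the names `sum_in_order` and `should_exit` are made up for illustration). Summing the same three contributions in two different orders gives results that differ in the last bits, and a tolerance that happens to fall right at that difference makes the two "groups" reach opposite decisions:

```python
def sum_in_order(values):
    """Accumulate left to right, as one particular reduction order would."""
    total = 0.0
    for v in values:
        total += v
    return total


def should_exit(energy_change, tolerance):
    """Stand-in for a convergence test: exit once the change is below tolerance."""
    return abs(energy_change) < tolerance


# Hypothetical contributions; both "groups" hold exactly the same numbers.
contributions = [0.1, 0.2, 0.3]

group_a = sum_in_order(contributions)                  # (0.1 + 0.2) + 0.3
group_b = sum_in_order(list(reversed(contributions)))  # (0.3 + 0.2) + 0.1

# Contrived tolerance that happens to sit right at the larger of the two sums.
tolerance = max(group_a, group_b)

exit_a = should_exit(group_a, tolerance)
exit_b = should_exit(group_b, tolerance)

print(group_a == group_b)  # False: the two sums differ in the last bits
print(exit_a, exit_b)      # the two groups disagree on whether to exit
```

In a real run the tolerance is fixed and the energies drift toward it, so the coincidence is rare, but once it occurs the tasks diverge in exactly this way.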
Unfortunately I can't really provide a reproducing example of the hang, because it depends on tiny roundoff differences that are sensitive to everything, including machine architecture, MPI library, number of cores, etc. I only discovered this after jobs started hanging on a couple of machines, and it wasn't even the same job on the different machines. What you can probably reproduce, if it's of interest, is that when KPAR is active, E%DENC and TOTEN aren't identical on all the MPI tasks (they differ at about machine precision). Whether that triggers a disagreement in testBreakCondition() is the part that's very sensitive, because it only happens when the difference straddles the convergence tolerance.
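When checking for this, note that output at the usual width can hide the discrepancy. As an illustration (Python, with made-up values standing in for the energy on two tasks), the two numbers print identically to ten decimals but have different bit patterns:

```python
# Hypothetical energies on two MPI tasks that differ only in the last bit.
energy_task0 = 0.6
energy_task1 = 0.6000000000000001

# A typical fixed-width format rounds the difference away.
short0 = f"{energy_task0:.10f}"
short1 = f"{energy_task1:.10f}"
print(short0 == short1)

# float.hex() shows the exact binary value, so the mismatch becomes visible.
hex0 = energy_task0.hex()
hex1 = energy_task1.hex()
print(hex0, hex1)
```

In Fortran, printing with about 17 significant digits (for example an ES24.16 edit descriptor) should serve the same purpose for double precision values.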
I think the following patch fixes this issue, although there may be a cleaner way. It simply averages the energy over GRIDC%COMM_KINTER, which ensures that all the MPI tasks agree on the value and therefore choose to exit the SCF loop at the same time. I hope that a fix for this issue can be incorporated into the next VASP release.
--- ../../../../canonical/vasp.6.5.1/src/pot_electrostat.F 2025-08-21 16:24:41.000000000 -0500
+++ pot_electrostat.F 2026-02-20 13:05:56.000000000 -0600
@@ -215,6 +215,10 @@
!
energy = -energy / 2._q
CALLMPI( M_sum_d(gridc%comm, energy, 1))
+ ! make sure that values are identical across all KPAR
+ ! groups, so convergence decision will be consistent
+ CALLMPI( M_sum_d(gridc%comm_kinter, energy, 1))
+ energy = energy / gridc%comm_kinter%ncpu
end subroutine
subroutine iteration_information(iterating_over_ions, &
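The effect of the patch can be mimicked outside VASP with a toy Python model. This is only a sketch, under the assumption that the all-reduce over the inter-group communicator returns the same sum on every participating task. `allreduce_sum` is a hypothetical stand-in for that collective, and the energies are made up:

```python
def allreduce_sum(values_per_task):
    """Toy all-reduce: one sum is formed and every task receives that same result."""
    total = 0.0
    for v in values_per_task:
        total += v
    return [total for _ in values_per_task]


# Hypothetical energies held by corresponding tasks in two k-point groups.
energy_per_group = [0.6000000000000001, 0.6]
n_groups = len(energy_per_group)

# Mirror of the two added lines: sum across groups, then divide by the group count.
summed = allreduce_sum(energy_per_group)
averaged = [s / n_groups for s in summed]

print(len(set(energy_per_group)))  # 2 distinct values before the averaging
print(len(set(averaged)))          # 1 value afterwards, shared by every group
```

Broadcasting the value from one group would presumably achieve the same agreement. The mean is used here only because it mirrors the patch.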