KPAR-related bug in electrostatic_energy()

Posted: Fri Feb 20, 2026 7:12 pm
by bernstei

The global summation of the energy in electrostatic_energy (pot_electrostat.F, line 217) is not quite right. It only sums within each k-point group (COMM_KIN), so different groups can end up with _slightly_ different values, at the level of roundoff: floating point addition is not associative, so MPI sums are not guaranteed to be deterministic. The differences are tiny, of course, but this energy is eventually passed back to ELMIN as E%DENC and contributes to TOTEN (elmin(), electron.F line 640). TOTEN is then used to compute DESUM, which in turn is used to detect convergence in testBreakCondition(). If the value happens to be just above the threshold in some of the k-groups and just below it in the others, the MPI tasks don't all agree on whether the loop should be exited, and they take different paths through the IF in electron.F line 777. As a result, the whole code hangs because not all the MPI tasks make the same collective communication calls (some get stuck at the call to KPAR_SYNC_CELTOT, electron.F line 1016; I'm not sure exactly where the others get stuck).

Unfortunately I can't really provide a reproducing example of the hang, because this depends on tiny roundoff issues that depend on everything, including machine architecture, MPI library, number of cores, etc. I only discovered this after jobs started hanging on a couple of machines, but it wasn't even the same job on the different machines. What you can probably reproduce, if it's of interest, is that when KPAR is active E%DENC and TOTEN aren't identical on all the MPI tasks (they differ at about machine precision). Whether that happens to trigger the difference in testBreakCondition is the part that's very sensitive, because it only happens if the difference happens to line up with the convergence tolerance.

I think the following patch can fix this issue, although there may be a cleaner way. It just forces a mean over GRIDC%COMM_KINTER, which ensures that all the MPI tasks agree on the value, and therefore choose to exit the SCF loop at the same time. I hope that a fix for this issue can be incorporated into the next VASP release.

--- ../../../../canonical/vasp.6.5.1/src/pot_electrostat.F 2025-08-21 16:24:41.000000000 -0500
+++ pot_electrostat.F 2026-02-20 13:05:56.000000000 -0600
@@ -215,6 +215,10 @@
!
energy = -energy / 2._q
CALLMPI( M_sum_d(gridc%comm, energy, 1))
+ ! make sure that values are identical across all KPAR
+ ! groups, so convergence decision will be consistent
+ CALLMPI( M_sum_d(gridc%comm_kinter, energy, 1))
+ energy = energy / gridc%comm_kinter%ncpu
end subroutine

subroutine iteration_information(iterating_over_ions, &


Re: KPAR-related bug in electrostatic_energy()

Posted: Tue Feb 24, 2026 2:37 pm
by ahampel

Dear @bernstei,

thank you for bringing this to our attention. This is indeed a problematic behavior. I created an internal bug report #1524 and we will work on a fix and include it as soon as possible in a release version. I do believe that your fix will produce the correct behavior. So go ahead and use this for now. Note that in src/ebs.F similar problematic behavior is present. I will report back here once a fix is implemented.

Best regards,
Alex


Re: KPAR-related bug in electrostatic_energy()

Posted: Tue Feb 24, 2026 2:49 pm
by bernstei

Thanks.


Re: KPAR-related bug in electrostatic_energy()

Posted: Wed Feb 25, 2026 12:42 pm
by ahampel

Hi,
I discussed this a little internally and we are a bit surprised that the differences in each kpar group can be so large that one group can abort already. In the routines you mention all groups should obtain the same input charge density from which a potential is created. If the output differs so much we wonder if maybe there is a problem in the charge density input for each group (those should be the same upon entering). Do you have any numbers from a run that crashed? We wonder if it is just algebra precision going from charge density to potential causing this or if this is a problem of the input.

Now we could of course average as you described again over comm_kinter, but we want to make sure we are fixing the right spot. If the charge density input in each k-group is different we should fix this somewhere else.

Best,
Alex


Re: KPAR-related bug in electrostatic_energy()

Posted: Wed Feb 25, 2026 1:56 pm
by bernstei

"we are a bit surprised that the differences in each kpar group can be so large that one group can abort already"

Keep in mind that _any_ difference, however small, in the energy can lead to this behavior. Here is a toy example: suppose your convergence threshold is 0.00001. Now, suppose that E%DENC is 0.245000000001 for one subset of MPI tasks and 0.245000000002 for the other, so the difference is only of order 1e-12 (or however many digits I put in). This value contributes to the sum that is used to calculate TOTEN, and the difference between this TOTEN and the previous one is compared to the SCF convergence tolerance. Suppose the difference on some MPI tasks is .0000100000005; those tasks are above the tolerance, so they do not think they are converged. On others the difference is .0000099999995, and those think they _are_ converged.
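The toy example above can be written out as a minimal sketch. This is plain Python, not VASP code: the name `is_converged`, the threshold variable and the form of the comparison (absolute change below the tolerance) are assumptions made for illustration only.

```python
# Toy illustration only: plain Python, not VASP code.
# The threshold and both energy changes are the made-up numbers from the post.
TOLERANCE = 0.00001  # hypothetical SCF convergence threshold


def is_converged(energy_change, tolerance=TOLERANCE):
    """Return True when the absolute energy change is below the tolerance."""
    return abs(energy_change) < tolerance


# Energy change as seen by two subsets of MPI tasks; they differ by 1e-12.
change_on_some_tasks = 0.0000100000005
change_on_other_tasks = 0.0000099999995

decisions = [
    is_converged(change_on_some_tasks),   # just above the threshold
    is_converged(change_on_other_tasks),  # just below the threshold
]
print(decisions)  # [False, True]
```

The two subsets reach opposite conclusions even though their inputs agree to about twelve digits.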

And it doesn't matter how small the difference is: if it feeds into a floating point comparison, it can lead to different outcomes (greater than vs. less than), and therefore different conclusions about whether it's converged. And you can't fix it by rounding, or truncating, or anything like that. Any such operation will have a discontinuity, and any difference in the input could in principle end up on different sides of the decision boundary.

This kind of difference (i.e. at machine precision) can be an inherent part of the way MPI does reduction (reduce/allreduce) summation, depending on the algorithm. Specifically, when the _order_ in which the numbers from the different MPI tasks are added isn't deterministic, which it often isn't in common MPI implementations, the result can vary (at least at the level of machine precision) because floating point addition isn't associative. It can be even worse if the sum is ill conditioned, i.e. a small result from summing large but opposite sign contributions, but it can pretty much always happen a little.
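As a minimal sketch of the order dependence, here is the same effect in plain Python with IEEE double precision. It only changes the summation order by hand, it does not involve MPI, and the numbers and the helper name `ordered_sum` are made up for illustration.

```python
# Illustration only: summation order changes the result in double precision.

def ordered_sum(values):
    """Add the values strictly left to right, as one fixed reduction order."""
    total = 0.0
    for value in values:
        total += value
    return total


# Well-conditioned case: the two groupings differ in the last digit.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)

# Ill-conditioned case: large contributions of opposite sign cancel,
# so the order decides whether the small terms survive.
sum_order_a = ordered_sum([1.0e16, -1.0e16, 1.0, 1.0])
sum_order_b = ordered_sum([1.0e16, 1.0, -1.0e16, 1.0])

print(grouped_left == grouped_right)  # False
print(sum_order_a, sum_order_b)       # 2.0 1.0
```

In the well-conditioned case the discrepancy is one unit in the last place, which is the size of difference relevant to the convergence test.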


Re: KPAR-related bug in electrostatic_energy()

Posted: Wed Mar 04, 2026 10:20 am
by ahampel

Hi,

we have now decided to make the break condition testing itself more robust, rather than communicating the computed energies at each SCF step. I think this is a viable fix for the problem you are seeing, i.e. making sure in subroutine testBreakCondition() that the SCF cycle is aborted only if all ranks agree on the abort, by doing an MPI reduction on the abort logical between KPAR groups. This is a bit more surgical and requires less MPI communication. However, in your specific case we would still be interested in checking whether the input to pot_electrostat is correct. The way the break of the SCF cycle is determined has not changed much in the past, and this is the first time this has come up. Hence, we would like to make sure that this is really the problem here and that there is not another bug hiding. If you don't mind, could you provide a set of input files? I understand that I most likely will not trigger the problem, but I can insert prints to check the differences across different KPAR groups.
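The agreement check described above could look roughly like the following sketch. It is plain Python, not the actual VASP change: a list stands in for the per-group values, `all()` stands in for a logical-AND reduction between KPAR groups, and the function names are hypothetical.

```python
# Sketch only: a list stands in for the KPAR groups and all() stands in
# for a logical-AND reduction between them. Not the actual VASP code.

def local_abort_flags(energy_changes, tolerance):
    """Each group's own decision, based on its own energy change."""
    return [abs(change) < tolerance for change in energy_changes]


def agreed_abort(flags):
    """Abort only if every group wants to abort (logical AND)."""
    return all(flags)


# Two groups whose energy changes straddle the tolerance by roundoff.
flags = local_abort_flags([0.0000100000005, 0.0000099999995], 0.00001)
abort = agreed_abort(flags)  # the same value is handed to every group

print(flags, abort)  # [False, True] False
```

With a logical AND, a disagreement means that every group runs one more SCF step, so all tasks keep making the same collective calls.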

Best,
Alex