KPAR, BRMIX and Wiki entry

Problems running VASP: crashes, internal errors, "wrong" results.


Moderators: Global Moderator, Moderator

MBaeker
Newbie
Posts: 25
Joined: Tue Jan 07, 2014 11:22 am

KPAR, BRMIX and Wiki entry

#1 Post by MBaeker » Mon Sep 14, 2020 11:23 am

I just found out that BRMIX errors may be due to an incorrectly chosen KPAR.

I ran a 108-atom cell, using 12 processors with 96 cores each and KPAR=12.
The run almost converged to the correct result, then suddenly the energy jumped to a different value. A few electronic iterations later, the run ended with BRMIX errors.

The simulation had 20 KPOINTS (NKPTS in the OUTCAR). Restarting it with KPAR=10 on 10 processors resulted in a well-converging run.

I think that in general, KPAR should be a divisor of the number of KPOINTS (NKPTS); is this correct?

If yes, I think it would be very helpful to people to add this information to the Wiki.

merzuk.kaltak
Administrator
Posts: 282
Joined: Mon Sep 24, 2018 9:39 am

Re: KPAR, BRMIX and Wiki entry

#2 Post by merzuk.kaltak » Fri Sep 18, 2020 8:40 am

Hello,

thank you for reporting this. Please upload a bug report (POSCAR, POTCAR, INCAR, KPOINTS, OUTCAR, stdout).
This would help us and the community to understand the problem.

Regarding your question: KPAR should be a divisor of the number of MPI ranks and ideally also a divisor of the number of k-points.
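
The rule above can be checked before submitting a job. The following is a minimal sketch, not part of VASP: the helper names `check_kpar` and `suggest_kpar` are hypothetical, and the rank counts (12 x 96 = 1152 and 10 x 96 = 960) are assumptions about the two runs described in this thread. The "idle slots" figure is a simplified picture of uneven k-point distribution, not a statement about how VASP schedules work internally.

```python
def check_kpar(kpar, n_ranks, n_kpoints):
    """Report whether kpar divides the MPI rank count and the k-point count."""
    # The busiest k-point group handles ceil(n_kpoints / kpar) k-points.
    rounds = -(-n_kpoints // kpar)
    return {
        "divides_ranks": n_ranks % kpar == 0,
        "divides_kpoints": n_kpoints % kpar == 0,
        "kpoints_on_busiest_group": rounds,
        # Group slots left without a k-point in this simplified model.
        "idle_slots": rounds * kpar - n_kpoints,
    }


def suggest_kpar(n_ranks, n_kpoints):
    """List every KPAR that divides both the rank count and the k-point count."""
    return [k for k in range(1, n_kpoints + 1)
            if n_ranks % k == 0 and n_kpoints % k == 0]


# Assumed layouts: 1152 ranks with KPAR=12, and 960 ranks with KPAR=10,
# both with NKPTS = 20 as reported in the first post.
print(check_kpar(12, 1152, 20))
print(check_kpar(10, 960, 20))
print(suggest_kpar(1152, 20))
print(suggest_kpar(960, 20))
```

Under these assumptions, KPAR=12 divides the rank count but not the 20 k-points, while KPAR=10 divides both.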

MBaeker

Re: KPAR, BRMIX and Wiki entry

#3 Post by MBaeker » Fri Sep 18, 2020 9:15 am

O.k., I've attached a tar-file containing the files needed to run the job.

henrique_miranda
Global Moderator
Posts: 483
Joined: Mon Nov 04, 2019 12:41 pm

Re: KPAR, BRMIX and Wiki entry

#4 Post by henrique_miranda » Thu Oct 01, 2020 9:59 am

We appreciate your bug report; however, it is a bit difficult for us to test such a large calculation.
Do I understand correctly that you are using 12 nodes with 96 cores each, so 1152 cores in total?

Did you start both calculations from scratch?
If not, that could explain why one of the calculations crashes and the other one does not.
Could you also post the OUTCAR and OSZICAR for both runs?

In general, we try to reproduce the issues reported by users, but in this case it's complicated due to the size of the system.
Can you reproduce the issue on a smaller system?

PS: You should not post the POTCAR files themselves but only the first line of each POTCAR. I changed your post.

MBaeker

Re: KPAR, BRMIX and Wiki entry

#5 Post by MBaeker » Thu Oct 01, 2020 10:15 am

Hi,
sorry, but so far these problems have only happened on 108-atom cells, not on smaller systems.

I attach the OUTCAR/OSZICAR for the working and non-working run.
All the best,
Martin.

PS: Sorry for including the POTCARs, I forgot that they are not to be shared.

henrique_miranda

Re: KPAR, BRMIX and Wiki entry

#6 Post by henrique_miranda » Thu Oct 01, 2020 11:30 am

Ok, comparing the two files, it seems that the only difference is indeed KPAR and the number of cores.
(We always need to check; some users might inadvertently start from a WAVECAR, which will naturally produce different results.)

Once the energy is nearly converged, in the KPAR=12 case the calculation starts to diverge.
The energies from the first 40 iterations look the same.
This might be a bug or due to floating-point arithmetic. Let's try to exclude the latter.
Floating-point arithmetic is strange: sums are non-associative.
Changing the parallelization scheme alters the order of the sums which in turn leads to slightly different numeric results.
Slightly different numeric results might make the difference between converging and not converging for the RMM-DIIS algorithm (ALGO=Fast).
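
The order dependence of sums described above can be seen in a few lines of Python. This is only a toy illustration with made-up numbers, not a model of VASP's actual reductions: the helpers `naive_sum` and `chunked_sum` are hypothetical, and the "groups" simply stand in for different ways of splitting a sum across processes. An explicit loop is used instead of the built-in `sum`, which may apply compensated summation in recent Python versions.

```python
def naive_sum(values):
    """Add values left to right with plain float addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_groups):
    """Sum contiguous chunks separately, then add the partial sums."""
    size = -(-len(values) // n_groups)  # ceiling division
    partials = [naive_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return naive_sum(partials)


# Regrouping three terms already changes the last bit of the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)

# Toy data with large cancelling terms; the exact sum is 2.0.
values = [1e20, 1.0, -1e20, 1.0]
for n_groups in (1, 2):
    print(n_groups, chunked_sum(values, n_groups))
```

With this toy data, one group yields 1.0 and two groups yield 0.0, while the exact sum is 2.0. Real differences between parallelization schemes are far smaller, but they arise through the same mechanism.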

Could you run these exact same calculations only changing to ALGO=Normal?
The k-point parallelization is the same for ALGO=Normal and ALGO=Fast. If the issue is gone with ALGO=Normal, then what might be happening
is that, once the states are nearly converged, the calculation starts to diverge due to the RMM-DIIS algorithm.
This might happen due to slight numerical differences arising due to different parallelization configurations.
In this case, I would not call it a bug, even though it really looks like one :)

PS: The convergence tolerance EDIFF is very strict for such a large system.
Are you sure you need such stringent criteria?
