MPI error at large NBANDS

Problems running VASP: crashes, internal errors, "wrong" results.

yueguang_shi
Newbie
Posts: 4
Joined: Sun Nov 17, 2019 10:45 pm

MPI error at large NBANDS

#1 Post by yueguang_shi » Sun Mar 29, 2020 11:00 pm

Hello,

I have been using VASP 5.4.4 for a while; it was compiled on our university cluster with OpenMPI 2.1.2, and the cluster runs the SGE job management system.

I started to notice that VASP reliably crashes for certain combinations of NPAR, NCORE and NBANDS. There seems to be a maximum allowed NBANDS for each combination of NPAR and NCORE. For example, with NCORE=4 and NPAR=6 the highest working NBANDS is 96: if I set NBANDS to any value of 96 or less, the calculation runs smoothly, but if I set NBANDS to any value greater than 96, the calculation aborts with an MPI error. This happens very consistently and can be reproduced 100% of the time. I will show the error messages at the end of the post.
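For concreteness, an INCAR fragment like the following triggers the crash, while the same input with NBANDS <= 96 runs through (NBANDS = 97 is only an example of a value above the threshold; the remaining tags are as in my attached inputs):

NCORE  = 4    ! 24 cores in total, so NPAR = 24/4 = 6
NBANDS = 97   ! any value > 96 aborts with the MPI error; <= 96 is fine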

Because NPAR and NCORE should multiply to the total number of cores, there is only a finite set of NPAR/NCORE combinations for a given core count. For every combination I tested there is a highest allowed NBANDS, beyond which this bug occurs reliably.

I tested and concluded that the k-mesh, KPAR, the use of the SGE system, ALGO (Davidson vs. steepest descent), PREC, NSIM, ENCUT, the number of atoms and the memory size are NOT part of the problem.

This bug is usually not a problem for small unit cells because of the low number of bands needed. However, it is stopping me from performing some supercell calculations, since more atoms -> more electrons -> more bands needed.

Since this bug, which is obvious on my end, does not seem to have been reported here before, I suspect it has to do with how VASP was compiled on our cluster with OpenMPI. I consulted our cluster IT support, but they had no idea either.

Thanks,
Yueguang Shi


*This is the second time I am making this post; the first attempt does not seem to have gone through, as I forgot to attach the required zip file. Apologies if the first post did go through and this turns out to be a duplicate.


Below is a set of sample outputs from the MPI bug; note that there is some variation in the exact error messages I get.


*********************************************************************************************************************************************

stdout and stderr of SGE system:

[machine:56705] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[machine:56705] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

*********************************************************************************************************************************************

end of VASP stdout:

entering main loop
N E dE d eps ncg rms rms(c)
[machine:56715] *** An error occurred in MPI_Bcast
[machine:56715] *** reported by process [995426305,47163035877381]
[machine:56715] *** on communicator MPI COMMUNICATOR 14 SPLIT FROM 12
[machine:56715] *** MPI_ERR_TRUNCATE: message truncated
[machine:56715] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[machine:56715] *** and potentially your MPI job)

*********************************************************************************************************************************************

end of VASP OUTCAR:

--------------------------------------- Iteration 1( 1) ---------------------------------------


POTLOK: cpu time 0.9113: real time 0.9166
SETDIJ: cpu time 0.0106: real time 0.0106

*********************************************************************************************************************************************

*The files I attached as bug_test.zip are disguised versions of my real calculations/inputs, but they should still show the essence of the MPI bug. I have been working on these calculations for a while and they definitely run correctly as long as the MPI error does not occur.

merzuk.kaltak
Administrator
Posts: 277
Joined: Mon Sep 24, 2018 9:39 am

Re: MPI error at large NBANDS

#2 Post by merzuk.kaltak » Mon Mar 30, 2020 8:58 am

Hello,

Setting NCORE or NPAR imposes certain restrictions on NBANDS.
You are running your job on a 24-core node, which most probably contains two 12-core sockets.
So when you set NCORE = 4, VASP sets NPAR = 6 internally. The reason for this setting is described here.
Due to the total number of electrons NELECT in your system, the minimum number of bands is therefore 96.
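In numbers: NPAR = total cores / NCORE = 24 / 4 = 6, and the 96 bands required by NELECT correspond to exactly 96 / 6 = 16 bands per band group.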

The reason the program terminates is that your settings NBANDS = 102 and NCORE = 4 on a 24-core node do not make sense.
To fix this, you have to change either NBANDS (for instance to 120) or NCORE.

yueguang_shi
Newbie
Posts: 4
Joined: Sun Nov 17, 2019 10:45 pm

Re: MPI error at large NBANDS

#3 Post by yueguang_shi » Tue Mar 31, 2020 5:02 am

Thanks for your response.

I fully understand that when NCORE=4, VASP sets NPAR=6. I did a lot of tests with different core counts and different combinations of NCORE and NPAR, always following the rule NCORE*NPAR = total core count.

However, I don't quite understand the statement that "the reason why the program terminates is because your settings NBANDS = 102; NCORE=4 for a 24 core node do not make sense." How exactly do NCORE=4 and NPAR=6 not make sense? And what does NBANDS have to do with it? NBANDS = 102 is an integer multiple of NPAR = 6.

As for your suggestion to increase NBANDS to, for instance, 120, or to change NCORE: I just did a run with NBANDS=120 (all else the same) to be sure, and indeed the same MPI bug occurred again. As I said earlier, I also previously tested many combinations of NCORE and NPAR, and there always seems to be a maximum allowed NBANDS beyond which the MPI error occurs; in other words, the problem is definitely not specific to NCORE=4 and NPAR=6.

Thanks,
Yueguang Shi

merzuk.kaltak
Administrator
Posts: 277
Joined: Mon Sep 24, 2018 9:39 am

Re: MPI error at large NBANDS

#4 Post by merzuk.kaltak » Tue Mar 31, 2020 12:52 pm

You are right, VASP accepts your input (even though it is not the optimal choice for your setup).
In fact, version 5.4.4 finishes the job successfully (see proof.zip).
You probably have to talk to your administrator or provide additional information about the software packages used to compile VASP.
This information should include the libraries used, their versions and the corresponding makefile.include (see the posting guidelines of the bug forum).

yueguang_shi
Newbie
Posts: 4
Joined: Sun Nov 17, 2019 10:45 pm

Re: MPI error at large NBANDS

#5 Post by yueguang_shi » Tue Mar 31, 2020 10:54 pm

Thanks for your response. It's good to hear that the problem probably has to do with how our VASP was compiled. I emailed our cluster administrator and his reply is below:

VASP was built with OpenMPI 2.1.2 and Intel Parallel Studio 2017.4, with the included MKL. It looks like the only changes to the distributed makefile were to use the OpenMPI compiler wrappers rather than the Intel ones and to add support for wannier90. The wannier90 version is 2.1.0, also compiled with Intel Parallel Studio 2017.4 and OpenMPI 2.1.2.

Thanks,
Yueguang Shi

merzuk.kaltak
Administrator
Posts: 277
Joined: Mon Sep 24, 2018 9:39 am

Re: MPI error at large NBANDS

#6 Post by merzuk.kaltak » Wed Apr 01, 2020 11:09 am

Most probably, the reason is an incompatibility of OpenMPI-2.1.2 with the BLACS wrapper of the MKL ScaLAPACK.
MKL 2017 officially supports OpenMPI-1.8.x, see here for instance.
A solution might be to compile ScaLAPACK with OpenMPI-2.1.2 and link to the MKL wrappers of FFTW, BLAS and LAPACK.
Intel provides a useful link line advisor.
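To illustrate, the linking change in makefile.include might look roughly like the fragment below. This is only a sketch assuming a self-built ScaLAPACK; the install path is hypothetical and the exact MKL link line should be taken from the link line advisor:

# ScaLAPACK compiled against OpenMPI-2.1.2 (hypothetical install path)
SCALAPACK = -L/opt/scalapack/lib -lscalapack
# MKL BLACS wrapper no longer needed
BLACS     =
# BLAS, LAPACK and the FFTW interface are still taken from MKL
LLIBS     = $(SCALAPACK) $(LAPACK) $(BLAS)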

yueguang_shi
Newbie
Posts: 4
Joined: Sun Nov 17, 2019 10:45 pm

Re: MPI error at large NBANDS

#7 Post by yueguang_shi » Tue Apr 07, 2020 2:12 am

Thanks for the help. I was able to coordinate with the cluster admin, and the problem is fixed after a recompilation. I have now tried several different inputs that previously failed, and they are all working.

As for the fix we performed, quoting the cluster admin: "I wound up building netlib-scalapack, which can actually use the BLAS and LAPACK libraries from MKL. It turns out that netlib-scalapack was recently updated in November 2019, after almost 6 years of no updates."

Thanks again,
Yueguang Shi
