Parallel Wannier Projections

#1 Post by **jbackman** » Fri Feb 26, 2021 2:02 pm

Dear VASP developers,

I understand that this might be out of the scope of the forum, since it has been asked before:
https://www.vasp.at/forum/viewtopic.php?f=4&t=17273

However, with the latest release 6.2 including some updates to the Wannier90 interface I wanted to check if the Wannier projections have been implemented in parallel in the latest release, or if there are any plans on doing this?

Best,
Jonathan

#2 Post by **merzuk.kaltak** » Mon Mar 01, 2021 9:19 am

Dear Jonathan,

VASP calls the wannier90 library in serial mode, despite the fact that w90 as a stand-alone software package is parallelized.
To the best of my knowledge there is no general purpose parallel interface to the w90-library.

Best,
Merzuk

#3 Post by **jbackman** » Wed Apr 14, 2021 9:05 pm

Dear Merzuk,

When using VASP to interface to wannier90, the overlap calculation (wannier90.mmn) seems to be implemented in parallel, since its run time scales with the number of cores used when running VASP with the LWANNIER90 = .TRUE. flag. However, this is not the case for the projections (wannier90.amn). So it looks like only part of the code is implemented in parallel.

This is why I asked if the projections have also been addressed in the latest update?

Best,
Jonathan

#4 Post by **henrique_miranda** » Thu Apr 22, 2021 8:31 pm

Dear Jonathan,

We did not change the parallelization scheme for the computation of the initial projections for wannier90 in the last update.
From my experience, the computation of the overlaps (wannier90.mmn) is the bottleneck.
But we might address this in the future, thanks for pointing it out.

#5 Post by **henrique_miranda** » Thu Apr 22, 2021 9:15 pm

A small addendum to my previous post:
the projections are already computed in parallel (they were also computed in parallel before 6.2).
Is your computation not scaling with the number of cores?
Could you give more information about which system you are looking at and the timings?

#6 Post by **jbackman** » Mon Apr 26, 2021 1:00 pm

Dear Henrique,
thanks for your answer.

Yes, for me it does not seem to scale with the number of cores. It is really only a problem when I work with very large systems (500+ projections), but here I use a small 2D MoS2 system to test the scaling.

Using the WAVECAR from a previous SCF calculation, I use the following INCAR to calculate the necessary wannier90 files.

INCAR:
"
ENCUT = 500 eV
ALGO = None
ISMEAR = 0
SIGMA = 0.1
NELM = 0
EDIFF = 1E-10
GGA = PE
NBANDS = 18
LPLANE = .FALSE.
PREC = Accurate
ADDGRID = .TRUE.
LWAVE = .FALSE.
LCHARG = .FALSE.
LWANNIER90 = .TRUE.
LWRITE_MMN_AMN = .TRUE.
NUM_WANN = 11
WANNIER90_WIN = "
begin projections
Mo:l=2
S:l=1
end projections
search_shells = 70
"
"

KPOINTS:
"
K-Points
0
Monkhorst Pack
17 17 1
0 0 0
"
POSCAR:
"
MoS2 monolayer
3.18300000000000
0.8660254037800000 -0.5000000000000000 0.0000000000000000
0.8660254037800000 0.5000000000000000 0.0000000000000000
0.0000000000000000 0.0000000000000000 6.3291139240499996
Mo S
1 2
Direct
-0.0000000000000000 -0.0000000000000000 0.5000000000000000
0.3333333333352613 0.3333333333352613 0.5776104589503532
0.3333333333352613 0.3333333333352613 0.4223895410496468
"

I measure the time to calculate MMN and AMN by timing the MLWF_WANNIER90_MMN and MLWF_WANNIER90_AMN function calls in mlwf.F.

The time of each projection is measured as the time for each step of the `funcs: DO IFNC=1,SIZE(LPRJ_functions)` loop in the LPRJ_PROALL function, defined in locproj.F.

I get the following timings when increasing the number of cores.
Cores   MMN (s)   AMN (s)   Proj (s)
1       72.9      20.3      1.85
2       40.4      19.1      1.73
9       25.5      19.5      1.76

To me, it looks like the MMN calculation is scaling but not the AMN calculation. This is also my experience when dealing with a large system, where the AMN calculation becomes very slow. The calculations are done with VASP 6.2.
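
As a rough check of this reading, the speedup and parallel efficiency relative to the 1-core run can be computed directly from the table above. This is only a sketch; the function name `speedup_and_efficiency` is illustrative and not part of VASP.

```python
# Timings in seconds copied from the table above: cores -> (MMN, AMN).
timings = {
    1: (72.9, 20.3),
    2: (40.4, 19.1),
    9: (25.5, 19.5),
}

def speedup_and_efficiency(t_serial, t_parallel, cores):
    """Return (speedup, parallel efficiency) relative to the 1-core run."""
    speedup = t_serial / t_parallel
    return speedup, speedup / cores

mmn_1, amn_1 = timings[1]
for cores, (mmn, amn) in sorted(timings.items()):
    s_mmn, e_mmn = speedup_and_efficiency(mmn_1, mmn, cores)
    s_amn, e_amn = speedup_and_efficiency(amn_1, amn, cores)
    print(f"{cores:2d} cores: MMN x{s_mmn:.2f} ({e_mmn:.0%}), "
          f"AMN x{s_amn:.2f} ({e_amn:.0%})")
```

On 9 cores this gives a speedup of roughly 2.9 for MMN but only about 1.04 for AMN.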

I'm thankful for any input you have.

Best,
Jonathan

#7 Post by **henrique_miranda** » Tue Apr 27, 2021 7:39 am

Dear Jonathan,

I ran this system for the same number of cores that you have.
Here are the timings I get (compiled with a gnu toolchain):

Cores   MMN (s)     AMN (s)
1    132.439279   70.092866
2     69.650388   51.882086
9     26.695566   44.132432

While admittedly this is not very good scaling, at least it does scale somewhat.
I would like to understand what is going on, i.e. why you don't observe any scaling while my testing shows some.
What does the `Proj` column mean?

To collect the timings above I needed to add profiling statements in the routines that compute AMN and MMN (they will be included in a future release of VASP).
How did you collect the timings you report?

#8 Post by **jbackman** » Tue Apr 27, 2021 9:56 am

Dear Henrique,

Yes, it seems like you observe some scaling for both the MMN and AMN calculations. However, as you can see from my timings, I don't see this for AMN.

As I said, I measure the time to run the MLWF_WANNIER90_MMN and MLWF_WANNIER90_AMN function calls in mlwf.F.
I do this using the cpu_time() command.

The reported Proj time is the time of each step of the `funcs: DO IFNC=1,SIZE(LPRJ_functions)` loop in the LPRJ_PROALL function, defined in locproj.F. It is also measured using cpu_time().

If I sum up the individual time for each projection it adds up to that of the full AMN calculation.
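
One caveat when comparing timings: Fortran's cpu_time() reports processor time, which is not necessarily the same as elapsed wall-clock time. The Python sketch below is only an analogy (it is not the VASP instrumentation) showing how the two can differ when a process waits; the helper names are illustrative.

```python
import time

def measure(func):
    """Run func() and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed wall-clock time
    cpu_start = time.process_time()    # CPU time of this process only
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def wait_briefly():
    # Sleeping consumes wall-clock time but almost no CPU time.
    time.sleep(0.2)

wall, cpu = measure(wait_briefly)
print(f"wall = {wall:.3f} s, cpu = {cpu:.3f} s")
```

When comparing timings between different builds or people, it helps to state which of the two clocks was used.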

My code is compiled with the intel toolchain (version intel/19.1.1.217). This is my makefile.include file:

# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxIFC\"\
-DMPI -DMPI_BLOCK=8000 -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dfock_dblbuf \
-Duse_shmem \
-DVASP2WANNIER90

CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC = ftn
FCL = ftn -mkl=sequential

FREE = -free -names lowercase

FFLAGS = -assume byterecl -w -xHOST
OFLAG = -O2
OFLAG_IN = $(OFLAG)
DEBUG = -O0

MKL_PATH = $(MKLROOT)/lib/intel64
BLAS = $(MKL_PATH)/libmkl_blas95_lp64.a
LAPACK = $(MKL_PATH)/libmkl_lapack95_lp64.a
BLACS = -lmkl_blacs_intelmpi_lp64
SCALAPACK = -L$(MKL_PATH) -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64

WANNIER = /users/jbackman/wannier90/wannier90-3.1.0/libwannier.a

OBJECTS = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d.o

INCS = -I$(MKLROOT)/include/fftw

LLIBS = $(SCALAPACK) $(BLACS) $(LAPACK) $(BLAS) $(WANNIER)

OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o

# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = cc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)

OBJECTS_LIB= linpack_double.o getshmem.o

# For the parser library
CXX_PARS = CC
LLIBS += -lstdc++

# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin

MPI_INC = $(MPICH_DIR)/include

Best,
Jonathan

#9 Post by **henrique_miranda** » Wed Apr 28, 2021 1:46 pm

Dear Jonathan,

I was able to obtain timings similar to yours when compiling the code with intel/19.1.2.254 and mkl/2020.2.254.

Cores   MMN (s)     AMN (s)
1     58.068596   23.260825
2     33.101691   22.814278
9     20.505906   25.768041

I don't yet have an explanation for this, i.e. why the AMN computation scales when compiled with a gnu toolchain while it does not scale with intel. In principle, both should scale.
In any case, the intel version is much faster than the gnu one. It is clear that on intel hardware one should, if possible, use the intel compiler and mkl for optimal performance.
I will investigate further the reason for these differences and get back to you.

#10 Post by **jbackman** » Mon May 10, 2021 11:05 am

Dear Henrique,

thank you for looking into the issue. Any news on the cause of the problem and a possible solution?

Best,
Jonathan

#11 Post by **henrique_miranda** » Tue May 11, 2021 10:25 am

Dear Jonathan,

Yes, we've looked into this issue.
The reason that the code is not scaling is that the part of the computation that is distributed (dot-product between the WF and projection WF) is not necessarily the most expensive. Using the intel toolchain this part is so fast that you don't see a difference in the final timing.
There are other possible parallelization schemes we are considering to distribute the computation of the projection WFs.
This becomes more important for systems with a lot of atoms and few k-points (or gamma-only).
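
The effect of a non-distributed part can be illustrated with Amdahl's law, T(p) = T(1) * (s + (1 - s) / p), where s is the fraction of the run time that does not parallelize. The sketch below solves this for s using the 1-core and 9-core AMN timings from the tables earlier in the thread. It is a simplified model that ignores communication overhead, and the function name is illustrative.

```python
def serial_fraction(t1, tp, p):
    """Solve Amdahl's law T(p) = T(1) * (s + (1 - s) / p) for s."""
    return (tp / t1 - 1.0 / p) / (1.0 - 1.0 / p)

# AMN timings in seconds for 1 and 9 cores, taken from the earlier posts.
amn = {
    "gnu": (70.092866, 44.132432),
    "intel": (23.260825, 25.768041),
}

for toolchain, (t1, t9) in amn.items():
    s = serial_fraction(t1, t9, 9)
    print(f"{toolchain}: estimated non-distributed fraction s = {s:.2f}")
```

For the gnu build this gives s of roughly 0.58. For the intel build the estimate exceeds 1, because the 9-core run was slower than the 1-core run; this is outside what the simple model can describe and is consistent with the distributed part being negligible there.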

For this particular example (few atoms and a lot of k-points), a k-point parallelization (KPAR=N) is better.
Unfortunately, KPAR is currently not used to distribute the computation of the projections.
Only a minor modification of the code is required, so we will try to include it in a future release of VASP.

Kind regards,
Henrique Miranda

#12 Post by **jbackman** » Tue May 11, 2021 2:17 pm

Dear Henrique,

again, thank you for looking into the issue.

I agree with your conclusions for the posted example, where k-point parallelization could be a good option. However, I think the bigger problem is for large systems with a lot of atoms and few k-points. What do you think is the best option in this case, where we don't have many k-points? Projection parallelization? Do you have any estimate for when such a new release would be available?

In your testing, what is the most expensive part of the projection calculation with the intel toolchain?

Best regards,
Jonathan

#13 Post by **henrique_miranda** » Wed May 12, 2021 12:48 pm

Dear Jonathan,

From my testing, the slowest part is the computation of the local wavefunctions (CONSTRUCT_RYLM routine).
These wavefunctions are currently generated on all the MPI nodes; only the evaluation of the dot-product with the Bloch orbitals is distributed.
We are looking into ways to improve the speed and scaling of this part of the code.
There are many possibilities (distributing the computation of the local wavefunctions being one of them), but they often involve a trade-off between computation and communication, so it is hard to say what the best strategy is without implementing and testing.
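
To make the idea of distributing the local wavefunctions concrete, the sketch below splits a list of projection functions into contiguous blocks, one per MPI rank. This is purely hypothetical and not how VASP is implemented; the function name and the numbers (500 projections, 9 ranks) are assumptions for illustration. Each rank would then have to communicate its block to the others, which is the computation versus communication trade-off mentioned above.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open index range [start, stop) owned by `rank`
    when n_items are split into near-equal contiguous blocks."""
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

n_projections = 500  # hypothetical large system
n_ranks = 9          # hypothetical number of MPI ranks
for rank in range(n_ranks):
    start, stop = block_range(n_projections, n_ranks, rank)
    print(f"rank {rank}: projections {start}..{stop - 1} ({stop - start} total)")
```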

For this particular case you showed (MoS2) I find that the KPAR parallelization is adequate.
We are testing other strategies for systems with many atoms and a few k-points.

I cannot say yet when this improvement will be implemented and released.
While the current code is not the fastest possible it is not so slow either.
We have already used it to compute projections on systems with ~1000 atoms.
Do you have some applications in mind where the current implementation is the limiting factor?

Kind regards,
Henrique Miranda

#14 Post by **joel_eaves** » Thu Aug 26, 2021 11:52 pm

Dear Henrique,

My name is Peyton Cline, and I'm a postdoc in Prof. Joel Eaves' group at the University of Colorado Boulder. I just wanted to revive this thread because I am also interested in using this KPAR parallelization routine you mentioned to speed up the writing of the AMN files. For my research, I perform the vasp2wannier conversion process on surface-slab systems comprising up to 300 atoms (1000+ electrons); however, my University has strict limits on using its computer resources. I have access to one of the supercomputers here, with ample compute time, and the system itself is quite fast. But unfortunately, all calculations are limited to a maximum 7-day wall-time. I've corresponded numerous times with the people who manage our supercomputer, and they have said each time they do not allow extensions beyond 7 days.

So far, I have been successful in converting surface-slab systems comprising 100-200 atoms at the quality settings I desire (e.g. well-converged ENCUT, well-converged k-point meshes), but recently, I find myself having to sacrifice quality in order to fit my calculations within the 7-day wall-time. For example, my most recent calculations on roughly 300-atom slabs demand that I lower my ENCUT value to nearly the default cutoff in order to stay within this 7-day limit. I know this isn't a deal breaker as far as publishing papers is concerned, but I would like to be able to use more elevated cutoffs if possible.

For my largest calculations, I still use a 3x3x1 gamma-centered k-point grid, which for ISYM=0 yields 5 unique k-points. Even a relatively minor speedup (2x, for example) from a KPAR parallelization would allow me to run my calculations at the quality I want. And I imagine k-point parallelization should scale almost linearly, meaning I could potentially see something close to a 5x speedup if I parallelize over 5 k-points.
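
The expected gain can be estimated with a simple upper bound: if the k-points are split over KPAR groups that work concurrently, the number of sequential rounds is ceil(N_k / KPAR). The sketch below assumes every k-point costs the same and ignores communication and any part of the run that is not distributed over k-points, so it is an idealized estimate; the function name is illustrative.

```python
import math

def ideal_kpar_speedup(n_kpoints, kpar):
    """Upper bound on the speedup from distributing k-points over kpar groups,
    assuming equal cost per k-point and no overhead."""
    rounds = math.ceil(n_kpoints / kpar)  # sequential rounds of k-points
    return n_kpoints / rounds

for kpar in (1, 2, 3, 5):
    print(f"KPAR = {kpar}: ideal speedup {ideal_kpar_speedup(5, kpar):.2f}x")
```

With 5 k-points, KPAR = 2 or 3 already leaves some groups idle in the last round, so only KPAR = 5 reaches the ideal 5x in this model.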

My group has a license to VASP.6.2.0, and if only a "minor modification of the code" is needed, I was wondering if you could supply this change so that we can edit our version of the code. If this isn't possible, I was wondering when you were planning on releasing the next VASP update, and if that update will have this KPAR parallelization implemented for the vasp2wannier process?

Many thanks,
Peyton Cline
Postdoctoral Researcher
Eaves Group
University of Colorado Boulder

#15 Post by **henrique_miranda** » Fri Aug 27, 2021 11:53 am

Thank you for this report!

I am a bit surprised that such a calculation takes more than 7 days. I think it should not.
I would need more details about it (INCAR, POSCAR, POTCAR, OUTCAR) to give more specific advice.

Let me suggest a few things to check:
1. Since you are doing a slab calculation, I assume you have some vacuum in your unit cell. Did you check that this vacuum is the smallest possible that does not affect the results you are interested in? If the vacuum is larger than necessary, you might be wasting computational resources. Because VASP is a plane-wave code, the computation becomes more expensive even if you only add vacuum to the unit cell.
2. Which parallelization scheme are you using (NPAR, KPAR)? How many MPI ranks?
3. What takes the most time in your calculation: the SCF part, the computation of the AMN and MMN, or the wannierization? Note that in principle these three steps can be separated into different runs, allowing you to checkpoint your calculation and use the best parallelization scheme in each of them. In the SCF part of the calculation, you can write a WAVECAR file (if not too large on the hard drive) or a CHGCAR, which can be used to start a second, NSCF calculation that produces the AMN and MMN files (LWANNIER90=.TRUE.). In the SCF part you can use KPAR/=1 and NCORE/=1; in the NSCF part and the generation of the AMN and MMN, NCORE=1 is required and there is currently no benefit from using KPAR/=1. The third step is the wannierization, which can be run standalone (using wannier90.x, which reads the AMN and MMN files generated by VASP).
4. The generation of the AMN and MMN (when LWANNIER90=.TRUE.) is currently parallelized over bands, which is the default for VASP.
5. Running wannier90 in library mode (LWANNIER90_RUN=.TRUE.) only works in serial, so it is not recommended for large systems.

The new parallelization I mentioned in the previous post will only speed up the generation of the AMN and MMN when (KPAR/=1). I would be surprised if the generation of the AMN or MMN is the most expensive part of your calculation and even if it is, KPAR/=1 is unlikely to be the best parallelization scheme for large systems.
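
The three-step split described above could be set up along the following lines. This is a minimal sketch: the helper `render_incar` is hypothetical, the KPAR and NCORE values in the SCF step are placeholders to be adapted to the machine, and the NSCF tags follow the INCAR posted earlier in this thread. All other tags (ENCUT, NBANDS, NUM_WANN, projections, and so on) are system-specific and omitted here.

```python
def render_incar(tags):
    """Format a dict of INCAR tags as 'KEY = value' lines."""
    return "\n".join(f"{key} = {value}" for key, value in tags.items()) + "\n"

# Step 1: SCF run that writes a WAVECAR. KPAR and NCORE are placeholder values.
scf_tags = {
    "LWAVE": ".TRUE.",
    "KPAR": 5,
    "NCORE": 4,
}

# Step 2: restart from the WAVECAR and only write the AMN and MMN files.
# NCORE = 1 is required for this step according to the post above.
nscf_tags = {
    "ALGO": "None",
    "NELM": 0,
    "NCORE": 1,
    "LWANNIER90": ".TRUE.",
    "LWRITE_MMN_AMN": ".TRUE.",
    "LWAVE": ".FALSE.",
    "LCHARG": ".FALSE.",
}

print("# INCAR for step 1 (SCF)")
print(render_incar(scf_tags))
print("# INCAR for step 2 (AMN/MMN generation)")
print(render_incar(nscf_tags))
# Step 3 runs outside VASP: the standalone wannier90.x reads the AMN and MMN
# files written in step 2.
```

Keeping the steps in separate runs also means each one can fit within a wall-time limit on its own.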
