strange behavior of PAW double counting energy

yingma
Newbie
Posts: 4
Joined: Sat Nov 16, 2019 8:10 pm

strange behavior of PAW double counting energy

#1 Post by yingma » Sat Jul 17, 2021 5:38 pm

I recently noticed very strange behavior. I submitted the same job (all input files exactly identical) to a single node, two nodes, and three nodes; the cluster I'm running on has 64 cores per node. While the results for one and two nodes are almost exactly the same up to small numerical differences, the total energy from the three-node run came out as NaN. I looked at the output and realized that the cause is the PAW double counting energy (PAWAE), which is NaN. All the other energy terms are the same as in the one- and two-node runs. This is the only difference; everything else, including forces, temperature, etc., is also the same.

It is really weird and I do not know why. Could you please advise? Thanks a lot. All the files are attached.
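Not part of the original post: a quick way to confirm which energy terms went bad is to scan the OUTCAR-style output for NaN values. A minimal sketch (the sample lines below are illustrative, not taken from the attached files):

```python
import math

def find_nan_terms(lines):
    """Return the labels of 'name = value' lines whose value parses as NaN."""
    bad = []
    for line in lines:
        if "=" not in line:
            continue
        label, _, value = line.partition("=")
        try:
            if math.isnan(float(value.split()[0])):
                bad.append(label.strip())
        except (ValueError, IndexError):
            continue
    return bad

# Illustrative output lines (not from the actual OUTCAR):
sample = [
    "  PAW double counting   =          NaN",
    "  entropy T*S    EENTRO =        -0.0219",
    "  eigenvalues    EBANDS =      -150.1234",
]
print(find_nan_terms(sample))  # ['PAW double counting']
```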

henrique_miranda
Global Moderator
Posts: 234
Joined: Mon Nov 04, 2019 12:41 pm
Contact:

Re: strange behavior of PAW double counting energy

#2 Post by henrique_miranda » Mon Jul 19, 2021 1:15 pm

This is indeed very strange.
Unfortunately, I was not able to reproduce this issue locally using the latest version of VASP compiled with a gcc toolchain.

Could you give me some more information about your compilation?
I would like to see the 'makefile.include' as well as which compiler and libraries were used if possible.
It is very difficult to track down the issue if I cannot reproduce it locally.

yingma
Newbie
Posts: 4
Joined: Sat Nov 16, 2019 8:10 pm

Re: strange behavior of PAW double counting energy

#3 Post by yingma » Tue Jul 20, 2021 9:48 pm

Thank you. Please see makefile.include below. Please note that I'm using the Cray LibSci math library and the Cray ftn compiler, which links the LAPACK, BLAS, and FFTW libraries automatically. Furthermore, I'm using OpenMPI 4.1.1, built with the Cray compilers.

# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxGNU\" \
-DMPI -DMPI_BLOCK=65536 -Duse_collective \
-DCACHE_SIZE=8000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dfock_dblbuf

CPP = gcc -E -P -C -w $*$(FUFFIX) >$*$(SUFFIX) $(CPP_OPTIONS)
#CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
#CPP = cpp --traditional -P $(CPP_OPTIONS) $*$(FUFFIX) $*$(SUFFIX)
#FC = /data/users/yingma/dir.src/openmpi/bin/mpif90
#FCL = /data/users/yingma/dir.src/openmpi/bin/mpif90
FC = mpif90
FCL = mpif90

FREE = -ffree

FFLAGS = -dC -rmo -emEb -hnoomp -N1023
OFLAG = -O3
OFLAG_IN = $(OFLAG)
DEBUG = -O0
BLAS =
LAPACK =
BLACS =
SCALAPACK =

LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS)

#FFTW = /opt/cray/pe/fftw/3.3.8.8/x86_rome
#LLIBS += -L$(FFTW)/lib -lfftw3
#INCS = -I$(FFTW)/include

OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o

OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o

# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = cc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)
OBJECTS_LIB= linpack_double.o getshmem.o

# For the parser library
CXX_PARS = CC
LLIBS += -lstdc++


# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin
#================================================
# GPU Stuff

CPP_GPU = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DCUFFT_MIN=28 -UscaLAPACK -Ufock_dblbuf # -DUSE_PINNED_MEMORY

OBJECTS_GPU= fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o

CC = cc
CXX = CC
CFLAGS = -fPIC -DADD_ -openmp -DMAGMA_WITH_MKL -DMAGMA_SETAFFINITY -DGPUSHMEM=300 -DHAVE_CUBLAS

# Minimal requirement is CUDA >= 10.X. For "sm_80" you need CUDA >= 11.X.
CUDA_ROOT ?= /usr/local/cuda
NVCC := $(CUDA_ROOT)/bin/nvcc
CUDA_LIB := -L$(CUDA_ROOT)/lib64 -lnvToolsExt -lcudart -lcuda -lcufft -lcublas

GENCODE_ARCH := -gencode=arch=compute_60,code=\"sm_60,compute_60\" \
-gencode=arch=compute_70,code=\"sm_70,compute_70\" \
-gencode=arch=compute_80,code=\"sm_80,compute_80\"

MPI_INC = /opt/gnu/ompi-3.1.4-GNU-5.4.0/include

henrique_miranda
Global Moderator
Posts: 234
Joined: Mon Nov 04, 2019 12:41 pm
Contact:

Re: strange behavior of PAW double counting energy

#4 Post by henrique_miranda » Wed Jul 21, 2021 11:57 am

Thanks for the makefile.include.
If I remember correctly, we encountered some issues when compiling the latest distributed version of VASP with the latest Cray compilers.
We are working to integrate the necessary changes in the next release.

Which version of the Cray compiler are you using?

Code: Select all

mpif90 -V
Did you modify any part of the VASP source code in order to compile it with the Cray compilers?
If so, what modifications did you make?

yingma
Newbie
Posts: 4
Joined: Sat Nov 16, 2019 8:10 pm

Re: strange behavior of PAW double counting energy

#5 Post by yingma » Wed Jul 21, 2021 8:08 pm

We are using Cray compiler version 11.0.3. Yes, I did make some (minor) changes to the source code, including:

in scpc.F, access=append needs to be changed to position=append (there are a few occurrences)
in minimax.F, remove the comma in WRITE(*,1) on lines 2435 and 2607
in pade_fit.F, SUBROUTINE PADE_SVD_EVAL, the variable name Q(:) needs to be changed

In addition, fast_aug.F does not compile at the -O2 optimization level, but is fine with -O1.

When building hybrid MPI/OpenMP, the OpenMP parallel do directives on lines 1154 and 115 of hamil_lrf.F need to be commented out. Not sure why.

henrique_miranda
Global Moderator
Global Moderator
Posts: 234
Joined: Mon Nov 04, 2019 12:41 pm
Contact:

Re: strange behavior of PAW double counting energy

#6 Post by henrique_miranda » Mon Jul 26, 2021 9:15 am

Ok, all these changes sound familiar.

Unfortunately, I was not able to reproduce this issue on our local machines.
We have no access to Cray hardware and compilers, so it is difficult to reproduce (in particular for a job running on 192 cores).
There are a few steps I would take to try to track down this issue (in this order):
1. Recompile the code with a decreasing level of optimization (-O3 is not recommended and is normally not much better than -O2; if -O2 also produces errors, try -O0).
2. Run the VASP test suite and report any failed tests.
3. Try trapping floating-point exceptions (search for "trap=" in https://www.nersc.gov/assets/Documentat ... ayftn.html).
4. Compile the same source (with the modifications you made for Cray) with a different compiler.
5. Try to reproduce the issue on a smaller system with fewer MPI ranks.

If all of these steps do not work then the best option to track down this issue would be if you could somehow grant us access to the machine where you observed this problem.

yingma
Newbie
Posts: 4
Joined: Sat Nov 16, 2019 8:10 pm

Re: strange behavior of PAW double counting energy

#7 Post by yingma » Mon Aug 02, 2021 7:18 pm

It looks like it is related to variables with no initial values. I recompiled VASP with the -e0 option, which initializes all undefined variables to zero, and the strange behavior is gone. Some other compilers initialize undefined variables to zero by default, but the Cray compiler does not.

That said, I don't know which variables lead to this behavior.
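As a side note (my addition, not from the thread): once a single term becomes NaN, for example via garbage read from uninitialized memory, it poisons every sum it enters, which is consistent with a NaN PAW double counting term producing a NaN total energy. A small Python illustration of the propagation:

```python
import math

# Stand-in for a term that ended up NaN (e.g. from an uninitialized variable)
paw_double_counting = float("nan")
other_terms = [-150.1234, 25.678, -0.0219]

# Summing any finite terms with a NaN yields NaN
total_energy = sum(other_terms) + paw_double_counting

print(math.isnan(total_energy))  # True: one NaN term makes the whole sum NaN
print(paw_double_counting == paw_double_counting)  # False: NaN never compares equal to itself
```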

henrique_miranda
Global Moderator
Posts: 234
Joined: Mon Nov 04, 2019 12:41 pm
Contact:

Re: strange behavior of PAW double counting energy

#8 Post by henrique_miranda » Thu Aug 05, 2021 8:57 am

Thanks a lot for this information!
This gives us a good hint to try to identify what causes this problem and fix it in the code.
In the meantime, it might be a good idea to keep using the compiler option that initializes variables to zero (-e0).
