
internal error in: rot.F at line: 822

Posted: Wed Aug 27, 2025 9:19 pm
by asrosen

I am running VASP 6.5.1 on a 4xA100 GPU node on Perlmutter (https://docs.nersc.gov/systems/perlmutter/architecture) and frequently run into an error that is specific to this machine when ALGO = All.

The error seen in stderr is:

Code: Select all

 -----------------------------------------------------------------------------
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: rot.F  at line: 822                                  |
|                                                                             |
|     EDWAV: internal error, the gradient is not orthogonal 1 2 6.446e-4      |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |
 -----------------------------------------------------------------------------

I recognize that this is a problem that the VASP team has not been able to independently reproduce in the past, but I am reporting another instance of it in case it helps.

You can find one of many examples attached. I compiled VASP with the Cray-specific PrgEnv-nvidia module. I believe this error is very specific to how VASP is compiled. In the past, I have seen this error be sensitive to the optimization flags used when compiling VASP. However, changing OFLAG from -fast to -O2 made no difference.

My makefile.include is also attached:

Code: Select all

# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxNV_CrayMPICH\" \
             -DMPI -DMPI_BLOCK=8000 -Duse_collective \
             -DscaLAPACK \
             -DCACHE_SIZE=4000 \
             -Davoidalloc \
             -Dvasp6 \
             -Duse_bse_te \
             -Dtbdyn \
             -Dqd_emulate \
             -Dfock_dblbuf \
             -D_OPENMP \
             -DACC_OFFLOAD \
             -DNVCUDA \
             -DUSENCCL \
             -DPROFILING \
             -DVASP_HDF5 \
             -DVASP2WANNIER90 \
             -Dlibvaspml \
             -DVASPML_USE_CBLAS
#             -DDFTD4
### Disabled for GPU build:
#             -Dsysv \
#             -DUSELIBXC \
#             -Dlibbeef \

CPP        = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX)  > $*$(SUFFIX)
CC         = nvc -mp -acc -gpu=cc80 #,cuda11.8
FC         = ftn -mp -acc=gpu -gpu=cc80
FCL        = ftn -v -mp -acc=gpu -gpu=cc80 -c++libs
FREE       = -Mfree

FFLAGS     = -Mbackslash -Mlarge_arrays
OFLAG      = -fast
DEBUG      = -Mfree -O0 -traceback
#OBJECTS    = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
LLIBS       = -cudalib=cublas,cusolver,cufft,nccl -cuda

# Redefine the standard list of O1 and O2 objects
SOURCE_O1  := pade_fit.o minimax_dependence.o wave_window.o
SOURCE_O2  := pead.o

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = $(FC)
CC_LIB      = $(CC)
CFLAGS_LIB  = -O -w
FFLAGS_LIB  = -O1 -Mfixed
FREE_LIB    = $(FREE)
OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = nvc++ --no_warnings

#######

# Specify your NV HPC-SDK installation (mandatory)
NVROOT      =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')

# Software emulation of quadruple precision (mandatory)
QD         ?= $(NVROOT)/compilers/extras/qd
LLIBS      += -L$(QD)/lib -lqdmod -lqd -Wl,-rpath=$(QD)/lib
INCS       += -I$(QD)/include/qd
LLIBS      += -L$(NVROOT)/math_libs/lib64 -Wl,-rpath=$(NVROOT)/math_libs/lib64

# mandatory
BLAS        = -lblas
LAPACK      = -llapack
SCALAPACK   = -L/global/cfs/cdirs/omp/local/scalapack-2.1.0/nvidia-22.5/milan -lscalapack
LLIBS      += $(SCALAPACK) $(LAPACK) $(BLAS)

# use cray-fftw module for FFTs
# optional packages:

# Use cusolvermp (not on Perlmutter as does not work with SS11/libfabric)
# supported as of NVHPC-SDK 24.1 (and needs CUDA-11.8)
#CPP_OPTIONS+= -DCUSOLVERMP
#LLIBS      += -cudalib=cusolvermp
#CFLAGS_LIB += -cudalib=cusolvermp
#OBJECTS_LIB+= cal_mpi.o

# NCCL (GPU builds only)
LLIBS     += -L$(NCCL_DIR)/lib -Wl,-rpath=$(NCCL_DIR)/lib
INCS      += -I$(NCCL_DIR)/include

# HDF5 (vasp >6.2.0 only)
LLIBS     += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS      += -I$(HDF5_ROOT)/include

# fftlib
#FCL        += fftlib.o
#CXX_FFTLIB  = nvc++ -mp --no_warnings -std=c++11 -DFFTLIB_THREADSAFE
#INCS_FFTLIB = -I./include -I$(FFTW_ROOT)/include
#LIBS       += fftlib
#LLIBS      += -ldl

# For machine learning library vaspml (experimental)
CXX_ML      = CC
CXXFLAGS_ML = -O3 --c++17 --no_warnings

# PATH FOR PLUGIN BUILDS
EXTLIBDIR   = $(CFS)/omp/local

# wannier90
WANNIER90_ROOT = /global/common/software/nersc9/vasp/dependencies/wannier90/nvidia-23.9/3.1.0/lib
LLIBS      += -L$(WANNIER90_ROOT) -lwannier
              

Re: internal error in: rot.F at line: 822

Posted: Mon Sep 01, 2025 4:15 pm
by asrosen

Here is a related thread: https://vasp.at/forum/viewtopic.php?t=19276.

Unfortunately (or fortunately, depending on your perspective), I reran the attached calculation and it did not reproduce the error, which makes it very difficult to debug. Nonetheless, I have dozens of examples of this error, since it appears frequently, albeit unpredictably.


Re: internal error in: rot.F at line: 822

Posted: Mon Sep 01, 2025 4:49 pm
by andreas.singraber

Hello!

Sorry for the late reply and many thanks for uploading all the relevant files! I tried to reproduce this issue on a machine with 2xA30 GPUs and an EPYC 7713 (as close as I could get) but unfortunately I was not yet able to get VASP to fail. I found a small hint at the problem when looking at the energies: if one compares our runs, one can clearly see a significant jump in your case:

toten-vs.step.png

However, it seems that this does not correspond to equally drastic changes in ionic positions... actually there is very little change visible in the trajectory. Now, this jump may still be unrelated, but I think it is rather suspicious that it happens just two steps before the error occurs. Also, just before the run crashes, the electronic steps did not converge within the allowed maximum (150). So... these are some puzzling facts... unfortunately, at the moment I do not see where to continue searching without being able to reproduce the issue.

At least I have one suggestion for you: maybe you could try ISEARCH = 1 in future runs. This should improve the robustness of ALGO = All. Since you mentioned that it is hard to reproduce the error... it would be great if you could notify us if you ever find that the error still persists, even in the presence of this tag. Then we can infer that it is likely not related to the conjugate-gradient robustness itself. On the other hand, if the issue never comes back, that would be even better news, and it would be highly appreciated if you could inform us about it.
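For readers following along, the suggested test amounts to adding a single tag next to the existing electronic minimizer setting in the INCAR. This is only a minimal sketch of the relevant fragment; all other tags in the original inputs stay unchanged:

Code: Select all

```
ALGO    = All    ! all-bands conjugate-gradient minimizer (already in use here)
ISEARCH = 1      ! suggested: more robust line-search variant for ALGO = All
```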

Since you mentioned that there are many examples of this error in your data... are these all simulations of the same system using the same hardware and compilers? If not, and if it is not too much effort, it would be fantastic if you could zip together a few of these examples and upload them here. Maybe we will have better luck at reproducing the error with those...

All the best,
Andreas Singraber


Re: internal error in: rot.F at line: 822

Posted: Mon Sep 01, 2025 7:24 pm
by asrosen

Hi Andreas! No worries about the delay.

So, first a note. I have uploaded a separate calculation that has failed for the same reason. This time, I was able to get the error to occur upon rerunning the job, but it does not occur at the same geometry step. In run 1, the calculation fails with the rot.F error at geometry step 42. In run 2, the calculation fails at geometry step 28. I have not investigated them in depth, but I will report back if I notice anything. As far as I can tell, the calculation input files should be identical and run exactly the same way. You will see at the top of the OUTCAR that the VASP build date is different, but I did not change how it was compiled (I accidentally deleted the binaries and had to rebuild...). I will upload other examples if the graduate student in my group still has them around.

Regarding the prior example, that jump in energy does indeed seem suspicious.

We will try ISEARCH = 1 and hopefully can report back about how that goes! Thank you for your time!


Re: internal error in: rot.F at line: 822

Posted: Mon Sep 01, 2025 10:05 pm
by asrosen

Sadly, no luck with ISEARCH = 1. The same issue occurs, although the SCF and optimization trajectories now differ. It now crashes at geometry step 30, as noted in the Slurm .out file.


Re: internal error in: rot.F at line: 822

Posted: Thu Sep 11, 2025 8:51 am
by andreas.singraber

Hello!

Thanks for the additional failure cases and for testing ISEARCH = 1. It seems the sudden jump may be a false lead, because I do not see these jumps in the newer uploads. Maybe it is still related, but I will not focus on this at the moment. In the meantime I also did some additional digging, but unfortunately I was not yet able to find behavior similar to what you reported. I had a closer look at the variable which triggers the error message at rot.F:820. The actual value is written in the error message; it is 6.446e-4 in this case (the threshold is 1.0E-4):

Code: Select all

EDWAV: internal error, the gradient is not orthogonal 1 2 6.446e-4 

I checked the variable while running your test cases and found that it never grows larger than 1.0E-12. If you are willing to invest some time into further debugging, you could also give it a try and add something like

Code: Select all

WRITE(*, *) "DORT ", WDES%COMM%NODE_ME, " : ", DORT

just before this block in rot.F line 820:

Code: Select all

IF (ABS(DORT)>1E-4) THEN
   CALL vtutor%bug("EDWAV: internal error, the gradient is not orthogonal " // str(NK) // &
      " " // str(NP) // " " // str(DORT), __FILE__, __LINE__)
ENDIF

It would be interesting to see whether in your case there is a sudden spike above the threshold, a gradual drift towards the threshold, or whether the value stays constantly close to the threshold and randomly crosses it because of fluctuations. Of course it is totally fine if you cannot afford the time to do this test, after all it's not your job to debug VASP ;D. However, since you mentioned a Cray-specific module, it may only be possible to reproduce this on your system, and at some point I have to give up if I cannot find this behavior on my test systems. Anyway, I still have one thing left to try, namely using 4 GPUs instead of just 2. Have you actually ever observed the error when using fewer than 4 GPUs?
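If the WRITE statement above is added, the stdout will fill with lines of the form "DORT <rank> : <value>". A small post-processing sketch like the following could help distinguish the three scenarios (spike, drift, or hovering near the threshold). This is only an illustrative helper, assuming the debug line format produced by the suggested WRITE; the function name and classification heuristic are my own, not part of VASP:

```python
import re

# Threshold from rot.F: ABS(DORT) > 1E-4 triggers the internal error.
THRESHOLD = 1.0e-4

def classify_dort(lines):
    """Classify DORT debug output lines like 'DORT 1 : 6.446E-12'."""
    values = []
    for line in lines:
        m = re.search(r"DORT\s+\d+\s*:\s*([0-9.eEdD+-]+)", line)
        if m:
            # Fortran list-directed output may use 'D' exponent markers.
            values.append(abs(float(m.group(1).replace("D", "E").replace("d", "e"))))
    if not values:
        return "no DORT lines found"
    peak = max(values)
    if peak < 0.1 * THRESHOLD:
        return "well below threshold"
    # Sudden spike: the final value dominates an otherwise small history.
    mean_before = sum(values[:-1]) / max(len(values) - 1, 1)
    if values[-1] > 10 * mean_before:
        return "sudden spike"
    return "drift or fluctuation near threshold"
```

Running this over the stdout of a crashed job (the lines up to and including the failure) would show whether DORT jumps abruptly or creeps upward.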

All the best,
Andreas Singraber


Re: internal error in: rot.F at line: 822

Posted: Mon Sep 15, 2025 4:20 am
by asrosen

Thanks, Andreas!! I will test this out early next week and let you know. I am more than happy to dig into this further, especially because it is so difficult to reproduce externally.

I consulted with staff at NERSC, who were able to reproduce the bug as well. We tried setting LUSENCCL = .FALSE., but no luck there.
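For anyone wanting to repeat that NCCL test, it amounts to a single extra tag in the INCAR (a minimal sketch; everything else in the inputs stays as before):

Code: Select all

```
LUSENCCL = .FALSE.   ! fall back from NCCL to MPI for GPU communication
```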

I'll be in touch with more info soon!


Re: internal error in: rot.F at line: 822

Posted: Fri Sep 19, 2025 11:06 pm
by asrosen

A small update: this error is not restricted to the one machine I was using. We have now received the same error on a different machine, DeltaAI (https://docs.ncsa.illinois.edu/systems/ ... en/latest/). Both machines that reproduce the error are GPU machines and use the HPE/Cray-provided user environments.

I will continue testing with the above suggestion soon.