VASP 6.3.0 crashes several nodes on lonestar 6

nicholas_dimakis1
Newbie
Posts: 17
Joined: Tue Sep 15, 2020 3:36 pm

VASP 6.3.0 crashes several nodes on lonestar 6

#1 Post by nicholas_dimakis1 » Tue May 10, 2022 10:50 pm

Hello

Running VASP 6.3.0 on the TACC Lonestar6 supercomputer causes several nodes to go down. This is the VASP version compiled by TACC. Here is a short description of ls6 (https://portal.tacc.utexas.edu/user-guides/lonestar6):

Compute Nodes

Lonestar6 hosts 560 compute nodes with 5 TFlops of peak performance per node and 256 GB of DRAM.

Table 1. Compute Node Specifications

CPU: 2x AMD EPYC 7763 64-Core Processor ("Milan")
Total cores per node: 128 cores on two sockets (64 cores/socket)
Hardware threads per core: 1 per core
Hardware threads per node: 128 x 1 = 128
Clock rate: 2.45 GHz (boost up to 3.5 GHz)
RAM: 256 GB (3200 MT/s) DDR4
Cache: 32 KB L1 data cache per core
512 KB L2 per core
32 MB L3 per core complex (1 core complex contains 8 cores)
256 MB L3 total (8 core complexes)
Each socket can cache up to 288 MB (sum of L2 and L3 capacity)
Local storage: 144 GB /tmp partition on a 288 GB SSD.

Below is the vasp.mpi script that is used to run the job.

POSCAR.gz
I cannot attach OUTCAR since this file is too large. However, this is the information regarding memory used from OUTCAR

total amount of memory used by VASP MPI-rank0 297598. kBytes
=======================================================================

base : 30000. kBytes
nonl-proj : 37999. kBytes
fftplans : 7237. kBytes
grid : 7721. kBytes
one-center: 93. kBytes
wavefun : 214548. kBytes
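As a back-of-the-envelope check (not from the original post), the OUTCAR numbers above can be scaled to a per-node estimate, assuming rank 0's footprint is typical of all ranks; the rank and node counts are those of the job described below:

```shell
# Rough per-node memory estimate from the OUTCAR numbers above.
# Assumption: every MPI rank uses about as much memory as rank 0.
kb_per_rank=297598   # "total amount of memory used by VASP MPI-rank0"
ranks=1280           # total MPI ranks of the job
nodes=15             # nodes requested
ranks_per_node=$(( (ranks + nodes - 1) / nodes ))        # rounds up to 86
gb_per_node=$(( kb_per_rank * ranks_per_node / 1024 / 1024 ))
echo "approx. ${gb_per_node} GB per node (node has 256 GB)"
```

With these numbers the estimate comes out well below the 256 GB per node, which suggests the steady-state footprint reported by OUTCAR is not the whole story (peak usage during setup or FFTs can be considerably higher).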

Any suggestions are very welcome.

Thanks-Nick

andreas.singraber
Global Moderator
Posts: 231
Joined: Mon Apr 26, 2021 7:40 am

Re: VASP 6.3.0 crashes several nodes on lonestar 6

#2 Post by andreas.singraber » Tue May 10, 2022 11:36 pm

Hello Nick!

Can you be more specific about how the "nodes go down"? Does VASP crash, and is there an error message? Please also post the stdout and stderr of the job if you still have them. How many cores/nodes do you use? Can you maybe post your submit script? Which VASP version do you use and how was it compiled? Do you use shared memory? Can you please attach the corresponding makefile.include?

Thank you!

Best,
Andreas Singraber

nicholas_dimakis1
Newbie
Posts: 17
Joined: Tue Sep 15, 2020 3:36 pm

Re: VASP 6.3.0 crashes several nodes on lonestar 6

#3 Post by nicholas_dimakis1 » Wed May 11, 2022 12:23 am

Hi Andreas,

Thank you for your reply. The run mpi script is as follows:

#!/bin/bash
#SBATCH -J vasp
#SBATCH -o vasp.%j.out
#SBATCH -e vasp.%j.err
#SBATCH -n 1280
#SBATCH -N 15
#SBATCH -p normal
#SBATCH -t 24:00:00
#SBATCH -A CHE21028

#module load intel/18.0.2
module load intel/19.1.1
module load impi/19.0.9
#module load cray_mpich/7.7.3
#module swap intel intel/17.0.4
#module load vasp/6.1.2
module load vasp/6.3.0
#module load vasp/5.4.4
ibrun vasp_std > vasp_test.out

This means I am running VASP 6.3.0 on 15 nodes with 1280 MPI processes in total. Each node has 128 processors, but I requested more nodes than strictly needed in order to have more RAM per CPU. The vasp.184186.err is attached.
I will contact TACC support to get information on the makefile.

The code caused several nodes of ls6 to stop working (crashed) and a reboot was needed.
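If memory pressure turns out to be the cause, one option (not discussed in the thread so far) is to keep the same total rank count but pin fewer ranks on each node via SLURM's `--ntasks-per-node`. A hypothetical variant of the script above; the node and per-node task counts are illustrative, not a tested recommendation:

```shell
#!/bin/bash
# Hypothetical variant of the submit script above: instead of letting
# SLURM spread 1280 tasks over 15 nodes, fix the rank count per node
# so the per-rank RAM share is explicit (256 GB / 64 ranks = 4 GB).
#SBATCH -J vasp
#SBATCH -o vasp.%j.out
#SBATCH -e vasp.%j.err
#SBATCH -N 20
#SBATCH --ntasks-per-node=64   # 20 x 64 = 1280 ranks, same total as before
#SBATCH -p normal
#SBATCH -t 24:00:00
#SBATCH -A CHE21028

module load intel/19.1.1
module load impi/19.0.9
module load vasp/6.3.0
ibrun vasp_std > vasp_test.out
```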

Thanks-Nick

nicholas_dimakis1
Newbie
Posts: 17
Joined: Tue Sep 15, 2020 3:36 pm

Re: VASP 6.3.0 crashes several nodes on lonestar 6

#4 Post by nicholas_dimakis1 » Wed May 11, 2022 3:42 pm

Hello

This is the information I got from TACC on the makefile.include

Hi,

It is great that you contacted the developers.

Attached is the makefile.include to build 6.3.0 on ls6.

Please also send them the following output from the diff command, comparing our file against their recommended one:

staff.ls6(1007)$ diff makefile.include.intel_omp ../makefile.include
52c52
< FFLAGS += -xHOST
---
> FFLAGS += -mavx2
56,57c56,57
< FCL += -qmkl
< MKLROOT ?= /path/to/your/mkl/installation
---
> FCL += -mkl
> #MKLROOT ?= /path/to/your/mkl/installation
68,70c68,75
< #CPP_OPTIONS += -DVASP2WANNIER90
< #WANNIER90_ROOT ?= /path/to/your/wannier90/installation
< #LLIBS += -L$(WANNIER90_ROOT)/lib -lwannier
---
> CPP_OPTIONS += -DVASP2WANNIER90
> WANNIER90_ROOT ?= ../../../wannier90-3.1.0/
> LLIBS += -L$(WANNIER90_ROOT)/ -lwannier
>
> # For Libbeef (optional)
> CPP_OPTIONS += -Dlibbeef
> LIBBEEF_ROOT ?= ../../../libbeef/
> LLIBS += -L$(LIBBEEF_ROOT)/lib -lbeef

As you and they will see, the one used on ls6 is basically the same as theirs; the only real difference is -mavx2 replacing -xHOST, to fit the AMD architecture on ls6. The build is on top of intel_compilers_and_libraries_2020.1.217, with the Intel compiler suite 19.1.1, its associated MKL, and Intel MPI 19.0.9.

Regards
Hang Liu

nicholas_dimakis1
Newbie
Posts: 17
Joined: Tue Sep 15, 2020 3:36 pm

Re: VASP 6.3.0 crashes several nodes on lonestar 6

#5 Post by nicholas_dimakis1 » Wed May 11, 2022 7:11 pm

Hello

This is the makefile.include that was used by TACC to compile VASP 6.3.0 under lonestar 6.

# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
-DMPI -DMPI_BLOCK=8000 -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dfock_dblbuf \
-D_OPENMP

CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC = mpiifort -qopenmp
FCL = mpiifort

FREE = -free -names lowercase

FFLAGS = -assume byterecl -w

OFLAG = -O2
OFLAG_IN = $(OFLAG)
DEBUG = -O0

OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o

# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = icc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS = icpc
LLIBS = -lstdc++

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##

# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
FFLAGS += -mavx2

# Intel MKL (FFTW, BLAS, LAPACK, and scaLAPACK)
# (Note: for Intel Parallel Studio's MKL use -mkl instead of -qmkl)
FCL += -mkl
#MKLROOT ?= /path/to/your/mkl/installation
LLIBS += -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
INCS =-I$(MKLROOT)/include/fftw

# HDF5-support (optional but strongly recommended)
#CPP_OPTIONS+= -DVASP_HDF5
#HDF5_ROOT ?= /path/to/your/hdf5/installation
#LLIBS += -L$(HDF5_ROOT)/lib -lhdf5_fortran
#INCS += -I$(HDF5_ROOT)/include

# For the VASP-2-Wannier90 interface (optional)
CPP_OPTIONS += -DVASP2WANNIER90
WANNIER90_ROOT ?= ../../../wannier90-3.1.0/
LLIBS += -L$(WANNIER90_ROOT)/ -lwannier

# For Libbeef (optional)
CPP_OPTIONS += -Dlibbeef
LIBBEEF_ROOT ?= ../../../libbeef/
LLIBS += -L$(LIBBEEF_ROOT)/lib -lbeef

# For the fftlib library (experimental)
#FCL = mpiifort fftlib.o -qmkl
#CXX_FFTLIB = icpc -qopenmp -std=c++11 -DFFTLIB_USE_MKL -DFFTLIB_THREADSAFE
#INCS_FFTLIB = -I./include -I$(MKLROOT)/include/fftw
#LIBS += fftlib

andreas.singraber
Global Moderator
Posts: 231
Joined: Mon Apr 26, 2021 7:40 am

Re: VASP 6.3.0 crashes several nodes on lonestar 6

#6 Post by andreas.singraber » Fri May 20, 2022 2:52 pm

Hello!

Sorry for the delay! I could not find anything suspicious in the makefile.include; these are all standard settings. The SLURM log file says the job was cancelled due to a time limit. Is that actually the case, or did VASP crash earlier and hang until it was cancelled? Did you try to run the same job again in the meantime?

To be honest, there are not many ways I can imagine VASP forcing a node into an unrecoverable state that requires a manual reboot. The only thing that comes to mind is a lack of sufficient memory. The operating system may fail to recover properly when an application overflows the available memory, especially if no swap space is reserved on disk to absorb additional allocations (at the cost of speed) once the RAM is exhausted.

I would recommend running the simulation with relaxed settings, e.g. fewer k-points, and checking the memory requirements. Try to extrapolate how much memory the full job would need and whether that amount is actually available.
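The suggested extrapolation can be sketched with shell arithmetic; all numbers below are hypothetical placeholders, and the scaling with k-points is assumed to be roughly linear, which is only a first approximation:

```shell
# Sketch of the extrapolation suggested above, with made-up numbers:
# take the per-rank memory of a reduced test run (from its OUTCAR)
# and scale by the k-point ratio to estimate the full job.
kb_small=300000     # per-rank kBytes in the reduced run (hypothetical)
kpts_small=4        # irreducible k-points in the reduced run (hypothetical)
kpts_full=16        # irreducible k-points in the full run (hypothetical)
kb_full=$(( kb_small * kpts_full / kpts_small ))
ranks_per_node=86   # 1280 ranks over 15 nodes, rounded up
gb_node=$(( kb_full * ranks_per_node / 1024 / 1024 ))
echo "estimated ${gb_node} GB per node (node has 256 GB)"
```

If the estimate lands anywhere near the 256 GB per node, that would be consistent with the out-of-memory scenario described above.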

Best,

Andreas Singraber
