VASP 6.1.1 - GPU fails (CUDA Error in cuda_mem.cu)

questions related to VASP with GPU support (vasp.5.4.1, version released in Feb 2016)

bert_tijskens
Newbie
Posts: 7
Joined: Tue Jul 22, 2014 9:40 am
License Nr.: 5-568

#1 Post by bert_tijskens » Tue Oct 13, 2020 4:50 pm

Hello

We’ve successfully compiled VASP 6.1.1 with Intel 2020 (compilers, MKL, MPI) and CUDA 11.
The CPU version passes all tests except SiC_TDHSE.
The GPU version crashes. The job has full access to both GPU devices.
This is the output of the first test from the test suite:

bulk_GaAs_ACFDT_RPR step DFT
entering run_vasp
Using device 0 (rank 1, local rank 1, local size 4) : Tesla P100-PCIE-16GB
Using device 1 (rank 3, local rank 3, local size 4) : Tesla P100-PCIE-16GB
Using device 1 (rank 2, local rank 2, local size 4) : Tesla P100-PCIE-16GB
Using device 0 (rank 0, local rank 0, local size 4) : Tesla P100-PCIE-16GB
 running on    4 total cores
 distrk:  each k-point on    2 cores,    2 groups
 distr:  one band on    1 cores,    2 groups
 using from now: INCAR

[…]

 POSCAR found :  2 types and       2 ions

[…]

 LDA part: xc-table for Pade appr. of Perdew

CUDA Error in cuda_mem.cu, line 44: all CUDA-capable devices are busy or unavailable
 Failed to register pinned memory!
[pa2:30823:0:30823] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:  30823) ====
 0 0x000000000004cb95 ucs_debug_print_backtrace()  ???:0
 1 0x00000000019bf1fb __cuda_error()  /dev/shm/vasp.6.1.1/build/gpu/CUDA/cuda_globals.h:59
 2 0x00000000019bf1fb nvpinnedmalloc_C()  /dev/shm/vasp.6.1.1/build/gpu/CUDA/cuda_mem.cu:43
 3 0x00000000005edcb2 wave_mp_gen_layout_()  ???:0
 4 0x0000000001821cc6 MAIN__()  ???:0
 5 0x000000000040cfd2 main()  ???:0
 6 0x0000000000022555 __libc_start_main()  ???:0
 7 0x000000000040cee9 _start()  ???:0
=================================
creating 32 CUDA streams...
creating 32 CUDA streams...
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
vasp_gpu           00000000019FBB1A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B1369625630  Unknown               Unknown  Unknown
vasp_gpu           00000000019BF1FB  Unknown               Unknown  Unknown
vasp_gpu           00000000005EDCB2  Unknown               Unknown  Unknown
vasp_gpu           0000000001821CC6  Unknown               Unknown  Unknown
vasp_gpu           000000000040CFD2  Unknown               Unknown  Unknown
libc-2.17.so       00002B1369B56555  __libc_start_main     Unknown  Unknown
vasp_gpu           000000000040CEE9  Unknown               Unknown  Unknown
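
For reference, the "Using device" lines in the log above can be tallied to see how the MPI ranks were mapped onto the two GPUs. The following is a minimal sketch, assuming the log format shown above; the function name `ranks_per_device` and the regular expression are illustrative and not part of VASP. It only summarizes the mapping. Whether sharing a device between ranks is related to the CUDA error is not established here, although the GPU compute mode may be worth checking, since this error message is commonly seen when a GPU in an exclusive compute mode is requested by more than one process.

```python
import re
from collections import Counter

# Device-assignment lines copied verbatim from the log above.
LOG = """\
Using device 0 (rank 1, local rank 1, local size 4) : Tesla P100-PCIE-16GB
Using device 1 (rank 3, local rank 3, local size 4) : Tesla P100-PCIE-16GB
Using device 1 (rank 2, local rank 2, local size 4) : Tesla P100-PCIE-16GB
Using device 0 (rank 0, local rank 0, local size 4) : Tesla P100-PCIE-16GB
"""

# Hypothetical pattern: assumes the "Using device N (rank M, ..." format seen above.
PATTERN = re.compile(r"Using device (\d+) \(rank (\d+),")


def ranks_per_device(log_text):
    """Map each GPU device index to the sorted list of MPI ranks that selected it."""
    mapping = {}
    for device, rank in PATTERN.findall(log_text):
        mapping.setdefault(int(device), []).append(int(rank))
    return {dev: sorted(ranks) for dev, ranks in mapping.items()}


mapping = ranks_per_device(LOG)
for dev, ranks in sorted(mapping.items()):
    print(f"device {dev}: ranks {ranks}")

# Count how many ranks share each device.
shared = Counter({dev: len(ranks) for dev, ranks in mapping.items()})
print("ranks per device:", dict(shared))
```

For this log, the tally shows two ranks on each of the two devices.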

ferenc_karsai
Global Moderator
Posts: 422
Joined: Mon Nov 04, 2019 12:44 pm

#2 Post by ferenc_karsai » Fri Oct 23, 2020 11:02 am

This example consists of several VASP calculations. Could you please run them separately, in the same order, to better pinpoint the error?

Please also upload your makefile.include.
