
GPU error : cuCtxSynchronize returned error 214

Posted: Tue Jul 26, 2022 5:29 am
by paulfons
As part of a much larger Raman scattering calculation using phono3py, I have a large series of input files for different displacements, used to compute derivatives of the dielectric tensor. The code runs fine in parallel with Intel MPI. When I try to run the same input files on my Nvidia Ampere GPU, the code always crashes in the same place (where the CPU version, although slower, has no problems). I have attached the console output below, along with the input files and OUTCAR file as an attachment. Any suggestions as to what might be wrong? I have re-run the same job with different NSIM values to see if it is a memory issue, but the code always crashes at the same place.

Code: Select all

(base) paulfons@kaon:/data/Vasp/NICT/MnTe/MnTe_alpha/af/phonons/phon3py/disp-00559>!vi
(base) paulfons@kaon:/data/Vasp/NICT/MnTe/MnTe_alpha/af/phonons/phon3py/disp-00559>!mpi
mpirun -n 1 /data/Software/Vasp/vasp.6.3.2/bin_gpu/vasp_std
 running on    1 total cores
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on    1 cores,    1 groups
 OpenACC runtime initialized ...    1 GPUs detected
 vasp.6.3.2 27Jun22 (build Jul 19 2022 16:18:56) complex                         
 POSCAR found type information on POSCAR MnTe
 POSCAR found :  2 types and      32 ions
 scaLAPACK will be used selectively (only on CPU)
|                                                                             |
|               ----> ADVICE to this user running VASP <----                  |
|                                                                             |
|     You have a (more or less) 'large supercell' and for larger cells it     |
|     might be more efficient to use real-space projection operators.         |
|     Therefore, try LREAL= Auto in the INCAR file.                           |
|     Mind: For very accurate calculation, you might also keep the            |
|     reciprocal projection scheme (i.e. LREAL=.FALSE.).                      |
|                                                                             |

 LDA part: xc-table for Pade appr. of Perdew
 POSCAR, INCAR and KPOINTS ok, starting setup
 FFT: planning ... GRIDC
 FFT: planning ... GRID_SOFT
 FFT: planning ... GRID
 WAVECAR not read
 WARNING: random wavefunctions but no delay for mixing, default for NELMDL
 entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
DAV:   1     0.175251143835E+04    0.17525E+04   -0.75528E+04  4960   0.152E+03
DAV:   2     0.113236267777E+03   -0.16393E+04   -0.16148E+04  4960   0.415E+02
DAV:   3    -0.156690869764E+03   -0.26993E+03   -0.26540E+03  7424   0.179E+02
DAV:   4    -0.174048430883E+03   -0.17358E+02   -0.17282E+02  8000   0.542E+01
DAV:   5    -0.174559906444E+03   -0.51148E+00   -0.51073E+00  7876   0.103E+01    0.383E+01
DAV:   6    -0.165910802496E+03    0.86491E+01   -0.12774E+02  8000   0.126E+02    0.425E+01
DAV:   7    -0.169713918389E+03   -0.38031E+01   -0.32403E+01  6080   0.169E+01    0.261E+01
DAV:   8    -0.169481482195E+03    0.23244E+00   -0.16604E+00  6976   0.840E+00    0.149E+01
DAV:   9    -0.169602802943E+03   -0.12132E+00   -0.78061E-01  6656   0.520E+00    0.330E+00
DAV:  10    -0.169565820351E+03    0.36983E-01   -0.22230E-01  7392   0.387E+00    0.233E+00
DAV:  11    -0.169567489048E+03   -0.16687E-02   -0.41704E-02  7392   0.773E-01    0.145E+00
DAV:  12    -0.169565077447E+03    0.24116E-02   -0.84935E-03  8544   0.294E-01    0.626E-01
DAV:  13    -0.169564303591E+03    0.77386E-03   -0.51175E-03  8768   0.337E-01    0.119E-01
DAV:  14    -0.169564422875E+03   -0.11928E-03   -0.44176E-04  8160   0.103E-01    0.122E-01
DAV:  15    -0.169564421433E+03    0.14420E-05   -0.77640E-05  7744   0.375E-02    0.361E-02
DAV:  16    -0.169564418908E+03    0.25256E-05   -0.92158E-06  6816   0.119E-02    0.120E-02
DAV:  17    -0.169564417794E+03    0.11140E-05   -0.26154E-06  8320   0.851E-03    0.570E-03
DAV:  18    -0.169564417940E+03   -0.14590E-06   -0.42598E-07  8416   0.324E-03    0.264E-03
DAV:  19    -0.169564417357E+03    0.58232E-06   -0.16410E-07  7552   0.216E-03    0.119E-03
DAV:  20    -0.169564417463E+03   -0.10538E-06   -0.24085E-08  5376   0.707E-04    0.700E-04
DAV:  21    -0.169564417484E+03   -0.21687E-07   -0.51509E-09  4160   0.337E-04    0.175E-04
DAV:  22    -0.169564417492E+03   -0.77941E-08   -0.48097E-09  4160   0.298E-04
   1 F= -.16956442E+03 E0= -.16956463E+03  d E =0.644280E-03  mag=    -0.0002
 Linear response reoptimize wavefunctions to high precision
DAV:   1    -0.169564417491E+03    0.76398E-09   -0.15442E-09  4160   0.280E-04
DAV:   2    -0.169564417491E+03   -0.58890E-10   -0.59245E-10  4160   0.113E-04
 Linear response G [H, r] |phi>, progress :
  Direction:   1
       N       E                     dE             d eps       ncg     rms
Failing in Thread:1
call to cuCtxSynchronize returned error 214: Other

Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[33005,1],0]
  Exit code:    1

Re: GPU error : cuCtxSynchronize returned error 214

Posted: Fri Aug 05, 2022 2:00 pm
by henrique_miranda
I tried to reproduce the problem by running your INCAR file on our local machines.
Unfortunately (or fortunately) I could not reproduce the error.
Could you please share a little bit more information about the toolchain you used (versions of the compilers) as well as the makefile.include?

Re: GPU error : cuCtxSynchronize returned error 214

Posted: Fri Aug 05, 2022 2:14 pm
by paulfons
Thank you for your reply. I suspect there may be a hardware issue, as the nvidia-smi command reports corruption in the infoROM. The vendor is now consulting with Nvidia about the problem.
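
As a footnote for readers who hit the same message: the OpenACC runtime prints the raw CUDA driver error number, and in the CUDA driver API (`cuda.h`) code 214 is `CUDA_ERROR_ECC_UNCORRECTABLE`, i.e. an uncorrectable ECC memory error was detected. That is consistent with the hardware diagnosis above. A minimal lookup sketch (the error codes are copied from the CUDA driver API's `CUresult` enum; the `decode_cuda_error` helper itself is just for illustration and not part of any NVIDIA API):

```python
# A few CUresult codes from the CUDA driver API (cuda.h).
# This is a small illustrative subset, not the full enum.
CUDA_DRIVER_ERRORS = {
    1:   "CUDA_ERROR_INVALID_VALUE",
    2:   "CUDA_ERROR_OUT_OF_MEMORY",
    214: "CUDA_ERROR_ECC_UNCORRECTABLE",  # uncorrectable ECC error detected
    700: "CUDA_ERROR_ILLEGAL_ADDRESS",
    999: "CUDA_ERROR_UNKNOWN",
}

def decode_cuda_error(code: int) -> str:
    """Map a raw CUDA driver error number (as printed by the
    OpenACC runtime) to its symbolic name, if known."""
    return CUDA_DRIVER_ERRORS.get(code, f"unrecognized CUresult {code}")

if __name__ == "__main__":
    # The error number from the crash log above:
    print(decode_cuda_error(214))
```

Since an uncorrectable ECC error points at the GPU memory itself rather than the application, checking `nvidia-smi -q` for ECC error counts (as was done here) is the right next step.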