Segmentation fault after 2 iter VASP 5.4.4 on GPU

questions related to VASP with GPU support (vasp.5.4.1, version released in Feb 2016)



#1 Post by kutov_danil » Wed Jan 29, 2020 1:08 pm

Hi,

We get a segmentation fault every time after 2 iterations of an MD simulation when running large supercells with VASP 5.4.4 built with the CUDA libraries. The structural relaxation of the same system (IBRION = 2, ISIF = 3), on the other hand, works well.
The simulation box contains 243 atoms of alpha-quartz.
Our supercomputer consists of Intel Haswell-EP E5-2697v3 nodes with NVIDIA Tesla K40M GPU cards.
We use these modules when compiling and running VASP: intel/15.0.3, openmpi/2.1.1-icc, cuda/6.5, mkl/11.1.3.

The test input files INCAR, KPOINTS and POSCAR are below:

Code: Select all

SYSTEM = SiO2: alfa-quartz, Si 81, O 162, 243 atoms, SuperCell 3x3x3
LWAVE  =  .FALSE.
LCHARG = .FALSE. 
LREAL=A 
ISYM = 0
ISMEAR = 0 
SIGMA = 0.1
#ENCUT = 600.0   
IBRION = 0
MDALGO = 3    
ISIF = 3
#SMASS = -1
LANGEVIN_GAMMA = 30.0 30.0
LANGEVIN_GAMMA_L = 30.0
PMASS = 3840       
ALGO = VeryFast
PREC = Normal
TEBEG = 300
NSW = 10    
POTIM = 1.0  

Code: Select all

K-Points
0
Gamma
1 1 1
0 0 0 

Code: Select all

SiO2_quartz
1.0
       14.7390003204         0.0000000000         0.0000000000
       -7.3695001602        12.7643487039         0.0000000000
        0.0000000000         0.0000000000        16.2155990601
    O   Si
  162   81
Cartesian
..... 
Here is an example of the error when running the job on 1 GPU card:

Code: Select all

Using device 0 (rank 0, local rank 0, local size 1) : Tesla K40st
 running on    1 total cores
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on    1 cores,    1 groups
 using from now: INCAR     
  
 *******************************************************************************
  You are running the GPU port of VASP! When publishing results obtained with
  this version, please cite:
   - M. Hacene et al., http://dx.doi.org/10.1002/jcc.23096
   - M. Hutchinson and M. Widom, http://dx.doi.org/10.1016/j.cpc.2012.02.017
  
  in addition to the usual required citations (see manual).
  
  GPU developers: A. Anciaux-Sedrakian, C. Angerer, and M. Hutchinson.
 *******************************************************************************
  
 vasp.5.4.4.18Apr17-6-g9f103f2a35 (build Jul 03 2017 17:00:58) complex          
  
 POSCAR found type information on POSCAR  O  Si
 POSCAR found :  2 types and     243 ions
 LDA part: xc-table for Pade appr. of Perdew
  
 WARNING: The GPU port of VASP has been extensively
 tested for: ALGO=Normal, Fast, and VeryFast.
 Other algorithms may produce incorrect results or
 yield suboptimal performance. Handle with care!
  
 POSCAR, INCAR and KPOINTS ok, starting setup
creating 32 CUDA streams...
creating 32 CUFFT plans with grid size 72 x 70 x 70...
 FFT: planning ...
 WAVECAR not read
 prediction of wavefunctions initialized - no I/O
 ######################################################################
 entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1     0.220862746483E+05    0.22086E+05   -0.36918E+05   777   0.988E+02
RMM:   2     0.150356269278E+05   -0.70506E+04   -0.66219E+04   777   0.461E+02
RMM:   3     0.947948630441E+04   -0.55561E+04   -0.40170E+04   777   0.352E+02
RMM:   4     0.633797469509E+04   -0.31415E+04   -0.24216E+04   777   0.279E+02
RMM:   5     0.441019924087E+04   -0.19278E+04   -0.15981E+04   777   0.236E+02
RMM:   6     0.315396338638E+04   -0.12562E+04   -0.11116E+04   777   0.208E+02
RMM:   7     0.223813602408E+04   -0.91583E+03   -0.85601E+03   777   0.189E+02
RMM:   8     0.151620648562E+04   -0.72193E+03   -0.69290E+03   777   0.174E+02
RMM:   9    -0.778374908781E+03   -0.22946E+04   -0.13638E+04  2292   0.163E+02
RMM:  10    -0.165099748848E+04   -0.87262E+03   -0.52765E+03  2276   0.487E+01
RMM:  11    -0.188654282149E+04   -0.23555E+03   -0.25003E+03  1716   0.600E+01
RMM:  12    -0.202262859422E+04   -0.13609E+03   -0.13550E+03  1996   0.186E+01    0.112E+02
RMM:  13    -0.181428587026E+04    0.20834E+03   -0.77143E+02  2134   0.404E+01    0.831E+01
RMM:  14    -0.183042132334E+04   -0.16135E+02   -0.33408E+02  2058   0.191E+01    0.646E+01
RMM:  15    -0.185533603019E+04   -0.24915E+02   -0.20262E+02  1945   0.164E+01    0.301E+01
RMM:  16    -0.184354463284E+04    0.11791E+02   -0.47213E+01  2040   0.104E+01    0.842E+00
RMM:  17    -0.184462327420E+04   -0.10786E+01   -0.26974E+01  1965   0.707E+00    0.931E+00
RMM:  18    -0.184396313235E+04    0.66014E+00   -0.40891E+00  1919   0.427E+00    0.115E+00
RMM:  19    -0.184419551414E+04   -0.23238E+00   -0.20106E+00  1916   0.188E+00    0.434E+00
RMM:  20    -0.184403837638E+04    0.15714E+00   -0.38680E-01  1885   0.132E+00    0.198E+00
RMM:  21    -0.184403372340E+04    0.46530E-02   -0.35328E-01  1888   0.720E-01    0.316E-01
RMM:  22    -0.184403669105E+04   -0.29676E-02   -0.39324E-02  1751   0.490E-01    0.336E-01
RMM:  23    -0.184403607685E+04    0.61420E-03   -0.53081E-03  1690   0.115E-01    0.177E-01
RMM:  24    -0.184403630741E+04   -0.23055E-03   -0.95524E-04  1328   0.788E-02    0.206E-01
RMM:  25    -0.184403598416E+04    0.32324E-03   -0.12132E-03  1610   0.513E-02    0.367E-02
RMM:  26    -0.184403601390E+04   -0.29732E-04   -0.45587E-04  1195   0.595E-02
    1 T=   280. E= -.18351244E+04 F= -.18440360E+04 E0= -.18440360E+04  EK= 0.89116E+01 SP= 0.00E+00 SK= 0.00E+00
 ######################################################################
 bond charge predicted
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.184272191805E+04    0.13141E+01   -0.43494E+01  2118   0.103E+01    0.114E+00
RMM:   2    -0.184344173004E+04   -0.71981E+00   -0.80263E+00  2056   0.423E+00    0.790E-01
RMM:   3    -0.184344384777E+04   -0.21177E-02   -0.10826E-01  2020   0.283E-01    0.520E-01
RMM:   4    -0.184344172215E+04    0.21256E-02   -0.25429E-02  1828   0.310E-01    0.164E-01
RMM:   5    -0.184344260129E+04   -0.87914E-03   -0.11951E-02  1682   0.163E-01    0.112E-01
RMM:   6    -0.184344251539E+04    0.85906E-04   -0.22211E-03  1679   0.924E-02    0.504E-02
RMM:   7    -0.184344254169E+04   -0.26308E-04   -0.73407E-04  1293   0.387E-02
    2 T=   263. E= -.18350927E+04 F= -.18434425E+04 E0= -.18434425E+04  EK= 0.83498E+01 SP= 0.00E+00 SK= 0.00E+00
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 30925 on node n48618 exited on signal 11 (Segmentation fault).
-------------------------------------------------------------------------- 
The same job completes successfully when we run the MD simulation with the CPU version of VASP 5.4.1 on another supercomputer with Intel Xeon X5570 nodes and the modules intel/15.0.090, impi/5.0.1, mkl/11.2.0.
The output of the successful MD simulation is below:

Code: Select all

 running on  122 total cores
 distrk:  each k-point on  122 cores,    1 groups
 distr:  one band on    1 cores,  122 groups
 using from now: INCAR     
 vasp.5.4.1 05Feb16 (build Feb 22 2016 23:54:54) complex                        
  
 POSCAR found type information on POSCAR  O  Si
 POSCAR found :  2 types and     243 ions

 ----------------------------------------------------------------------------- 
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|      For optimal performance we recommend to set                            |
|        NCORE= 4 - approx SQRT( number of cores)                             |
|      NCORE specifies how many cores store one orbital (NPAR=cpu/NCORE).     |
|      This setting can  greatly improve the performance of VASP for DFT.     |
|      The default, NPAR=number of cores might be grossly inefficient         |
|      on modern multi-core architectures or massively parallel machines.     |
|      Do your own testing !!!!                                               |
|      Unfortunately you need to use the default for GW and RPA calculations. |
|      (for HF NCORE is supported but not extensively tested yet)             |
|                                                                             |
 ----------------------------------------------------------------------------- 

 LDA part: xc-table for Pade appr. of Perdew

 ----------------------------------------------------------------------------- 
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|      VASP found    726 degrees of freedom                                   |
|      the temperature will equal 2*E(kin)/ (degrees of freedom)              |
|      this differs from previous releases, where T was 2*E(kin)/(3 NIONS).   |
|      The new definition is more consistent                                  |
|                                                                             |
 ----------------------------------------------------------------------------- 

 POSCAR, INCAR and KPOINTS ok, starting setup
 WARNING: small aliasing (wrap around) errors must be expected
 FFT: planning ...
 WAVECAR not read
 prediction of wavefunctions initialized - no I/O
 entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1     0.216431141959E+05    0.21643E+05   -0.42943E+05   854   0.110E+03
RMM:   2     0.148501855704E+05   -0.67929E+04   -0.68490E+04   854   0.470E+02
RMM:   3     0.878612807270E+04   -0.60641E+04   -0.41293E+04   854   0.342E+02
RMM:   4     0.548647806810E+04   -0.32997E+04   -0.24740E+04   854   0.265E+02
RMM:   5     0.352306914928E+04   -0.19634E+04   -0.15982E+04   854   0.223E+02
RMM:   6     0.225702374355E+04   -0.12660E+04   -0.11169E+04   854   0.197E+02
RMM:   7     0.133041003489E+04   -0.92661E+03   -0.86353E+03   854   0.181E+02
RMM:   8     0.594936726974E+03   -0.73547E+03   -0.70270E+03   854   0.167E+02
RMM:   9    -0.129355660194E+04   -0.18885E+04   -0.16595E+04  2320   0.156E+02
RMM:  10    -0.207907603242E+04   -0.78552E+03   -0.46532E+03  2295   0.444E+01
RMM:  11    -0.213418000652E+04   -0.55104E+02   -0.71835E+02  2103   0.411E+01
RMM:  12    -0.216872659493E+04   -0.34547E+02   -0.32902E+02  1923   0.963E+00    0.111E+02
RMM:  13    -0.190551200897E+04    0.26321E+03   -0.90315E+02  2308   0.466E+01    0.809E+01
RMM:  14    -0.191496197953E+04   -0.94500E+01   -0.23886E+02  2228   0.188E+01    0.661E+01
RMM:  15    -0.192814651192E+04   -0.13185E+02   -0.73264E+01  2040   0.152E+01    0.248E+01
RMM:  16    -0.191420569522E+04    0.13941E+02   -0.25195E+01  1892   0.114E+01    0.836E+00
RMM:  17    -0.191480751409E+04   -0.60182E+00   -0.20339E+01  2144   0.591E+00    0.642E+00
RMM:  18    -0.191480691969E+04    0.59440E-03   -0.19970E+00  1875   0.327E+00    0.345E+00
RMM:  19    -0.191493673947E+04   -0.12982E+00   -0.13952E+00  2010   0.197E+00    0.324E+00
RMM:  20    -0.191485577361E+04    0.80966E-01   -0.14144E-01  1872   0.948E-01    0.226E+00
RMM:  21    -0.191482345668E+04    0.32317E-01   -0.10659E-01  2020   0.468E-01    0.106E+00
RMM:  22    -0.191481861725E+04    0.48394E-02   -0.39937E-02  2208   0.319E-01    0.622E-01
RMM:  23    -0.191481766996E+04    0.94728E-03   -0.21691E-02  1996   0.215E-01    0.227E-01
RMM:  24    -0.191481811091E+04   -0.44095E-03   -0.65005E-03  1906   0.175E-01    0.113E-01
RMM:  25    -0.191481826537E+04   -0.15446E-03   -0.11978E-03  1617   0.676E-02    0.751E-02
RMM:  26    -0.191481832160E+04   -0.56229E-04   -0.25521E-04  1185   0.332E-02
    1 T=   301. E= -.19054124E+04 F= -.19148183E+04 E0= -.19148183E+04  EK= 0.94059E+01 SP= 0.00E+00 SK= 0.00E+00
 bond charge predicted
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.191380220994E+04    0.10161E+01   -0.49753E+01  2179   0.117E+01    0.126E+00
RMM:   2    -0.191471643144E+04   -0.91422E+00   -0.98691E+00  2115   0.600E+00    0.802E-01
RMM:   3    -0.191473294215E+04   -0.16511E-01   -0.29038E-01  2002   0.618E-01    0.629E-01
RMM:   4    -0.191472627645E+04    0.66657E-02   -0.36815E-02  1834   0.428E-01    0.348E-01
RMM:   5    -0.191472523053E+04    0.10459E-02   -0.32551E-02  1829   0.240E-01    0.165E-01
RMM:   6    -0.191472526388E+04   -0.33345E-04   -0.59988E-03  1685   0.193E-01    0.877E-02
RMM:   7    -0.191472542890E+04   -0.16502E-03   -0.38046E-03  1879   0.876E-02    0.898E-02
RMM:   8    -0.191472530362E+04    0.12528E-03   -0.89987E-04  1499   0.693E-02    0.319E-02
RMM:   9    -0.191472531426E+04   -0.10638E-04   -0.29434E-04  1213   0.241E-02
    2 T=   298. E= -.19054113E+04 F= -.19147253E+04 E0= -.19147253E+04  EK= 0.93141E+01 SP= 0.00E+00 SK= 0.00E+00
 bond charge predicted
 prediction of wavefunctions
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.191436534444E+04    0.35996E+00   -0.25506E-01  2002   0.955E-01    0.132E-01
RMM:   2    -0.191436957189E+04   -0.42274E-02   -0.43352E-02  1840   0.478E-01    0.522E-02
RMM:   3    -0.191437142206E+04   -0.18502E-02   -0.18556E-02  1708   0.102E-01    0.693E-02
RMM:   4    -0.191437141249E+04    0.95722E-05   -0.78607E-04  1509   0.717E-02
    3 T=   287. E= -.19054083E+04 F= -.19143714E+04 E0= -.19143714E+04  EK= 0.89631E+01 SP= 0.00E+00 SK= 0.00E+00
 bond charge predicted
 prediction of wavefunctions
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.191381705936E+04    0.55436E+00   -0.62560E-02  2346   0.357E-01    0.644E-02
RMM:   2    -0.191381792608E+04   -0.86672E-03   -0.90220E-03  2069   0.761E-02    0.505E-02
RMM:   3    -0.191381794926E+04   -0.23177E-04   -0.63733E-04  1392   0.549E-02
    4 T=   269. E= -.19054039E+04 F= -.19138179E+04 E0= -.19138179E+04  EK= 0.84141E+01 SP= 0.00E+00 SK= 0.00E+00
 bond charge predicted
 prediction of wavefunctions
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.191315355558E+04    0.66437E+00   -0.10716E-01  2397   0.339E-01    0.130E-01
RMM:   2    -0.191315456585E+04   -0.10103E-02   -0.12246E-02  2234   0.129E-01    0.420E-02
RMM:   3    -0.191315467423E+04   -0.10838E-03   -0.13635E-03  1542   0.464E-02    0.213E-02
RMM:   4    -0.191315468571E+04   -0.11478E-04   -0.27159E-04  1142   0.289E-02
    5 T=   248. E= -.19053985E+04 F= -.19131547E+04 E0= -.19131547E+04  EK= 0.77561E+01 SP= 0.00E+00 SK= 0.00E+00
 bond charge predicted
 prediction of wavefunctions
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.191248159339E+04    0.67308E+00   -0.11842E-01  2400   0.367E-01    0.708E-02
RMM:   2    -0.191248295088E+04   -0.13575E-02   -0.13906E-02  2229   0.124E-01    0.451E-02
RMM:   3    -0.191248307928E+04   -0.12841E-03   -0.16352E-03  1562   0.501E-02    0.190E-02
RMM:   4    -0.191248309655E+04   -0.17271E-04   -0.25035E-04  1157   0.374E-02
    6 T=   227. E= -.19053927E+04 F= -.19124831E+04 E0= -.19124831E+04  EK= 0.70904E+01 SP= 0.00E+00 SK= 0.00E+00
 bond charge predicted
 prediction of wavefunctions
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.191189705599E+04    0.58602E+00   -0.12630E-01  2488   0.315E-01    0.653E-02
RMM:   2    -0.191189786139E+04   -0.80540E-03   -0.85453E-03  2164   0.120E-01    0.278E-02
RMM:   3    -0.191189798238E+04   -0.12099E-03   -0.13235E-03  1594   0.368E-02    0.193E-02
RMM:   4    -0.191189799015E+04   -0.77675E-05   -0.16916E-04  1125   0.277E-02
    7 T=   208. E= -.19053880E+04 F= -.19118980E+04 E0= -.19118980E+04  EK= 0.65100E+01 SP= 0.00E+00 SK= 0.00E+00
Information: wavefunction orthogonal band  838  0.8936
Information: wavefunction orthogonal band  840  0.8926
Information: wavefunction orthogonal band  841  0.8995
Information: wavefunction orthogonal band  842  0.8998
Information: wavefunction orthogonal band  843  0.8968
Information: wavefunction orthogonal band  845  0.8959
Information: wavefunction orthogonal band  846  0.8849
Information: wavefunction orthogonal band  847  0.8939
Information: wavefunction orthogonal band  848  0.8648
Information: wavefunction orthogonal band  849  0.8833
Information: wavefunction orthogonal band  850  0.8827
Information: wavefunction orthogonal band  851  0.8814
Information: wavefunction orthogonal band  852  0.8697
Information: wavefunction orthogonal band  853  0.8780
Information: wavefunction orthogonal band  854  0.8795
 bond charge predicted
 prediction of wavefunctions
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.191147067616E+04    0.42731E+00   -0.88569E-02  2478   0.268E-01    0.432E-02
RMM:   2    -0.191147130034E+04   -0.62418E-03   -0.64143E-03  2102   0.108E-01    0.276E-02
RMM:   3    -0.191147140183E+04   -0.10149E-03   -0.11003E-03  1559   0.353E-02    0.181E-02
RMM:   4    -0.191147141678E+04   -0.14946E-04   -0.19650E-04  1146   0.312E-02
    8 T=   195. E= -.19053857E+04 F= -.19114714E+04 E0= -.19114714E+04  EK= 0.60857E+01 SP= 0.00E+00 SK= 0.00E+00
Information: wavefunction orthogonal band  853  0.8892
 bond charge predicted
 prediction of wavefunctions
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.191123955161E+04    0.23185E+00   -0.10302E-01  2426   0.251E-01    0.411E-02
RMM:   2    -0.191124014174E+04   -0.59013E-03   -0.61092E-03  1989   0.139E-01    0.223E-02
RMM:   3    -0.191124029109E+04   -0.14935E-03   -0.15365E-03  1724   0.345E-02    0.161E-02
RMM:   4    -0.191124030456E+04   -0.13473E-04   -0.19214E-04  1146   0.299E-02
    9 T=   187. E= -.19053856E+04 F= -.19112403E+04 E0= -.19112403E+04  EK= 0.58547E+01 SP= 0.00E+00 SK= 0.00E+00
 bond charge predicted
 prediction of wavefunctions
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.191120247177E+04    0.37819E-01   -0.10602E-01  2401   0.255E-01    0.426E-02
RMM:   2    -0.191120312176E+04   -0.64999E-03   -0.67053E-03  2038   0.140E-01    0.223E-02
RMM:   3    -0.191120327196E+04   -0.15020E-03   -0.15437E-03  1718   0.337E-02    0.158E-02
RMM:   4    -0.191120328759E+04   -0.15632E-04   -0.19134E-04  1146   0.298E-02
   10 T=   186. E= -.19053879E+04 F= -.19112033E+04 E0= -.19112033E+04  EK= 0.58154E+01 SP= 0.00E+00 SK= 0.00E+00
 bond charge predicted
 prediction of wavefunctions
 wavefunctions rotated 
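
To make the two logs easier to compare step by step, here is a minimal Python sketch that pulls the per-step summary lines (the ones of the form `N T= ... E= ... F= ... E0= ... EK= ...`) out of a saved stdout file. It is not part of VASP: the names `MDStep` and `parse_md_steps` are made up for this illustration, and the regular expression simply assumes the line layout shown in the outputs above.

```python
import re
from typing import List, NamedTuple

# Assumed layout, taken from the logs above:
#   "    1 T=   280. E= -.18351244E+04 F= -.18440360E+04 E0= ... EK= 0.89116E+01 ..."
STEP_RE = re.compile(
    r"^\s*(\d+)\s+T=\s*(\S+)\s+E=\s*(\S+)\s+F=\s*(\S+)\s+E0=\s*(\S+)\s+EK=\s*(\S+)"
)


class MDStep(NamedTuple):
    """One MD step; field names mirror the labels printed in the log."""
    step: int
    T: float
    E: float
    F: float
    E0: float
    EK: float


def parse_md_steps(text: str) -> List[MDStep]:
    """Return every per-step summary line found in `text`, in order."""
    steps = []
    for line in text.splitlines():
        match = STEP_RE.match(line)
        if match is None:
            continue  # electronic (RMM:) lines, banners, warnings, ...
        n, t, e, f, e0, ek = match.groups()
        steps.append(MDStep(int(n), float(t), float(e), float(f), float(e0), float(ek)))
    return steps


if __name__ == "__main__":
    # Two summary lines copied from the GPU log above.
    sample = (
        "    1 T=   280. E= -.18351244E+04 F= -.18440360E+04 "
        "E0= -.18440360E+04  EK= 0.89116E+01 SP= 0.00E+00 SK= 0.00E+00\n"
        "RMM:   1    -0.184272191805E+04    0.13141E+01   -0.43494E+01  2118\n"
        "    2 T=   263. E= -.18350927E+04 F= -.18434425E+04 "
        "E0= -.18434425E+04  EK= 0.83498E+01 SP= 0.00E+00 SK= 0.00E+00\n"
    )
    for s in parse_md_steps(sample):
        print(s.step, s.T, s.F, s.EK)
```

Running `parse_md_steps` on the GPU and CPU stdout files gives two lists that can be compared entry by entry, and the length of the GPU list shows the last MD step that finished before the crash.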
Can anyone help me out?
Thanks!
