
Re: Linking error compiling GPU version of Vasp 6.3.0

Posted: Fri Mar 25, 2022 9:19 am
by marie-therese.huebsch
Ok, no worries. We can make this work.

1. Did you set LD_LIBRARY_PATH? I cannot see that you added the system library path to your .bashrc. So, did my previous suggestion help to solve the following error:

Code: Select all

/data/Software/Vasp/vasp.6.3.0/bin/vasp_std: error while loading shared libraries: libqdmod.so.0: cannot open shared object file: No such file or directory
2. Regarding the new error:

Code: Select all

/proj/nv/libraries/Linux_x86_64/22.2/openmpi/209518-rel-1/share/openmpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
Did you install OpenMPI yourself? And did you add openmpi/lib to your LD_LIBRARY_PATH?

As a sanity check, you can look up where help-mpi-runtime.txt is located on your system. The one that ships with NV 22.2 is expected to be at

Code: Select all

 $NVROOT/comm_libs/openmpi/openmpi-3.1.5/share/openmpi/help-mpi-runtime.txt
That is why I assume you installed OpenMPI yourself.
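
The checks asked about above (is a directory on LD_LIBRARY_PATH, which shared libraries fail to resolve, where help-mpi-runtime.txt actually lives) could be scripted roughly as follows. This is only a minimal sketch: the function names `in_ld_path`, `missing_libs` and `find_mpi_help` are hypothetical helpers made up for illustration, and it assumes a Linux system where `ldd` and `find` are available.

```shell
#!/bin/sh
# Hypothetical helper: succeed if directory $1 is an exact entry of LD_LIBRARY_PATH.
in_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Hypothetical helper: print the shared libraries that binary $1 cannot resolve
# (a line mentioning libqdmod.so.0 would correspond to the first error above).
missing_libs() {
    ldd "$1" 2>/dev/null | grep "not found"
}

# Hypothetical helper: print every help-mpi-runtime.txt found below directory $1.
find_mpi_help() {
    find "$1" -name help-mpi-runtime.txt 2>/dev/null
}
```

For example, `missing_libs /data/Software/Vasp/vasp.6.3.0/bin/vasp_std` prints nothing once all libraries resolve, and `find_mpi_help "$NVROOT"` (assuming NVROOT points to the NVIDIA HPC SDK installation) shows whether the file sits under comm_libs as expected.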

Re: Linking error compiling GPU version of Vasp 6.3.0

Posted: Sat Mar 26, 2022 8:55 am
by paulfons
I have successfully run the simpleMPI example in cuda-samples using "mpirun -n 32 simpleMPI". I assume I am making a (fundamental) mistake in how I invoke Vasp with the GPU card. I now realize I have to use mpirun (from the NVIDIA OpenMPI installation). With 32 ranks, this results in many copies of the error in the second inset below, which reads "In most cases this means several MPI-ranks want to share a GPU, which is not supported by NCCL". Running vasp_ncl on a single core with "mpirun -n 1 vasp_ncl" works, but it doesn't seem particularly fast compared to the CPU version. I assume that to get a speedup I need to set KPAR in the INCAR file. Is this correct? With KPAR set to 56 the run seems to start correctly, but I encounter numerical errors (see third inset).
The last inset shows the INCAR file I used. The sample input (with the appropriate KPAR setting) works fine for the CPU version. Any suggestions as to how to get this test run to work correctly? (Thanks!)

Code: Select all

Sat Mar 26 17:42:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   29C    P0    31W / 165W |      0MiB / 24576MiB |      3%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Code: Select all

Vasp/GaAs>mpirun -n 32 /data/Software/Vasp/vasp.6.3.0/bin/vasp_ncl
 running on   32 total cores
 distrk:  each k-point on    2 cores,   16 groups
 distr:  one band on    1 cores,    2 groups
 OpenACC runtime initialized ...    1 GPUs detected
 -----------------------------------------------------------------------------
|                                                                             |
|     EEEEEEE  RRRRRR   RRRRRR   OOOOOOO  RRRRRR      ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     EEEEE    RRRRRR   RRRRRR   O     O  RRRRRR       #       #       #      |
|     E        R   R    R   R    O     O  R   R                               |
|     E        R    R   R    R   O     O  R    R      ###     ###     ###     |
|     EEEEEEE  R     R  R     R  OOOOOOO  R     R     ###     ###     ###     |
|                                                                             |
|     M_init_nccl: failed to initialize a NCCL communicator.                  |
|     In most cases this means several MPI-ranks want to share a GPU,         |
|     which is not supported by NCCL. If this is the case, either reduce      |
|     the number of MPI-ranks (#-of-ranks <= #-of-GPUs) or run with           |
|     LUSENCCL = .FALSE.                                                      |
|                                                                             |
|       ---->  I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <----       |
|                                                                             |
 -----------------------------------------------------------------------------

Code: Select all

Vasp/GaAs>mpirun -n 1 /data/Software/Vasp/vasp.6.3.0/bin/vasp_ncl
 running on    1 total cores
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on    1 cores,    1 groups
 OpenACC runtime initialized ...    1 GPUs detected
 vasp.6.3.0 20Jan22 (build Mar 19 2022 10:49:18) complex                         
 POSCAR found type information on POSCAR GaAs
 POSCAR found :  2 types and       2 ions
 -----------------------------------------------------------------------------
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|     You use a magnetic or noncollinear calculation, but did not specify     |
|     the initial magnetic moment with the MAGMOM tag. Note that a            |
|     default of 1 will be used for all atoms. This ferromagnetic setup       |
|     may break the symmetry of the crystal, in particular it may rule        |
|     out finding an antiferromagnetic solution. Thence, we recommend         |
|     setting the initial magnetic moment manually or verifying carefully     |
|     that this magnetic setup is desired.                                    |
|                                                                             |
 -----------------------------------------------------------------------------

 scaLAPACK will be used selectively (only on CPU)
 -----------------------------------------------------------------------------
|                                                                             |
|               ----> ADVICE to this user running VASP <----                  |
|                                                                             |
|     You enforced a specific xc type in the INCAR file but a different       |
|     type was found in the POTCAR file.                                      |
|     I HOPE YOU KNOW WHAT YOU ARE DOING!                                     |
|                                                                             |
 -----------------------------------------------------------------------------

 LDA part: xc-table for Pade appr. of Perdew
 found WAVECAR, reading the header
 POSCAR, INCAR and KPOINTS ok, starting setup
 FFT: planning ... GRIDC
 FFT: planning ... GRID_SOFT
 FFT: planning ... GRID
 reading WAVECAR
 the WAVECAR file was read successfully
 augmentation electrons    18.10200827988438     
 soft         electrons    0.000000000000000     
 total        electrons    18.10200827988438     
 augmentation electrons   1.5747565785228942E-002
 soft         electrons    0.000000000000000     
 total        electrons   1.5747565785228942E-002
 augmentation electrons   1.5747565785228942E-002
 soft         electrons    0.000000000000000     
 total        electrons   1.5747565785228942E-002
 augmentation electrons   1.5747565785228942E-002
 soft         electrons    0.000000000000000     
 total        electrons   1.5747565785228942E-002
 augmentation electrons    134.9612722547786     
 soft         electrons    0.000000000000000     
 total        electrons    134.9612722547786     
 augmentation electrons    131.7019698547786     
 soft         electrons    0.000000000000000     
 total        electrons    131.7019698547786     
 augmentation electrons    131.7019698547786     
 soft         electrons    0.000000000000000     
 total        electrons    131.7019698547786     
 augmentation electrons    131.7019698547786     
 soft         electrons    0.000000000000000     
 total        electrons    131.7019698547786     
 reading imaginary part of occupancies ...
 charge-density read from file: unknown                                 
 reading imaginary part of occupancies ...
 magnetization density read from file 1
 entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
DAV:   1    -0.906486783406E+01   -0.90649E+01   -0.14944E-04 15296   0.114E-01    0.233E-02
WARNING in EDDRMM: call to ZHEGV failed, returncode =   6  3     21
RMM:   2    -0.906374086794E+01    0.11270E-02   -0.36541E-05 16276   0.619E-02    0.124E-02
RMM:   3    -0.906374171468E+01   -0.84673E-06   -0.50900E-06 18352   0.237E-02    0.161E-03
WARNING in EDDRMM: call to ZHEGV failed, returncode =   6  3     24
WARNING in EDDRMM: call to ZHEGV failed, returncode =   8  4     24
RMM:   4    -0.906374176996E+01   -0.55288E-07   -0.64887E-07 16161   0.825E-03
   1 F= -.90637418E+01 E0= -.90637418E+01  d E =0.000000E+00  mag=    -0.0004    -0.0027     0.0000
 writing wavefunctions
 augmentation electrons    7.718732353504382     
 soft         electrons    10.36378357893159     
 total        electrons    18.08251593243597     
 augmentation electrons   2.0020454222713033E-005
 soft         electrons    10.36378357893159     
 total        electrons  -3.8122818632383288E-004
 augmentation electrons   1.4820234856620306E-004
 soft         electrons    10.36378357893159     
 total        electrons  -2.8444639199147587E-003
 augmentation electrons  -1.2472124912546340E-006
 soft         electrons    10.36378357893159     
 total        electrons   2.4033158347232809E-005
Warning: ieee_invalid is signaling
Warning: ieee_divide_by_zero is signaling
Warning: ieee_underflow is signaling
Warning: ieee_inexact is signaling
FORTRAN STOP

Code: Select all

Vasp/GaAs>cat INCAR 
ALGO = Fast
EDIFF = 1E-7
ENCUT = 520
IBRION = 2
ICHARG = 1
ISIF = 3
ISMEAR = -5
LORBIT = 11
LSORBIT = True
LREAL = False
LWAVE = True
NELM = 100
NSW = 0
PREC = Accurate
SIGMA = 0.05
LAECHG = True

GGA = PS

KPAR = 56
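
The M_init_nccl message in the second inset states the condition #-of-ranks <= #-of-GPUs. A minimal sketch of choosing a rank count that respects it is given below. The names `clamp_ranks` and `count_gpus` are hypothetical helpers, and counting GPUs via `nvidia-smi -L` (one "GPU ..." line per device) assumes MIG is disabled, as the first inset reports.

```shell
#!/bin/sh
# Hypothetical helper: print the smaller of the requested ranks ($1)
# and the number of GPUs ($2).
clamp_ranks() {
    if [ "$1" -le "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Hypothetical helper: count GPUs as the number of lines starting with "GPU "
# in the output of `nvidia-smi -L`; prints 0 if nvidia-smi is unavailable.
count_gpus() {
    nvidia-smi -L 2>/dev/null | grep -c '^GPU '
}
```

With the single A30 shown in the first inset, `clamp_ranks 32 1` prints 1, which matches the `mpirun -n 1` invocation in the third inset. The alternative named in the error message is to keep more ranks and set LUSENCCL = .FALSE. in the INCAR file.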


Re: Linking error compiling GPU version of Vasp 6.3.0

Posted: Tue Mar 29, 2022 3:37 pm
by marie-therese.huebsch
Hi paulfons,

it seems you are sorting out how to submit a job now. This thread has become quite long already and you have not answered the questions I have asked before:
1. Did you set LD_LIBRARY_PATH? I cannot see that you added the system library path to your .bashrc. So, did my previous suggestion help to solve the following error:
Code: Select all
/data/Software/Vasp/vasp.6.3.0/bin/vasp_std: error while loading shared libraries: libqdmod.so.0: cannot open shared object file: No such file or directory
2. Regarding the new error:
Code: Select all
/proj/nv/libraries/Linux_x86_64/22.2/openmpi/209518-rel-1/share/openmpi/help-mpi-runtime.txt: No such file or directory. Sorry!
Did you install OpenMPI yourself? And did you add openmpi/lib to your LD_LIBRARY_PATH?
Could you please respond so I can follow what your current status is?
For the other issues, I suggest that you open a new thread. I will be happy to help also in understanding KPAR etc.

Thank you for understanding.
Kind regards,
Marie-Therese

Re: Linking error compiling GPU version of Vasp 6.3.0

Posted: Wed Mar 30, 2022 4:56 am
by paulfons
Hi,
I am sorry for the delay in posting a response. I did enter an update, but I must not have posted it correctly. In any case, the GPU version of Vasp works. I learned (from the wiki) that only a single MPI process per GPU is allowed due to the NCCL libraries. This seems like a significant limitation for smaller systems, but hopefully it will be addressed in the future. In the meantime, I have been trying to learn how to optimize the throughput on my Ampere 100 card. I tried a few runs with vasp_gam for an MD simulation with a few hundred atoms. I tried varying NSIM, and a larger value than I typically use for a CPU calculation seems better (NSIM=40 gave the shortest time in my limited testing). Can you offer any insight into the best parameters (NSIM, others?) for optimizing a GPU-based calculation? I assume that for a system with a larger number of k-points, KPAR would be another parameter to vary. Is there any sort of "rulebook" for getting a handle on GPU calculation optimization?
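
A parameter scan like the NSIM test described above can be made less error-prone by editing the INCAR file from a script. The following is a minimal sketch only: `set_incar_tag` is a hypothetical helper, not part of VASP, and it assumes the simple one-tag-per-line "TAG = VALUE" layout used in the INCAR posted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: set TAG ($2) to VALUE ($3) in the INCAR-style file $1.
# Any existing line assigning TAG is removed and "TAG = VALUE" is appended.
set_incar_tag() {
    file=$1
    tag=$2
    value=$3
    [ -f "$file" ] || return 1
    tmp=$(mktemp) || return 1
    # Keep every line except existing assignments of this tag.
    grep -v "^[[:space:]]*$tag[[:space:]]*=" "$file" > "$tmp" || :
    printf '%s = %s\n' "$tag" "$value" >> "$tmp"
    # Write back in place so the original file permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

One way to use it would be a loop such as `for n in 4 16 40; do set_incar_tag INCAR NSIM "$n"; mpirun -n 1 vasp_gam; done`, saving the output of each run before the next one starts so the timings can be compared. The NSIM values in that loop are placeholders, not recommendations.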

Re: Linking error compiling GPU version of Vasp 6.3.0

Posted: Wed Mar 30, 2022 6:28 am
by marie-therese.huebsch
Hi paulfons,

Thank you for confirming that the GPU version of VASP works on your machine!

Therefore, I will close this topic now, as your follow-up questions do not fit the title "Linking error compiling GPU version of Vasp 6.3.0". Could you please ask your questions about optimization in a new thread with an appropriate title? I am very sorry for the inconvenience and hope for your understanding.

Kind regards,
Marie-Therese