Testing GPU build

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.


jeff_nucciarone
Newbie
Posts: 8
Joined: Mon Sep 08, 2025 8:15 pm

Testing GPU build

#1 Post by jeff_nucciarone » Tue Sep 16, 2025 6:00 pm

I built a version of VASP 6.3.1 to enable use of GPUs. My makefile.include was based on arch/makefile.include.nvhpc_ompi_mkl_omp_acc, but I disabled the OpenMP directives. I built on a node with an A100 GPU, using Intel mkl/2021.4.0 and nvhpc/21.9.

I ran make all and received three executables: vasp_gam, vasp_ncl, and vasp_std.

I assumed these were all GPU-enabled.

However, a test run showed no evidence that the GPU was used at all: while the test case was running, repeated runs of nvidia-smi indicated zero GPU activity.
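For reference, this is roughly how I was watching the card while the job ran (a sketch of the kind of polling I did; any repeated call to nvidia-smi would show the same thing):

Code:

# poll GPU utilization and memory use every 2 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 2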

Did I need to do something else for the build, or was my test case perhaps not a proper one for GPU testing?

Is there a good test case for testing GPU usage?


max_liebetreu
Global Moderator
Posts: 45
Joined: Mon Mar 03, 2025 2:42 pm

Re: Testing GPU build

#2 Post by max_liebetreu » Wed Sep 17, 2025 7:40 am

Hello Jeff,

Thank you for reaching out to us, and welcome to the VASP Forum!

To help you, could we ask you the following:

  • Did you follow the guide for compiling VASP with GPU (OpenACC) support on the VASP wiki?
  • Could you provide a minimal reproducible example for your test case?
  • Could you show us your makefile.include?

Best regards,

Max Liebetreu
VASP developer


jeff_nucciarone
Newbie
Posts: 8
Joined: Mon Sep 08, 2025 8:15 pm

Re: Testing GPU build

#3 Post by jeff_nucciarone » Wed Sep 17, 2025 2:12 pm

Thanks for your reply.

To answer your questions, I did follow the guide, but I will double-check to make sure I hit all the steps and requirements.

For my makefile.include, I am not sure how to attach a file here, so I will put it below. Note that this is a bare minimum with no HDF5 or WANNIER90.

Code:

# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxNV\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dqd_emulate \
              -Dfock_dblbuf \
              -D_OPENMP

CPP         = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX)  > $*$(SUFFIX)

FC          = mpif90 -mp
FCL         = mpif90 -mp -c++libs

FREE        = -Mfree

FFLAGS      = -Mbackslash -Mlarge_arrays

OFLAG       = -fast

DEBUG       = -Mfree -O0 -traceback

OBJECTS     = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o

# Redefine the standard list of O1 and O2 objects
SOURCE_O1  := pade_fit.o
SOURCE_O2  := pead.o

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = nvfortran
CC_LIB      = nvc -w
CFLAGS_LIB  = -O
FFLAGS_LIB  = -O1 -Mfixed
FREE_LIB    = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = nvc++ --no_warnings

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##
# When compiling on the target machine itself , change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -tp host
FFLAGS     += $(VASP_TARGET_CPU)

# Specify your NV HPC-SDK installation (mandatory)
#... first try to set it automatically
#NVROOT      =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')
#
# Empirically determined to be this
#
NVROOT       = /swst/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9
#
# If the above fails, then NVROOT needs to be set manually
#NVHPC      ?= /opt/nvidia/hpc_sdk
#NVVERSION   = 21.11
#NVROOT      = $(NVHPC)/Linux_x86_64/$(NVVERSION)

# Software emulation of quadruple precision (mandatory)
QD         ?= $(NVROOT)/compilers/extras/qd
LLIBS      += -L$(QD)/lib -lqdmod -lqd
INCS       += -I$(QD)/include/qd

# Intel MKL for FFTW, BLAS, LAPACK, and scaLAPACK
#MKLROOT    ?= /path/to/your/mkl/installation
LLIBS_MKL   = -Mmkl -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64
INCS       += -I$(MKLROOT)/include/fftw

# Use a separate scaLAPACK installation (optional but recommended in combination with OpenMPI)
# Comment out the two lines below if you want to use scaLAPACK from MKL instead
#SCALAPACK_ROOT ?= /path/to/your/scalapack/installation
#LLIBS_MKL   = -L$(SCALAPACK_ROOT)/lib -lscalapack -Mmkl

LLIBS      += $(LLIBS_MKL)

These are the linked libraries in the executable:

Code:

$ ldd  proj/VASP/vasp.6.3.1/bin/vasp_std 
	linux-vdso.so.1 (0x000014be90439000)
	libqdmod.so.0 => /swst/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/compilers/extras/qd/lib/libqdmod.so.0 (0x000014be8ffee000)
	libqd.so.0 => /swst/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/compilers/extras/qd/lib/libqd.so.0 (0x000014be8fdaa000)
	libmkl_scalapack_lp64.so.1 => /swst/apps/intel-oneapi-mkl/2021.4.0_gcc-8.5.0/mkl/2021.4.0/lib/intel64/libmkl_scalapack_lp64.so.1 (0x000014be8f67d000)
	libmkl_blacs_openmpi_lp64.so.1 => /swst/apps/intel-oneapi-mkl/2021.4.0_gcc-8.5.0/mkl/2021.4.0/lib/intel64/libmkl_blacs_openmpi_lp64.so.1 (0x000014be903e9000)
	libmpi_usempif08.so.40 => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi_usempif08.so.40 (0x000014be8f454000)
	libmpi_usempi_ignore_tkr.so.40 => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi_usempi_ignore_tkr.so.40 (0x000014be8f24f000)
	libmpi_mpifh.so.40 => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi_mpifh.so.40 (0x000014be8f002000)
	libmpi.so.40 => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi.so.40 (0x000014be8ebc3000)
	libmkl_intel_lp64.so.1 => /swst/apps/intel-oneapi-mkl/2021.4.0_gcc-8.5.0/mkl/2021.4.0/lib/intel64/libmkl_intel_lp64.so.1 (0x000014be8e024000)
	libmkl_intel_thread.so.1 => /swst/apps/intel-oneapi-mkl/2021.4.0_gcc-8.5.0/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.so.1 (0x000014be8a8d5000)
	libmkl_core.so.1 => /swst/apps/intel-oneapi-mkl/2021.4.0_gcc-8.5.0/mkl/2021.4.0/lib/intel64/libmkl_core.so.1 (0x000014be86467000)
	libacchost.so => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/compilers/lib/libacchost.so (0x000014be861f7000)
	libdl.so.2 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libdl.so.2 (0x000014be85ff3000)
	libnvhpcatm.so => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/compilers/lib/libnvhpcatm.so (0x000014be85de8000)
	libstdc++.so.6 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6 (0x000014be85a53000)
	libnvf.so => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/compilers/lib/libnvf.so (0x000014be8541e000)
	libnvomp.so => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/compilers/lib/libnvomp.so (0x000014be847a4000)
	libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libpthread.so.0 (0x000014be84584000)
	libnvcpumath.so => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/compilers/lib/libnvcpumath.so (0x000014be8414f000)
	libnvc.so => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/compilers/lib/libnvc.so (0x000014be83ef7000)
	librt.so.1 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/librt.so.1 (0x000014be83cef000)
	libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libm.so.6 (0x000014be8396d000)
	libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6 (0x000014be83596000)
	libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libgcc_s.so.1 (0x000014be8337e000)
	libatomic.so.1 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libatomic.so.1 (0x000014be83176000)
	libopen-rte.so.40 => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/comm_libs/openmpi/openmpi-3.1.5/lib/libopen-rte.so.40 (0x000014be82e32000)
	libopen-pal.so.40 => /storage/icds/swst/deployed/production/20220813/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/comm_libs/openmpi/openmpi-3.1.5/lib/libopen-pal.so.40 (0x000014be82969000)
	librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x000014be8274e000)
	libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000014be8252e000)
	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x000014be82322000)
	libutil.so.1 => /usr/lib64/libutil.so.1 (0x000014be8211e000)
	libz.so.1 => /usr/lib64/libz.so.1 (0x000014be81f06000)
	/lib64/ld-linux-x86-64.so.2 (0x000014be9020c000)
	libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x000014be81ce3000)
	libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x000014be81a51000)

For the test case, this was the INCAR file I used.

Code:

SYSTEM = MoS2_Fe
ISTART = 0; ICHARG = 2
ISMEAR = 0; SIGMA = 0.1
ENCUT = 400
ISIF = 2; IBRION = 2; POTIM = 0.2; NSW = 200
EDIFF = 1E-5; EDIFFG = -1E-2
PREC = High;
LREAL = Auto; 
LWAVE = False; LCHARG = False

Please let me know if you require any additional information.


max_liebetreu
Global Moderator
Posts: 45
Joined: Mon Mar 03, 2025 2:42 pm

Re: Testing GPU build

#4 Post by max_liebetreu » Thu Sep 18, 2025 11:52 am

Hello,

Thank you for the files. You can upload attachments by switching from the "Options" tab to the "Attachments" tab underneath the editor window, or by dragging and dropping files directly into your post. Are you running part of the test suite, or a custom setup? I might have misunderstood your original remark about testing.

I have some additional questions:

  • Which command are you using to run your test case, exactly?
  • Did you run make veryclean before compiling VASP with GPU support? (A typical clean rebuild is sketched below.)
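For reference, a clean GPU rebuild usually looks something like the sketch below (assuming the NVHPC and MKL environments are already loaded in your shell; adapt the template name if needed):

Code:

cp arch/makefile.include.nvhpc_ompi_mkl_omp_acc makefile.include
make veryclean       # remove objects left over from earlier (CPU-only) builds
make std             # or "make all" for vasp_std, vasp_gam and vasp_ncl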

Best regards,

Max Liebetreu
VASP developer


jeff_nucciarone
Newbie
Posts: 8
Joined: Mon Sep 08, 2025 8:15 pm

Re: Testing GPU build

#5 Post by jeff_nucciarone » Fri Sep 19, 2025 2:15 pm

The INCAR file is what I used for testing. Part of my question was whether this is even an appropriate test case. I assume it is not, and I would appreciate guidance on finding something suitable that demonstrates the GPU(s) were used. I'm not testing scaling just yet; we just want to know whether it works.

A secondary issue is that our cluster might not be offloading to the GPU properly, which is also why I inquired if my "makefile.include" was even close to correct.

I used the compiled "vasp_std" executable.

To rephrase my question:

We are not seeing the GPUs being used when I run the version of VASP compiled for GPUs. I'm not sure whether I built it incorrectly, whether our cluster has issues, or whether my test case is inappropriate for exercising the GPU. With your assistance, I hope to determine which of these is the cause. I hope this clears that up.

Thanks,

--Jeff


max_liebetreu
Global Moderator
Posts: 45
Joined: Mon Mar 03, 2025 2:42 pm

Re: Testing GPU build

#6 Post by max_liebetreu » Mon Sep 22, 2025 11:20 am

Hello Jeff,

I reckon some amount of GPU usage should show up for any VASP calculation that utilizes the GPU. Under that assumption, your test case should be sufficient.

In your case, the makefile.include is the culprit: we noticed that you are compiling VASP without OpenACC (the -D_OPENACC flag is missing) rather than without OpenMP (-D_OPENMP is still set). Therefore, VASP will only use the CPU.

We strongly recommend that you simply use the makefile.include we provide in arch; the better way to disable OpenMP threading is to set the environment variable OMP_NUM_THREADS=1 at run time. If compilation speed is a concern, and you are only compiling for your A100, you can substitute

Code:

FC          = mpif90 -acc -gpu=cc60,cc70,cc80,cuda11.0 -mp
FCL         = mpif90 -acc -gpu=cc60,cc70,cc80,cuda11.0 -mp -c++libs

with

Code:

FC          = mpif90 -acc -gpu=cc80,cuda11.0 -mp
FCL         = mpif90 -acc -gpu=cc80,cuda11.0 -mp -c++libs
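For completeness, a typical launch with OpenMP threading disabled and one MPI rank per GPU could look like this (a sketch; adapt the executable path and rank count to your job script):

Code:

export OMP_NUM_THREADS=1             # disable OpenMP threading globally
mpirun -np 1 /path/to/bin/vasp_std   # one MPI rank for the single A100 on the node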

We hope that helps!
Best regards,

Max Liebetreu
VASP developer


jeff_nucciarone
Newbie
Posts: 8
Joined: Mon Sep 08, 2025 8:15 pm

Re: Testing GPU build

#7 Post by jeff_nucciarone » Mon Sep 22, 2025 4:39 pm

Thank you for the update. I did start from the example in the arch directory, so it is interesting that some of the required flags ended up missing. I will double-check, recompile, and post the results here once I've made progress. This may take me a few days, as I am backed up with other projects.


jeff_nucciarone
Newbie
Posts: 8
Joined: Mon Sep 08, 2025 8:15 pm

Re: Testing GPU build

#8 Post by jeff_nucciarone » Thu Sep 25, 2025 2:00 pm

I was finally able to allocate a GPU node, but have run into some more difficulties. I suspect our environment is the cause.

Using the makefile.include example from arch/, I attempted a recompilation, but ran into issues with cublas and cufft not being found. I manually added the appropriate -L and -l options to the FCL line:

Code:

-L /storage/icds/RISE/sw8/cuda/cuda-13.0.0/targets/x86_64-linux/lib -l cufft -l cusolver -l cublas -l nccl

However, I immediately encountered an issue with NCCL. It turns out we do not have NCCL installed on our system.

Going back to the makefile, I see

-DUSENCCL -DUSENCCLP2P

Are these required? Can I leave off these two definitions and not use NCCL? Installing NCCL on our cluster may be a heavier lift than you think.

Apologies if this is a basic question; I am a novice when it comes to GPU compilation.

I am attaching my makefile.include

Thank you.


jeff_nucciarone
Newbie
Posts: 8
Joined: Mon Sep 08, 2025 8:15 pm

Re: Testing GPU build

#9 Post by jeff_nucciarone » Thu Sep 25, 2025 5:44 pm

A reply to my reply.

I was able to leave off the NCCL define directives. I had to make a few other changes, but VASP compiled successfully.

I had to change the compile and link lines to:

Code:

FC  = mpif90 -cudalib=cublas -acc -gpu=cc80,cuda11.0 -mp
FCL = mpif90 -cudalib=cublas -acc -gpu=cc80,cuda11.0 -mp -c++libs -L/swst/apps/cuda/11.5.0_gcc-8.5.0/lib64 -lcufft -lcusolver -lcublas

Here is the nvidia-smi output, and I think this looks good:

Code:

Thu Sep 25 13:37:59 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:3B:00.0 Off | 0 |
| N/A 37C P0 93W / 250W | 29074MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3388970 C vasp_std 6944MiB |
| 0 N/A N/A 3388971 C vasp_std 7012MiB |
| 0 N/A N/A 3388972 C vasp_std 7012MiB |
| 0 N/A N/A 3388973 C vasp_std 7976MiB |
+-----------------------------------------------------------------------------------------+

Whether this gives a performance boost will depend on the test case, but this is much-improved output from nvidia-smi.

Unless you have other recommendations, I consider this a successful conclusion, and I thank you very much for your assistance getting me on the correct path.


max_liebetreu
Global Moderator
Posts: 45
Joined: Mon Mar 03, 2025 2:42 pm

Re: Testing GPU build

#10 Post by max_liebetreu » Fri Sep 26, 2025 8:16 am

Hello Jeff,

Happy to hear everything is working now!

A few notes from our side:

  • We do consider the -DUSENCCL & -DUSENCCLP2P flags mandatory when compiling for GPUs. We are not sure exactly what happens when you leave them out; according to the wiki, it may cause performance problems or crashes, or the effect may be negligible, but we have not tested this.
  • However, NCCL is part of the NVIDIA HPC SDK, so you should already have access to it. Is it possible that NVROOT is not set correctly, or that some module is not loaded? We are a bit surprised that NCCL and some of the other libraries could not be found and had to be linked manually (see the sketch after this list).
  • You should probably also link in ScaLAPACK from MKL.
  • If everything is indeed working as expected on your end, then consider this message null and void. We just wanted to point out that running without NCCL has not been tested, and that changes to the makefile.include in general come with the caveat of being untested by us.
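To check whether NCCL is present in your HPC SDK installation, something along these lines should work (a sketch: comm_libs is the usual SDK layout, but the exact location on your cluster may differ):

Code:

# NVROOT as set in your makefile.include
ls /swst/apps/nvhpc/21.9_gcc-8.5.0/Linux_x86_64/21.9/comm_libs/nccl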

Best regards,

Max Liebetreu
VASP developer


jeff_nucciarone
Newbie
Posts: 8
Joined: Mon Sep 08, 2025 8:15 pm

Re: Testing GPU build

#11 Post by jeff_nucciarone » Fri Sep 26, 2025 1:23 pm

Thanks for the update. My test ran, but eventually timed out because the wall-clock time I requested was too short. It did look like it made progress, but I can't judge the performance as I don't have a baseline. I turned that part over to a colleague who is more familiar with the test case.

I will go back and check on NCCL. It could be that our installer did not add the required LIBPATH and LD_LIBRARY_PATH settings. I had to put the CUDA 11 path into the link command manually.
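If that turns out to be the case, something along these lines in my job script is probably what was missing (just a guess on my part; the path is the CUDA 11 one from my FCL line above):

Code:

# make the CUDA 11 libraries (cufft, cusolver, cublas) findable at run time
export LD_LIBRARY_PATH=/swst/apps/cuda/11.5.0_gcc-8.5.0/lib64:$LD_LIBRARY_PATH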

Thanks for the ScaLAPACK tip; I will also try incorporating that. I'll need to play around with my MKL link settings.

As always, I will keep you posted.


jeff_nucciarone
Newbie
Posts: 8
Joined: Mon Sep 08, 2025 8:15 pm

Re: Testing GPU build

#12 Post by jeff_nucciarone » Fri Sep 26, 2025 1:59 pm

Another update to my update.

I did find NCCL; it turns out whoever installed our NVHPC did not include the library path. I have corrected that and will give it a try once a GPU node comes free.

I did have a question regarding ScaLAPACK.

This is in my makefile.include:

Code:

# Intel MKL for FFTW, BLAS, LAPACK, and scaLAPACK
LLIBS_MKL   = -Mmkl -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64
INCS       += -I$(MKLROOT)/include/fftw

Does this not include ScaLAPACK? Am I missing a library? I haven't wandered over to the Intel link instruction builder page just yet; I'm hoping you have a quick answer for me.

Thanks again.


max_liebetreu
Global Moderator
Posts: 45
Joined: Mon Mar 03, 2025 2:42 pm

Re: Testing GPU build

#13 Post by max_liebetreu » Mon Sep 29, 2025 10:09 am

Hello,

No, you are correct - that line already includes ScaLAPACK. We simply overlooked that line in your initial makefile.include.
As long as MKLROOT is also set, this should be fine.
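A quick way to confirm both is to check the environment and the libraries linked into the final executable (a sketch; the library name follows from your link line):

Code:

echo $MKLROOT                          # should point to your MKL installation
ldd bin/vasp_std | grep -i scalapack   # should list libmkl_scalapack_lp64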

Judging from the nvidia-smi output, is it possible that you are running 4 MPI ranks on a single device? If so, that is currently not officially supported. You could send us a minimal reproducible example with an OUTCAR, and we could have a look at it.

It might be a good idea to test your setup on a medium-size system with one or a few k-points, so that the computation runs for a few minutes on your CPU; then test with one MPI rank on your GPU.

You might also want to check your std_out for something like this:

Code:

running    2 mpi-ranks, with    8 threads/rank, on    1 nodes
distrk:  each k-point on    2 cores,    1 groups
distr:  one band on    1 cores,    2 groups
Offloading initialized ...    2 GPUs detected

...though obviously the numbers will differ for your case.

Hope that helps!
Best regards,

Max Liebetreu
VASP developer

