Hi,
I am experiencing a persistent synchronization issue when running VASP 6.5.1 GPU builds compiled with the NVHPC 25.5 toolchain.
The build uses nvfortran 25.5 with CUDA 12.9, and the bundled OpenMPI 4.1.5 (UCX + vader + tcp).
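For reference, the CUDA awareness of the bundled OpenMPI can be checked with a standard ompi_info query (sketch below, nothing VASP-specific):

# query whether this OpenMPI build was compiled with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# prints a line of the form mca:mpi:base:param:mpi_built_with_cuda_support:value:<true|false>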
When running large finite-difference frequency calculations on 3 or 4 GPUs, the job runs normally for approximately 12–48 hours and then consistently hangs.
After the hang occurs, one GPU drops to 0% utilization while the others remain pinned at 100%.
All GPUs retain their allocated VRAM, and the calculation makes no further progress.
There are no ECC errors or Xid messages recorded in the kernel logs.
Backtraces collected after terminating the job indicate that the hang is a cross-rank deadlock between CUDA stream synchronization and the MPI collectives.
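For anyone who wants to look at the raw data: backtraces like the ones below can be captured by briefly attaching gdb to each rank of the hung job, roughly as follows (a minimal sketch; the process name vasp_gpu is a placeholder, adjust it to the actual executable):

# attach gdb to every VASP rank on the node and dump all thread backtraces
# (vasp_gpu is a placeholder process name)
for pid in $(pgrep -f vasp_gpu); do
    echo "=== PID $pid ==="
    gdb --batch -p "$pid" -ex "thread apply all bt"
done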
Here is the rank dump for one of the busy GPUs:
#0 cuStreamSynchronize() from libcuda.so.1
#1 __pgi_uacc_cuda_wait() at cuda_wait.c:82
#2 m_sumb_d() at mpi.f90:1700
#3 gridq::sumrl_gq() ...
#4 charge::soft_charge() ...
#5 elmin() ...
And here is the rank dump for the idle GPU:
#0 mca_btl_vader_component_progress()
#1 opal_progress()
#2 PMPI_Allreduce() [coll/cuda/coll_cuda_allreduce.c:63]
#3 m_sumb_d() at mpi.f90:1731
The job runs entirely within a single node and does not use InfiniBand (TCP-only communication).
The hang persists even after explicitly disabling NCCL's InfiniBand and peer-to-peer transports with NCCL_IB_DISABLE=1 and NCCL_P2P_DISABLE=1 in the Slurm sbatch script.
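The relevant part of the sbatch script looks roughly like this (the resource directives and the launch line are placeholders; the two NCCL exports are exactly as described above):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:4            # 3- and 4-GPU runs both show the hang

export NCCL_IB_DISABLE=1        # disable NCCL's InfiniBand transport
export NCCL_P2P_DISABLE=1       # disable NCCL's peer-to-peer transport

srun vasp_gpu                   # placeholder launcher and binary name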
Is there an "official" recommended runtime setup that enforces correct CUDA and MPI behavior in GPU builds of VASP, without having to set MCA parameters manually?
Regards,
Zhiyuan

