Hi,
I am experiencing a persistent synchronization issue when running VASP 6.5.1 GPU builds compiled with the NVHPC 25.5 toolchain.
The build uses nvfortran 25.5 with CUDA 12.9, and the bundled OpenMPI 4.1.5 (UCX + vader + tcp).
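For reference, the CUDA awareness of the bundled OpenMPI can be checked with a standard ompi_info query (sketch below, nothing VASP-specific):

# query whether this OpenMPI build was compiled with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# prints a line of the form mca:mpi:base:param:mpi_built_with_cuda_support:value:<true|false>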
When running large finite-difference frequency calculations on 3 or 4 GPUs, the job runs normally for approximately 12–48 hours and then consistently hangs.
After the hang occurs, one GPU drops to 0% utilization while the others remain pinned at 100%.
All GPUs retain their allocated VRAM, and the calculation makes no further progress.
There are no ECC errors or Xid messages recorded in the kernel logs.
Backtraces collected after terminating the job indicate that the hang is a cross-rank deadlock between CUDA stream synchronization and the MPI collectives.
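For anyone who wants to look at the raw data: backtraces like the ones below can be captured by briefly attaching gdb to each rank of the hung job, roughly as follows (a minimal sketch; the process name vasp_gpu is a placeholder, adjust it to the actual executable):

# attach gdb to every VASP rank on the node and dump all thread backtraces
# (vasp_gpu is a placeholder process name)
for pid in $(pgrep -f vasp_gpu); do
    echo "=== PID $pid ==="
    gdb --batch -p "$pid" -ex "thread apply all bt"
done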
Here is the rank dump for one of the busy GPUs:
#0 cuStreamSynchronize() from libcuda.so.1
#1 __pgi_uacc_cuda_wait() at cuda_wait.c:82
#2 m_sumb_d() at mpi.f90:1700
#3 gridq::sumrl_gq() ...
#4 charge::soft_charge() ...
#5 elmin() ...
And here is the rank dump for the idle GPU:
#0 mca_btl_vader_component_progress()
#1 opal_progress()
#2 PMPI_Allreduce() [coll/cuda/coll_cuda_allreduce.c:63]
#3 m_sumb_d() at mpi.f90:1731
The job runs entirely within a single node and does not use InfiniBand (TCP-only communication).
The hang persists even after explicitly disabling NCCL's InfiniBand and peer-to-peer transports with NCCL_IB_DISABLE=1 and NCCL_P2P_DISABLE=1 in the Slurm sbatch script.
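The relevant part of the sbatch script looks roughly like this (the resource directives and the launch line are placeholders; the two NCCL exports are exactly as described above):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:4            # 3- and 4-GPU runs both show the hang

export NCCL_IB_DISABLE=1        # disable NCCL's InfiniBand transport
export NCCL_P2P_DISABLE=1       # disable NCCL's peer-to-peer transport

srun vasp_gpu                   # placeholder launcher and binary name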
Is there an "official" recommended runtime setup that enforces correct CUDA and MPI behavior in GPU builds of VASP, without having to set MCA parameters manually?
Regards,
Zhiyuan

