by dominic_varghese » Tue Jun 30, 2026 1:31 pm
Hi everyone,
I am running VASP/6.5.1-nvhpc-gpu on V100 GPUs to speed up my AIMD calculations on a metal with a dense k-grid. This is my submit script:
Code:
#SBATCH --gpus=v100-32:8
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
ulimit -s unlimited
module purge
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/packages/hdf5/hdf5-2.0.0/nvhpc/lib
module use -a /opt/packages/nvhpc/v25.5/modulefiles
module load nvhpc-hpcx-cuda12/25.5 intel-mkl/2023.2.0 VASP/6.5.1-nvhpc-gpu
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun -np 8 vasp_std > log
And this is the INCAR for the NPT run at 300 K on a 2x2x2 supercell with 72 atoms:
Code:
ISTART = 0
# Hardware & Performance
NCORE = 8
ALGO = Normal
PREC = Normal
NSIM = 16
# Electronic Optimization (Matched to paper)
ENCUT = 450
EDIFF = 1.0e-8
NELM = 100
NELMIN = 4
GGA = PS
# Smearing
ISMEAR = 0
SIGMA = 0.01
LREAL = A # (Projection operators: automatic)
ML_ISTART = 0 # Start from scratch
ISYM = 0 # Essential for MD
# --- MD & NPT Settings ---
IBRION = 0
NSW = 1000
POTIM = 1.0
TEBEG = 300
TEEND = 300
MDALGO = 3 # Langevin
ISIF = 3 # Variable cell (NPT)
# Friction coefficients
LANGEVIN_GAMMA = 10.0 10.0 10.0
LANGEVIN_GAMMA_L = 10.0
PMASS = 100
ML_LMLFF = .TRUE. # Enable Machine Learning Force Field
ML_MODE = TRAIN # Train on the fly
Is this the best way to get the maximum performance and speed-up from running the code on GPUs compared to CPUs?
Am I making any mistakes, or do you have any suggestions?
Thanks
Dominic

by michael_wolloch (Global Moderator) » Tue Jun 30, 2026 2:32 pm
Dear Dominic,
Unfortunately, performance optimization is no easy task, and you will have to do some benchmark calculations to test out your settings.
Make sure you read our guide on Optimizing the parallelization.
To benchmark, I would use the same system you are trying to run MD on, but only do around 15-20 SCF steps and no ionic updates. Make sure that you only tweak one setting at a time (e.g., NSIM, or OMP_NUM_THREADS) and systematically go through combinations.
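The one-setting-at-a-time benchmark described above could be set up roughly as follows. This is only a sketch: the helper names `set_tag` and `make_bench` and the `bench_<TAG>_<value>` directory layout are made up for illustration, and translating "no ionic updates, 15-20 SCF steps" into NSW = 0 and NELM = 20 is an assumption that should be checked against your MD and ML settings. It also assumes one tag per INCAR line (no `;`-separated tags).

```shell
#!/bin/bash
# Sketch: one benchmark directory per value of a single INCAR tag.
# Helper names and directory layout are hypothetical, not part of VASP.

# set_tag FILE TAG VALUE
# Drop any existing "TAG = ..." line from the INCAR and append the new value.
set_tag() {
    local file=$1 tag=$2 value=$3 tmp
    tmp=$(mktemp)
    grep -Ev "^[[:space:]]*${tag}[[:space:]]*=" "$file" > "$tmp" || true
    printf '%s = %s\n' "$tag" "$value" >> "$tmp"
    mv "$tmp" "$file"
}

# make_bench SRC_DIR TAG VALUE...
# Copy the inputs once per VALUE and change only TAG between the copies.
make_bench() {
    local src=$1 tag=$2 value dir
    shift 2
    for value in "$@"; do
        dir="bench_${tag}_${value}"
        mkdir -p "$dir"
        cp "$src"/INCAR "$src"/POSCAR "$src"/KPOINTS "$src"/POTCAR "$dir"/
        set_tag "$dir/INCAR" NSW 0      # assumption: no ionic updates
        set_tag "$dir/INCAR" NELM 20    # assumption: cap at 20 SCF steps
        set_tag "$dir/INCAR" "$tag" "$value"
    done
}
```

Calling `make_bench . NSIM 4 8 16 32` from the directory that holds your inputs would then create four directories that differ only in NSIM; submit each with the same job script and compare the timings in the OUTCAR files.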
Some general points:
- KPAR is for sure your best friend for increasing performance if you have more than one k-point. 72 atoms is a pretty small system, so you might not be running on the Gamma point only (please provide all input files as a zipped archive in the future, as per the posting guidelines). So, if you have N k-points, run on N GPUs (if possible) and use KPAR=N. Note that your VRAM usage will increase if you increase KPAR, so on 32 GB V100 GPUs this could become an issue at some point.
- ALGO=Fast is usually better than ALGO=Normal for performance on GPUs, especially if more GPUs are used. You will have to test this for your system, however.
- Try fewer GPUs if you don't have as many k-points as GPUs. VASP generally has a hard time utilizing the compute of large GPUs in most calculations and is memory-bandwidth bound.
- NCORE is set to 1 internally for all GPU runs, so don't bother with it.
- The number of MPI ranks should always be equal to the number of GPUs you are running on.
- Compile with NCCL if you have not already done so.
- Your SIGMA is pretty small; you might need fewer SCF steps per ionic step if you increase it. But be sure to check that your entropy term is still reasonable.
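To make the KPAR and GPU-count points concrete, here is a rough sketch of how one might pick both numbers from the k-point count. The function names are hypothetical. The selection rule (use the fewest GPUs that still give the minimum number of k-point "rounds") is an assumption on my side, not a VASP rule: it treats all k-points as equally expensive and ignores VRAM limits. The k-point count is read from the second line of an IBZKPT file written by a previous (short) run.

```shell
#!/bin/bash
# Sketch: choose one value to use for GPUs, MPI ranks, and KPAR.
# Function names and the selection heuristic are illustrative assumptions.

# count_kpoints IBZKPT_FILE
# The second line of IBZKPT holds the number of irreducible k-points.
count_kpoints() {
    awk 'NR == 2 { print $1; exit }' "$1"
}

# pick_kpar NKPTS NGPUS_AVAILABLE
# Print the smallest GPU count g <= min(NKPTS, NGPUS_AVAILABLE) for which
# ceil(NKPTS / g) is as small as it can be with the available GPUs.
pick_kpar() {
    local nk=$1 ngpus=$2 gmax rounds g
    gmax=$(( ngpus < nk ? ngpus : nk ))
    rounds=$(( (nk + gmax - 1) / gmax ))      # best achievable number of rounds
    for (( g = 1; g <= gmax; g++ )); do
        if (( (nk + g - 1) / g == rounds )); then
            echo "$g"
            return 0
        fi
    done
}
```

For example, with 10 irreducible k-points and 8 GPUs available, `pick_kpar "$(count_kpoints IBZKPT)" 8` prints 5. You would then request 5 GPUs, set KPAR = 5 in the INCAR, drop NCORE, and launch with `mpirun -np 5 vasp_std`, so that ranks, GPUs, and KPAR all match. Whether that beats 8 GPUs in practice is something only the benchmark can tell you.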
Let me know if you have more questions,
Cheers, Michael