Hi,
thank you for reaching out to us on the official VASP forum.
This is quite a delicate topic that requires some more information before I can answer or give good advice. In general, we have tested the NVIDIA GPU port thoroughly for scaling. For example, we ran a 256-atom Si cell with HSE06 and found near-perfect scaling from 1 to 8 GPUs. Importantly, it did not matter whether those 8 GPUs resided on a single node or were spread one per node across 8 nodes. This should address one of your concerns: a multi-node job not even starting is not an inherent limitation of the port, and likely points to another problem.
1)
Can you share a bit about how you run the calculations? It is very important that you enable NCCL support when compiling VASP ("-DUSENCCL") for efficient communication between MPI ranks. Also, for the NVHPC MPI, make sure the following variables are set so that the MPI is GPU-aware:
Code: Select all
export MPICH_GPU_SUPPORT_ENABLED=1
export PMPI_GPU_AWARE=1
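For reference, the NCCL flag belongs in the precompiler options of your makefile.include. A minimal, hypothetical excerpt (your actual CPP_OPTIONS line will contain many more flags than shown here):

```
# Hypothetical makefile.include excerpt: append NCCL support to the precompiler flags
CPP_OPTIONS += -DUSENCCL
```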
Did you bind the MPI ranks correctly to cores and place the OpenMP threads close to them?
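A sketch of such a launch, assuming OpenMPI from the NVHPC SDK, 2 GPUs and 2 NUMA domains per node, and 6 OpenMP threads per rank (all of these numbers are placeholders for your hardware and scheduler):

```
export OMP_NUM_THREADS=6      # threads per MPI rank
export OMP_PLACES=cores       # pin each thread to a core
export OMP_PROC_BIND=close    # keep threads close to their parent rank
# one rank per NUMA domain, 6 cores reserved per rank, bound to cores
mpirun -np 2 --map-by ppr:1:numa:pe=6 --bind-to core vasp_std
```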
2)
Now, even if all of this is done correctly, there are still some caveats. First, to find the best performance it is probably better not to run a full MD calculation but rather to run a single SCF calculation and check its runtime via the "LOOP+" counter. This is where most of the time should be spent in the run anyway.
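To pull the wall time out of the LOOP+ line, something like this works (the sample line here only mimics the OUTCAR format; in practice you would use `grep 'LOOP+' OUTCAR | tail -1`):

```shell
# Extract the "real time" value from a LOOP+ line as printed in the OUTCAR.
line='      LOOP+:  cpu time  160.02: real time  164.60'
echo "$line" | awk -F'real time' '{print $2+0}'   # -> 164.6
```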
3)
What kind of ALGO are you using for the electronic steps? In general, GPUs can show their strengths for large unit cells, but this comes with a caveat: the algorithm used for the electronic steps. In the RMM-DIIS or Davidson algorithms, for example, only a few orbitals are optimized at a time. Hence, even though you have many KS orbitals, only a few are used in the LAPACK routines. This is most extreme in the RMM-DIIS algorithm, where many small matrix operations are performed; it is an algorithm designed for CPUs in the early days of VASP. For these kinds of calculations, VASP is also often memory-bandwidth limited. This changes completely when you run a hybrid-functional calculation like HSE06, or when you run multiple k-points and use KPAR to parallelize over them. HSE calculations are much better suited for GPUs than normal PBE/LDA calculations. Can you share some details about what kind of job you are running?
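As an illustrative INCAR fragment touching on these points (the values are placeholders, not recommendations for your system):

```
ALGO = Normal   ! blocked Davidson; ALGO = All or Damped are common choices for hybrids
KPAR = 4        ! parallelize over k-points, e.g. one k-point group per GPU
NSIM = 4        ! number of orbitals treated simultaneously in RMM-DIIS
```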
All that said, I would advocate first checking whether the calculation you are trying to do can actually benefit from more MPI ranks / more GPUs at all. I would like to see how 1, 2, 4, and 8 GPUs compare for one SCF cycle (the "LOOP+" count).
Just to illustrate, I ran an SCF calculation for a 384-atom cell (Gamma-point only) with vanilla PBE using the Davidson algorithm (ALGO = Normal), either on an AMD 96-core Milan processor or on NVIDIA A30 GPUs:
24x AMD EPYC: LOOP+: real time 342.5 sec
48x AMD EPYC: LOOP+: real time 263.9 sec
96x AMD EPYC: LOOP+: real time 389.6 sec
1x A30 (6 thr): LOOP+: real time 164.6 sec
2x A30 (6 thr): LOOP+: real time 111.5 sec
As you can see, there is a sweet spot for this ALGO where more cores do not necessarily bring more performance (since this is memory-bandwidth limited, more nodes would of course still help). You can also see that a single A30 outperforms this decent server CPU by far, and the A30 is a bit dated by now.
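To judge the scaling, the speedup and parallel efficiency between two such timings can be computed like this (numbers taken from the A30 runs above):

```shell
# Speedup and parallel efficiency going from 1 to 2 GPUs (LOOP+ real times in seconds)
t1=164.6; t2=111.5
awk -v a="$t1" -v b="$t2" 'BEGIN { s=a/b; printf "speedup %.2fx, efficiency %.0f%%\n", s, 100*s/2 }'
# -> speedup 1.48x, efficiency 74%
```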
Try the scaling test for your GPUs and let me know the results. I am happy to help you get to the bottom of this.
Best,
Alex