Hi,
thank you for reaching out to us on the official VASP forum.
This is quite a delicate topic that requires some more information before I can answer or give good advice. In general, we have tested the NVIDIA GPU port thoroughly for scaling. For example, we ran a 256-atom Si cell with HSE06 and found near-perfect scaling from 1 to 8 GPUs. Importantly, it did not matter whether those 8 GPUs resided on a single node or were spread one per node across 8 nodes. This should address one of your concerns: a multi-node job not even starting is not an inherent limitation of the port, and likely points to another problem.
1)
Can you share a bit about how you run the calculations? It is very important that you enable NCCL support when compiling VASP ("-DUSENCCL") for efficient communication between MPI ranks. Also, for the NVHPC MPI, make sure the following variables are set so that the MPI is GPU-aware:
Code: Select all
export MPICH_GPU_SUPPORT_ENABLED=1
export PMPI_GPU_AWARE=1
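For reference, the NCCL flag belongs in the precompiler options of your makefile.include. A minimal, hypothetical excerpt (your actual CPP_OPTIONS line will contain many more flags than shown here):

```
# Hypothetical makefile.include excerpt: append NCCL support to the precompiler flags
CPP_OPTIONS += -DUSENCCL
```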
Did you bind the MPI ranks correctly to cores and place the OpenMP threads close to them?
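A sketch of such a launch, assuming OpenMPI from the NVHPC SDK, 2 GPUs and 2 NUMA domains per node, and 6 OpenMP threads per rank (all of these numbers are placeholders for your hardware and scheduler):

```
export OMP_NUM_THREADS=6      # threads per MPI rank
export OMP_PLACES=cores       # pin each thread to a core
export OMP_PROC_BIND=close    # keep threads close to their parent rank
# one rank per NUMA domain, 6 cores reserved per rank, bound to cores
mpirun -np 2 --map-by ppr:1:numa:pe=6 --bind-to core vasp_std
```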
2)
Now, even if all of this is done correctly, there are still some caveats. First, to find the best performance it is probably better not to run a full MD calculation but rather to run a single SCF calculation and check its runtime via the "LOOP+" counter. This is where most of the time should be spent in the run anyway.
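To pull the wall time out of the LOOP+ line, something like this works (the sample line here only mimics the OUTCAR format; in practice you would use `grep 'LOOP+' OUTCAR | tail -1`):

```shell
# Extract the "real time" value from a LOOP+ line as printed in the OUTCAR.
line='      LOOP+:  cpu time  160.02: real time  164.60'
echo "$line" | awk -F'real time' '{print $2+0}'   # -> 164.6
```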
3)
What kind of ALGO are you using for the electronic steps? In general, GPUs can show their strengths for large unit cells, but this comes with a caveat: the algorithm used for the electronic steps. In the RMM-DIIS or Davidson algorithms, for example, only a few orbitals are optimized at a time. Hence, even though you have many KS orbitals, only a few are used in the LAPACK routines. This is most extreme in the RMM-DIIS algorithm, where many small matrix operations are performed; it is an algorithm designed for CPUs in the early days of VASP. For these kinds of calculations, VASP is also often memory-bandwidth limited. This changes completely when you run a hybrid-functional calculation like HSE06, or when you run multiple k-points and use KPAR to parallelize over them. HSE calculations are much better suited for GPUs than normal PBE/LDA calculations. Can you share some details about what kind of job you are running?
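As an illustrative INCAR fragment touching on these points (the values are placeholders, not recommendations for your system):

```
ALGO = Normal   ! blocked Davidson; ALGO = All or Damped are common choices for hybrids
KPAR = 4        ! parallelize over k-points, e.g. one k-point group per GPU
NSIM = 4        ! number of orbitals treated simultaneously in RMM-DIIS
```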
All that said, I would advocate first checking whether the calculation you are trying to do can actually benefit from more MPI ranks / more GPUs at all. I would like to see how 1, 2, 4, and 8 GPUs compare for one SCF cycle (the "LOOP+" count).
Just to illustrate, I ran an SCF calculation for a 384-atom cell (Gamma-point only) with vanilla PBE using the Davidson algorithm (ALGO = Normal), either on an AMD 96-core Milan processor or on NVIDIA A30 GPUs:
24x AMD EPYC: LOOP+: real time 342.5 sec
48x AMD EPYC: LOOP+: real time 263.9 sec
96x AMD EPYC: LOOP+: real time 389.6 sec
1x A30 (6 thr): LOOP+: real time 164.6 sec
2x A30 (6 thr): LOOP+: real time 111.5 sec
As you can see, there is a sweet spot for this ALGO where more cores do not necessarily bring more performance (since this is memory-bandwidth limited, more nodes would of course still help). You can also see that a single A30 outperforms this decent server CPU by far, and the A30 is a bit dated by now.
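To judge the scaling, the speedup and parallel efficiency between two such timings can be computed like this (numbers taken from the A30 runs above):

```shell
# Speedup and parallel efficiency going from 1 to 2 GPUs (LOOP+ real times in seconds)
t1=164.6; t2=111.5
awk -v a="$t1" -v b="$t2" 'BEGIN { s=a/b; printf "speedup %.2fx, efficiency %.0f%%\n", s, 100*s/2 }'
# -> speedup 1.48x, efficiency 74%
```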
Try the scaling test for your GPUs and let me know the results. I am happy to help you get to the bottom of this.
Best,
Alex