
Efficiency Degradation of Multi-Job VASP Runs on AMD CPUs

Posted: Sat Feb 07, 2026 7:45 am
by duo_wang1
  • We recently deployed a new node, hus2, with AMD EPYC 9754 CPUs.

  • While single VASP jobs perform well, we observe a severe efficiency degradation (\(\sim 3.3\times\) slowdown) when running two VASP jobs concurrently on the new node.

  • The degradation on the AMD node is consistent across VASP versions and cannot be mitigated by increasing the memory limit.

  • We are trying to understand the underlying cause and whether others have observed similar behavior.

Node Configuration

To investigate this issue, we conducted a systematic benchmark comparing hus2 with our older Intel-based node, hus1. Each hus1 CPU has 8 memory channels and each hus2 CPU has 12, so the dual-socket nodes provide 16 and 24 channels in total, respectively.
[Image: Table 1, hardware configuration of hus1 and hus2]
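As a rough, back-of-the-envelope estimate (assuming DDR5-4800 DIMMs in every channel on both nodes, which is an assumption on our part; actual DIMM speeds may differ), the theoretical per-core memory bandwidth on hus2 is roughly half that of hus1 despite the larger channel count:

Code: Select all

# Hypothetical peak bandwidth: channels * 4.8 GT/s * 8 bytes (DDR5-4800 assumed)
awk 'BEGIN {
  hus1 = 16 * 4.8 * 8;   # total GB/s, hus1 (16 channels)
  hus2 = 24 * 4.8 * 8;   # total GB/s, hus2 (24 channels)
  printf "hus1: %.0f GB/s total, %.1f GB/s per core\n", hus1, hus1 / 96;
  printf "hus2: %.0f GB/s total, %.1f GB/s per core\n", hus2, hus2 / 256;
}'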
Benchmark Parameters

  • Software: VASP

  • VASP versions tested: 6.4.1, 6.4.3, 6.5.1

  • Identical INCAR, POSCAR, POTCAR, and KPOINTS for all tests

  • Each job used 48 CPU cores

  • Default memory limit: 100 GB per job

  • Multi-job tests: two jobs running concurrently on the same node

  • Wall-clock time measured for identical SCF workloads

Performance comparison between hus1 and hus2 (48 cores, 100 GB memory)

[Image: Table 2, wall-clock times for single-job and dual-job runs on hus1 and hus2]
Single-job performance differs only moderately between hus2 and hus1 (\(\sim20\%\)), whereas under dual-job execution the runtime on hus2 exceeds that on hus1 by \(\sim150\%\) (see the 8th and 9th columns in Table 2). In other words, when running two jobs concurrently instead of a single job, hus2 suffers a dramatic \(\sim3.3\times\) slowdown, whereas hus1 slows down by only \(\sim1.5\times\) (see the 7th and 6th columns in Table 2).
To rule out memory capacity as a factor, we repeated the dual-job tests with higher memory limits (200 GB and no limit) (see Table 3). The runtime remained essentially unchanged, indicating that memory capacity is not the primary cause of the performance degradation on hus2.
[Image: Table 3, dual-job runtimes with 200 GB and unlimited memory]
Question

Has anyone observed similar behavior, where running multiple VASP jobs concurrently on the same node significantly reduces computational efficiency? What are the underlying reasons for this degradation, and are there more effective strategies for job partitioning or resource allocation to mitigate it?


Re: Efficiency Degradation of Multi-Job VASP Runs on AMD CPUs

Posted: Sat Feb 07, 2026 2:09 pm
by andreas.singraber

Hello!

Thanks for providing a clear description of the issue and detailed timing tables; this is very helpful for analyzing the situation!

First, I agree that this is definitely not an issue related to memory capacity. However, memory bandwidth most likely plays a role here, together with MPI rank placement. Actually, I am already unsure whether the 1.5x timing increase on the hus1 system is acceptable, because, theoretically, the 2x 48-core jobs should fit perfectly onto the 2x 48-core CPUs. Ideally, one would hope that each job runs entirely on one of the two CPUs and uses only the memory adjacent to this CPU. Without any "overlap", the timings should not degrade compared to the single-job performance (maybe by a few percent, because the OS sometimes needs a small share of the occupied cores). On the other hand, if the two jobs are bandwidth-limited and have to share the available memory bandwidth, the 1.5x timing increase may be fine. Either way, this should be checked by looking at the MPI rank placement.
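To make the ideal "no overlap" scenario concrete, here is a minimal sketch of how the two concurrent jobs could be confined to one socket each on hus1 (assuming NUMA node 0 covers cores 0-47 and node 1 covers cores 48-95; vasp_std stands for your actual executable and command line):

Code: Select all

# job 1: run entirely on socket 0, allocating only socket-0 memory
numactl --cpunodebind=0 --membind=0 mpirun -np 48 vasp_std

# job 2 (launched separately): socket 1 and its local memory
numactl --cpunodebind=1 --membind=1 mpirun -np 48 vasp_std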

Maybe some more context: with today's high-core-count CPUs there is unfortunately a lot to consider to make effective use of all the available compute power. Hence, one cannot view nodes like hus1 and hus2 as "homogeneous" devices; rather, they have an internal structure which needs to be reflected in the way multi-process programs like VASP are executed on them. Both your machines are dual-socket, i.e., there are two CPUs of the same type available. The first thing to be aware of here is that the installed memory is actually split between these CPUs, with one half being closer to CPU A and the other half closer to CPU B. "Closer" in this context means that there is a latency penalty if, e.g., CPU A tries to access data residing in memory attached to CPU B. The technical term for this non-homogeneity is Non-Uniform Memory Access (NUMA). Thus, both of your systems have at least two NUMA domains.

However, the situation gets more complicated: both CPUs (Intel 8475B and AMD 9754) have an internal structure which gives the individual cores different "distances" to memory and L3 cache. Effectively, you can have up to 4 NUMA domains per CPU, ending up with 8 NUMA domains per node (this depends on your BIOS settings). The situation is more pronounced on the AMD CPUs because there are many more cores which have to share what is actually less total memory bandwidth (see your first table). You can get information about the actual situation on your machines with these commands:

Code: Select all

lscpu
numactl -H
lstopo

The last command also lets you print a nice image like the one on the VASP Wiki: https://vasp.at/wiki/Optimizing_the_par ... e_hardware. Could you post the output of these commands for your systems?
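If you also want to quantify the bandwidth side, a STREAM-type benchmark pinned to a single NUMA node measures the sustained local bandwidth (a sketch; assumes an OpenMP build of STREAM compiled locally as ./stream):

Code: Select all

# sustained memory bandwidth of NUMA node 0, threads and memory both local
OMP_NUM_THREADS=48 numactl --cpunodebind=0 --membind=0 ./stream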

It is important to know that, without further intervention, the operating system will automatically try to load-balance the cores in your system by shifting the executed program (or its MPI processes) around between the physical cores. This sounds nice, but it is potentially harmful for performance because the individual tasks of a multi-process program are often moved away from the memory they were initially close to. Hence, for optimal performance it is essential to "pin" the MPI processes to physical cores. This is usually done via arguments to the mpirun command, see also the Wiki link above. To check how the MPI tasks are actually pinned to cores you can use these additional command-line arguments to mpirun:

Code: Select all

# OpenMPI
mpirun --report-bindings ...

# Intel MPI
mpirun -genv I_MPI_DEBUG=4 ...

# SLURM
srun --cpu-bind=verbose ...

For example, with OpenMPI you should get lines like this on your screen output:

Code: Select all

[wahoo01.vasp.co:2711168] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[wahoo01.vasp.co:2711168] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [../B./../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[wahoo01.vasp.co:2711168] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [../../B./../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[wahoo01.vasp.co:2711168] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [../../../B./../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[wahoo01.vasp.co:2711168] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [../../../../B./../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[wahoo01.vasp.co:2711168] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [../../../../../B./../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[wahoo01.vasp.co:2711168] MCW rank 6 bound to socket 0[core 6[hwt 0]]: [../../../../../../B./../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
...
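To actually enforce a particular placement, rather than just report it, the usual options look like this (a sketch; exact syntax and defaults vary between MPI versions):

Code: Select all

# OpenMPI: bind one rank per physical core
mpirun -np 48 --bind-to core --map-by core ...

# Intel MPI: pin ranks to an explicit list of physical cores
mpirun -np 48 -genv I_MPI_PIN=1 -genv I_MPI_PIN_PROCESSOR_LIST=0-47 ...

# SLURM: bind tasks to cores
srun -n 48 --cpu-bind=cores ...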

With these pinning reports enabled, can you please run your tests again for both systems, each with running a single job and two jobs, and attach the output here? With this information we should be able to get a clearer picture of where the performance is lost, in particular on the hus2 system.

Thank you!

All the best,
Andreas Singraber

P.S. Sorry for the lengthy introduction to CPU architecture, maybe there is nothing new to you at all... but maybe some other users will find this helpful.


Re: Efficiency Degradation of Multi-Job VASP Runs on AMD CPUs

Posted: Sun Feb 08, 2026 8:41 am
by duo_wang1

Image


Re: Follow-up: NUMA topology and MPI pinning on hus1 and hus2

Posted: Thu Mar 05, 2026 11:46 am
by duo_wang1

Dear Andreas,

Thanks a lot for the detailed and patient explanation. Your overview of the NUMA architecture, memory bandwidth, and MPI rank placement was really clear, and it definitely helped us better understand the performance issues we’re seeing.
First, I agree that this is not a memory capacity issue but is more likely related to memory bandwidth and how the MPI ranks are actually placed. As you pointed out, in theory, on a dual-socket node with two 48-core CPUs, running two 48-core jobs at the same time should not lead to a noticeable performance drop, as long as each job is fully bound to its own CPU and mainly accesses the memory adjacent to this CPU. The \(\sim1.5\times\) increase in runtime we’re seeing on hus1 does suggest we should take a closer look at the MPI rank placement to make sure it’s doing what we expect.
As for the NUMA topology, we’ll take your advice and gather the following information on hus1 and hus2:
command lscpu
hus1

Code: Select all

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8475B
Stepping:            8
CPU MHz:             2700.000
CPU max MHz:         3800.0000
CPU min MHz:         800.0000
BogoMIPS:            5400.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            2048K
L3 cache:            99840K
NUMA node0 CPU(s):   0-47
NUMA node1 CPU(s):   48-95

hus2

Code: Select all

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  1
Core(s) per socket:  128
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               160
Model name:          AMD EPYC 9754 128-Core Processor
Stepping:            2
CPU MHz:             2250.000
CPU max MHz:         3100.3411
CPU min MHz:         1500.0000
BogoMIPS:            4493.37
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            16384K
NUMA node0 CPU(s):   0-127
NUMA node1 CPU(s):   128-255

command numactl -H
hus1

Code: Select all

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 257639 MB
node 0 free: 226996 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 257989 MB
node 1 free: 157854 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10     

hus2

Code: Select all

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 0 size: 386182 MB
node 0 free: 284077 MB
node 1 cpus: 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 1 size: 386974 MB
node 1 free: 203971 MB
node distances:
node   0   1 
  0:  10  32 
  1:  32  10     

command lstopo
hus1
[Image: lstopo output for hus1]
hus2
[Image: lstopo output for hus2]
We reran the same jobs on hus1 and hus2 to check the runtimes again. For Intel MPI, we enabled pinning with mpirun -genv I_MPI_PIN=ON, and also passed -genv I_MPI_DEBUG=4 to print the MPI pinning information so we could confirm how the ranks are mapped to sockets and cores.
We’ve collected the outputs and the MPI binding reports and shared them below for your reference.
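For reference, the launch command had this form (a representative sketch; the actual binary path is omitted):

Code: Select all

mpirun -np 48 -genv I_MPI_PIN=on -genv I_MPI_DEBUG=4 vasp_std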
[Image: Table 1, runtimes on hus1 and hus2 with and without MPI pinning]
Based on the test results (see Table 1), on the hus1 node we do not observe any noticeable difference in computational performance before and after MPI pinning, neither for single-job nor for dual-job runs. The situation on hus2 is somewhat different. For single-job execution, MPI pinning has almost no visible impact on performance. However, when running two jobs concurrently, the behavior before and after pinning differs more clearly. Without MPI pinning, both jobs show very similar runtimes of \(\sim\)1200 s. After enabling MPI pinning, one of the jobs becomes slightly faster (\(\sim\)994 s), while the other slows down significantly (\(\sim\)1872 s). As a result, MPI pinning does not lead to an overall performance improvement for concurrent multi-job execution on hus2; instead it introduces a pronounced imbalance between the two jobs.
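As an additional cross-check, the affinity of a running rank can also be queried directly from the OS, independently of the MPI startup report (using, e.g., the PID of rank 0 from the hus2 dual-job log below):

Code: Select all

# show the core list this process is currently allowed to run on
taskset -cp 2555189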
Running a single job
hus1

Code: Select all

[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       1832336  hus1       {0,1}
[0] MPI startup(): 1       1832337  hus1       {2,3}
[0] MPI startup(): 2       1832338  hus1       {4,5}
...
[0] MPI startup(): 46      1832383  hus1       {92,93}
[0] MPI startup(): 47      1832384  hus1       {94,95}
 running   48 mpi-ranks, on    1 nodes
 distrk:  each k-point on   48 cores,    1 groups
 distr:  one band on   12 cores,    4 groups    

hus2

Code: Select all

[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       2554478  hus2       {0-5}
[0] MPI startup(): 1       2554479  hus2       {6-11}
[0] MPI startup(): 2       2554480  hus2       {12-17}
...
[0] MPI startup(): 46      2554524  hus2       {246-250}
[0] MPI startup(): 47      2554525  hus2       {251-255}
 running   48 mpi-ranks, on    1 nodes
 distrk:  each k-point on   48 cores,    1 groups
 distr:  one band on   12 cores,    4 groups  

Running two jobs on the same node
hus1

Code: Select all

[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       1832750  hus1       {0}
[0] MPI startup(): 1       1832751  hus1       {1}
[0] MPI startup(): 2       1832752  hus1       {2}
...
[0] MPI startup(): 46      1832841  hus1       {46}
[0] MPI startup(): 47      1832843  hus1       {47}
 running   48 mpi-ranks, on    1 nodes
 distrk:  each k-point on   48 cores,    1 groups
 distr:  one band on   12 cores,    4 groups    

hus2

Code: Select all

[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       2555189  hus2       {0}
[0] MPI startup(): 1       2555190  hus2       {1}
[0] MPI startup(): 2       2555191  hus2       {2}
[0] MPI startup(): 3       2555192  hus2       {3}
[0] MPI startup(): 4       2555193  hus2       {4}
[0] MPI startup(): 5       2555194  hus2       {5}
[0] MPI startup(): 6       2555195  hus2       {6}
[0] MPI startup(): 7       2555196  hus2       {7}
[0] MPI startup(): 8       2555197  hus2       {8}
[0] MPI startup(): 9       2555198  hus2       {9}
[0] MPI startup(): 10      2555199  hus2       {10}
[0] MPI startup(): 11      2555200  hus2       {11}
[0] MPI startup(): 12      2555201  hus2       {12}
[0] MPI startup(): 13      2555202  hus2       {13}
[0] MPI startup(): 14      2555203  hus2       {14}
[0] MPI startup(): 15      2555204  hus2       {15}
[0] MPI startup(): 16      2555205  hus2       {32}
[0] MPI startup(): 17      2555206  hus2       {33}
[0] MPI startup(): 18      2555207  hus2       {34}
[0] MPI startup(): 19      2555208  hus2       {35}
[0] MPI startup(): 20      2555209  hus2       {36}
[0] MPI startup(): 21      2555210  hus2       {37}
[0] MPI startup(): 22      2555211  hus2       {38}
[0] MPI startup(): 23      2555212  hus2       {39}
[0] MPI startup(): 24      2555213  hus2       {40}
[0] MPI startup(): 25      2555214  hus2       {41}
[0] MPI startup(): 26      2555215  hus2       {42}
[0] MPI startup(): 27      2555216  hus2       {43}
[0] MPI startup(): 28      2555217  hus2       {44}
[0] MPI startup(): 29      2555218  hus2       {45}
[0] MPI startup(): 30      2555219  hus2       {46}
[0] MPI startup(): 31      2555220  hus2       {47}
[0] MPI startup(): 32      2555221  hus2       {96}
[0] MPI startup(): 33      2555222  hus2       {97}
[0] MPI startup(): 34      2555223  hus2       {98}
[0] MPI startup(): 35      2555224  hus2       {99}
[0] MPI startup(): 36      2555225  hus2       {100}
[0] MPI startup(): 37      2555226  hus2       {101}
[0] MPI startup(): 38      2555227  hus2       {102}
[0] MPI startup(): 39      2555228  hus2       {103}
[0] MPI startup(): 40      2555229  hus2       {104}
[0] MPI startup(): 41      2555230  hus2       {105}
[0] MPI startup(): 42      2555231  hus2       {106}
[0] MPI startup(): 43      2555232  hus2       {107}
[0] MPI startup(): 44      2555233  hus2       {108}
[0] MPI startup(): 45      2555234  hus2       {109}
[0] MPI startup(): 46      2555235  hus2       {110}
[0] MPI startup(): 47      2555236  hus2       {111}
 running   48 mpi-ranks, on    1 nodes
 distrk:  each k-point on   48 cores,    1 groups
 distr:  one band on   12 cores,    4 groups

Regarding the MPI pinning reports: the pinning behavior changes depending on whether one or two jobs are running. On hus1, a single job pins each MPI rank to two CPU cores, whereas with two concurrent jobs each rank is bound to a single core. On hus2, for a single job, MPI ranks 0–7 and 25–32 are pinned to six CPU cores each, while the remaining ranks are pinned to five cores each; with two concurrent jobs, each rank is again bound to a single core.
[Image: Table 2, execution times of concurrent jobs on hus2 with MPI pinning]
On hus2, after enabling MPI rank pinning, a clear imbalance in execution time can be observed when two jobs are run concurrently, and the imbalance becomes more pronounced as more jobs run at the same time (see Table 2). Could this behavior be related to contention for shared resources within the node, such as limited memory bandwidth, or to differences in memory-access locality arising from the NUMA architecture?
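Based on the numactl -H output above (NUMA node 0 = cores 0-127, node 1 = cores 128-255), one thing we plan to try is forcing each concurrent job onto its own socket with an explicit processor list (a sketch, not yet tested; the trailing dots stand for our usual VASP command line):

Code: Select all

# job 1: 48 ranks pinned to cores on socket 0
mpirun -np 48 -genv I_MPI_PIN=1 -genv I_MPI_PIN_PROCESSOR_LIST=0-47 ...

# job 2: 48 ranks pinned to cores on socket 1
mpirun -np 48 -genv I_MPI_PIN=1 -genv I_MPI_PIN_PROCESSOR_LIST=128-175 ...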
Thanks again for all the detailed explanations and guidance; they have been really helpful for understanding performance behavior on high-core-count CPUs. We are looking forward to your analysis of the data, which should help us pinpoint the source of the performance degradation on hus2.