Optimizing the parallelization - Revision history

Wolloch at 08:53, 16 March 2026

2026-03-16T08:53:30Z

← Older revision		Revision as of 08:53, 16 March 2026
Line 35:		Line 35:
	For the common case of [[:Category: Electronic minimization\|electronic minimization]] calculations, the following rules of thumb apply:		For the common case of [[:Category: Electronic minimization\|electronic minimization]] calculations, the following rules of thumb apply:
	* Aim to set the number of ranks to the default value of {{TAG\|NBANDS}} divided by a small integer. Note that the number of bands ({{TAG\|NBANDS}}) is increased to accommodate the number of ranks.		* Aim to set the number of ranks to the default value of {{TAG\|NBANDS}} divided by a small integer. Note that the number of bands ({{TAG\|NBANDS}}) is increased to accommodate the number of ranks.
	* Choose {{TAG\|NCORE}} as a factor of the cores per node to avoid communicating between nodes for the FFTs. Mind that {{TAG\|NCORE}} cannot be set with [[Combining MPI and OpenMP\|OpenMP]] threading and/or the [[~~OpenACC~~ GPU ~~port~~ of VASP~~\|OpenACC GPU port~~]]. Use the number of OpenMP threads in this case to for fine grained control over parallelization.		* Choose {{TAG\|NCORE}} as a factor of the cores per node to avoid communicating between nodes for the FFTs. Mind that {{TAG\|NCORE}} cannot be set with [[Combining MPI and OpenMP\|OpenMP]] threading and/or the [[GPU ports of VASP]]. Use the number of OpenMP threads in this case for fine-grained control over parallelization.
	* The '''k'''-point parallelization is efficient but requires additional [[:Category:Memory\|memory]]. Given sufficient [[:Category:Memory\|memory]], increase {{TAG\|KPAR}} up to the number of irreducible '''k''' points. Keep in mind that {{TAG\|KPAR}} should factorize the number of '''k''' points. This is especially important to reduce MPI communication between [[Optimizing_the_parallelization#Understanding_the_hardware\|NUMA domains]] and compute nodes. Read also {{TAG\|KPAR}} for more information.		* The '''k'''-point parallelization is efficient but requires additional [[:Category:Memory\|memory]]. Given sufficient [[:Category:Memory\|memory]], increase {{TAG\|KPAR}} up to the number of irreducible '''k''' points. Keep in mind that {{TAG\|KPAR}} should factorize the number of '''k''' points. This is especially important to reduce MPI communication between [[Optimizing_the_parallelization#Understanding_the_hardware\|NUMA domains]] and compute nodes. Read also {{TAG\|KPAR}} for more information.
	{{NB\|tip\|If the number of '''k''' points is a prime number (or does not factorize well), copy the {{FILE\|IBZKPT}} file to {{FILE\|KPOINTS}} and add zero-weigthed points.}}		{{NB\|tip\|If the number of '''k''' points is a prime number (or does not factorize well), copy the {{FILE\|IBZKPT}} file to {{FILE\|KPOINTS}} and add zero-weigthed points.}}
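The factorization rule for {{TAG\|KPAR}} in the list above can be checked with a few lines of Python. This is a minimal sketch and not part of VASP: the function name <code>kpar_candidates</code> and the example numbers are hypothetical, and the sketch assumes that {{TAG\|KPAR}} should divide the number of MPI ranks as well as the number of irreducible '''k''' points.

```python
def kpar_candidates(n_kpoints: int, n_ranks: int) -> list[int]:
    """Return KPAR values that divide both the k-point count and the rank count.

    Assumption of this sketch: a sensible KPAR is a common divisor of the
    number of irreducible k points and the number of MPI ranks.
    """
    if n_kpoints < 1 or n_ranks < 1:
        raise ValueError("n_kpoints and n_ranks must be positive")
    upper = min(n_kpoints, n_ranks)
    return [k for k in range(1, upper + 1)
            if n_kpoints % k == 0 and n_ranks % k == 0]


# Hypothetical example: 20 irreducible k points on 32 ranks.
print(kpar_candidates(n_kpoints=20, n_ranks=32))  # [1, 2, 4]
# A prime number of k points leaves only the trivial choice.
print(kpar_candidates(n_kpoints=13, n_ranks=32))  # [1]
```

The second call illustrates the situation addressed by the tip: with a prime number of '''k''' points only <code>KPAR = 1</code> remains, so adding zero-weighted points to reach a count that factorizes better can open up more choices.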

Singraber at 14:26, 9 February 2026

2026-02-09T14:26:34Z

← Older revision		Revision as of 14:26, 9 February 2026
Line 54:		Line 54:
	Each hardware setup is unique since nowadays CPUs and GPUs alike consists not of a single monolithic core anymore. Each processor might consist of multiple tiles or domains that each are connected to their own memory. To leverage the full potential of the hardware one has to reduce communication between cores that are "far away" from each other. To understand this better the concept of NUMA (Non-Uniform Memory Access) domains is useful. In NUMA architecture, memory access times vary depending on the proximity of the memory to the CPU core requesting it — local memory access is quicker, while remote access across domains incurs latency. Modern multi-socket and multi-core systems use NUMA to improve scalability and performance by keeping data close to the cores that use it most often. Understanding and managing NUMA domains is essential for optimizing memory placement and parallel performance in HPC.		Each hardware setup is unique since nowadays CPUs and GPUs alike consists not of a single monolithic core anymore. Each processor might consist of multiple tiles or domains that each are connected to their own memory. To leverage the full potential of the hardware one has to reduce communication between cores that are "far away" from each other. To understand this better the concept of NUMA (Non-Uniform Memory Access) domains is useful. In NUMA architecture, memory access times vary depending on the proximity of the memory to the CPU core requesting it — local memory access is quicker, while remote access across domains incurs latency. Modern multi-socket and multi-core systems use NUMA to improve scalability and performance by keeping data close to the cores that use it most often. Understanding and managing NUMA domains is essential for optimizing memory placement and parallel performance in HPC.

	[[File:Numa example.png\|alt=NUMA example Epyc 7543\|none\|thumb\|950x950px\|Fig. 1: Example output of <code>lstopo</code> command for an AMD Epyc ~~7543~~ processor. The processor consists of multiple hardware tiles that display as 4 distinct NUMA domains. Disclaimer: This is an example and not a general hardware recommentation.]]		[[File:Numa example.png\|alt=NUMA example Epyc 7543\|none\|thumb\|950x950px\|Fig. 1: Example output of <code>lstopo</code> command for an AMD Epyc 7543P processor. The processor consists of multiple hardware tiles that display as 4 distinct NUMA domains. Disclaimer: This is an example and not a general hardware recommentation.]]

	Fig. 1 displays the NUMA architecture of an AMD Epyc ~~7543~~ processor. In total this processor has 512 GB of system memory available. However, this memory is spread across 4 NUMA nodes (pink bars). Each NUMA domain consists of 8 physical CPU cores (each able to work on 2 threads, i.e. P#0/P#32 belong to the same physical core). These domains share not only a chunk of the system memory, but also their own L3 cache. Hence, it is important for this specific processor that not more than 8 cores work on a specific task. If more cores work on the same memory they have to communicate to the other NUMA domains with some latency. This is important for choosing parallelization tags. Hence, we know now that 8 cores is a good work-group size for VASP on this processor, and that up to 32 cores can work on the same memory with a small penalty. If we use more than 32 cores, we have to communicate between two different processors (nodes / sockets) via a different protocol (MPI); much higher latency. If the tool <code>lstopo</code> is not available on your system, you can use the command <code>numactl --hardware</code> to get text output of the same information:		Fig. 1 displays the NUMA architecture of an AMD Epyc 7543P processor. In total this processor has 512 GB of system memory available. However, this memory is spread across 4 NUMA nodes (pink bars). Each NUMA domain consists of 8 physical CPU cores (each able to work on 2 threads, i.e. P#0/P#32 belong to the same physical core). These domains share not only a chunk of the system memory, but also their own L3 cache. Hence, it is important for this specific processor that not more than 8 cores work on a specific task. If more cores work on the same memory they have to communicate to the other NUMA domains with some latency. This is important for choosing parallelization tags. 
Hence, we know now that 8 cores is a good work-group size for VASP on this processor, and that up to 32 cores can work on the same memory with a small penalty. If we use more than 32 cores, we have to communicate between two different processors (nodes / sockets) via a different protocol (MPI); much higher latency. If the tool <code>lstopo</code> is not available on your system, you can use the command <code>numactl --hardware</code> to get text output of the same information:

	# numactl --hardware		# numactl --hardware

Singraber at 07:58, 24 October 2025

2025-10-24T07:58:35Z

Huebsch at 12:13, 22 October 2025

2025-10-22T12:13:22Z

← Older revision		Revision as of 12:13, 22 October 2025
Line 54:		Line 54:
	Each hardware setup is unique since nowadays CPUs and GPUs alike consists not of a single monolithic core anymore. Each processor might consist of multiple tiles or domains that each are connected to their own memory. To leverage the full potential of the hardware one has to reduce communication between cores that are "far away" from each other. To understand this better the concept of NUMA (Non-Uniform Memory Access) domains is useful. In NUMA architecture, memory access times vary depending on the proximity of the memory to the CPU core requesting it — local memory access is quicker, while remote access across domains incurs latency. Modern multi-socket and multi-core systems use NUMA to improve scalability and performance by keeping data close to the cores that use it most often. Understanding and managing NUMA domains is essential for optimizing memory placement and parallel performance in HPC.		Each hardware setup is unique since nowadays CPUs and GPUs alike consists not of a single monolithic core anymore. Each processor might consist of multiple tiles or domains that each are connected to their own memory. To leverage the full potential of the hardware one has to reduce communication between cores that are "far away" from each other. To understand this better the concept of NUMA (Non-Uniform Memory Access) domains is useful. In NUMA architecture, memory access times vary depending on the proximity of the memory to the CPU core requesting it — local memory access is quicker, while remote access across domains incurs latency. Modern multi-socket and multi-core systems use NUMA to improve scalability and performance by keeping data close to the cores that use it most often. Understanding and managing NUMA domains is essential for optimizing memory placement and parallel performance in HPC.

	[[File:Numa example.png\|alt=NUMA example Epyc 7543\|none\|thumb\|950x950px\|Fig. 1: Example output of <code>lstopo</code> command for an AMD Epyc 7543 processor. The processor consists of multiple hardware tiles that display as 4 distinct NUMA domains.]]		[[File:Numa example.png\|alt=NUMA example Epyc 7543\|none\|thumb\|950x950px\|Fig. 1: Example output of <code>lstopo</code> command for an AMD Epyc 7543 processor. The processor consists of multiple hardware tiles that display as 4 distinct NUMA domains. Disclaimer: This is an example and not a general hardware recommentation.]]

	Fig. 1 displays the NUMA architecture of an AMD Epyc 7543 processor~~. Disclaimer: This is an example and not a general hardware recommentation~~. In total this processor has 512 GB of system memory available. However, this memory is spread across 4 NUMA nodes (pink bars). Each NUMA domain consists of 8 physical CPU cores (each able to work on 2 threads, i.e. P#0/P#32 belong to the same physical core). These domains share not only a chunk of the system memory, but also their own L3 cache. Hence, it is important for this specific processor that not more than 8 cores work on a specific task. If more cores work on the same memory they have to communicate to the other NUMA domains with some latency. This is important for choosing parallelization tags. Hence, we know now that 8 cores is a good work-group size for VASP on this processor, and that up to 32 cores can work on the same memory with a small penalty. If we use more than 32 cores, we have to communicate between two different processors (nodes / sockets) via a different protocol (MPI); much higher latency. If the tool <code>lstopo</code> is not available on your system, you can use the command <code>numactl --hardware</code> to get text output of the same information:		Fig. 1 displays the NUMA architecture of an AMD Epyc 7543 processor. In total this processor has 512 GB of system memory available. However, this memory is spread across 4 NUMA nodes (pink bars). Each NUMA domain consists of 8 physical CPU cores (each able to work on 2 threads, i.e. P#0/P#32 belong to the same physical core). These domains share not only a chunk of the system memory, but also their own L3 cache. Hence, it is important for this specific processor that not more than 8 cores work on a specific task. If more cores work on the same memory they have to communicate to the other NUMA domains with some latency. This is important for choosing parallelization tags. 
Hence, we know now that 8 cores is a good work-group size for VASP on this processor, and that up to 32 cores can work on the same memory with a small penalty. If we use more than 32 cores, we have to communicate between two different processors (nodes / sockets) via a different protocol (MPI); much higher latency. If the tool <code>lstopo</code> is not available on your system, you can use the command <code>numactl --hardware</code> to get text output of the same information:

	# numactl --hardware		# numactl --hardware

Huebsch at 12:07, 22 October 2025

2025-10-22T12:07:37Z

Hampel: /* Optimizing the parallelization */

2025-10-17T12:59:14Z

Optimizing the parallelization

← Older revision		Revision as of 12:59, 17 October 2025
Line 41:		Line 41:
	** {{TAGO\|KPAR\|no of GPUs}} if memory allows		** {{TAGO\|KPAR\|no of GPUs}} if memory allows
	** [[Combining_MPI_and_OpenMP\|OpenMP threading]] can be quite important for the parts that still run on the host, because usually no of GPUs is rather small.		** [[Combining_MPI_and_OpenMP\|OpenMP threading]] can be quite important for the parts that still run on the host, because usually no of GPUs is rather small.
			** {{TAG\|NSIM}} controls how many orbitals are worked on in parallel. For GPUs the default value of 4 is way too small. Increase the value. Values as large as {{TAGO\|NSIM\|32}} might be beneficial.
	** small systems might perform very poorly in standard DFT (this is different for hybrid, GW, or BSE calculations). GPUs show their benefit when there are many bands or ions over which the code is parallelized.		** small systems might perform very poorly in standard DFT (this is different for hybrid, GW, or BSE calculations). GPUs show their benefit when there are many bands or ions over which the code is parallelized.
	* Finally, use the {{TAG\|IMAGES}} tag to split several VASP runs into separate calculations. The limit is dictated by the number of desired calculations.		* Finally, use the {{TAG\|IMAGES}} tag to split several VASP runs into separate calculations. The limit is dictated by the number of desired calculations.

Hampel: /* Optimizing the parallelization */

2025-10-17T12:12:24Z

Optimizing the parallelization

← Older revision		Revision as of 12:12, 17 October 2025
Line 36:		Line 36:
	* The '''k'''-point parallelization is efficient but requires additional [[:Category:Memory\|memory]]. Given sufficient [[:Category:Memory\|memory]], increase {{TAG\|KPAR}} up to the number of irreducible '''k''' points. Keep in mind that {{TAG\|KPAR}} should factorize the number of '''k''' points. This is especially important to reduce MPI communication between [[Optimizing_the_parallelization#Understanding_the_hardware\|NUMA domains]] and compute nodes. Read also {{TAG\|KPAR}} for more information.		* The '''k'''-point parallelization is efficient but requires additional [[:Category:Memory\|memory]]. Given sufficient [[:Category:Memory\|memory]], increase {{TAG\|KPAR}} up to the number of irreducible '''k''' points. Keep in mind that {{TAG\|KPAR}} should factorize the number of '''k''' points. This is especially important to reduce MPI communication between [[Optimizing_the_parallelization#Understanding_the_hardware\|NUMA domains]] and compute nodes. Read also {{TAG\|KPAR}} for more information.
	* For bulk systems with small unit cells ({{TAG\|NBANDS}} is small, NKPTS=no of k points is large), {{TAGO\|NCORE\|1}} and {{TAGO\|KPAR\|NKPTS}} is optimal.		* For bulk systems with small unit cells ({{TAG\|NBANDS}} is small, NKPTS=no of k points is large), {{TAGO\|NCORE\|1}} and {{TAGO\|KPAR\|NKPTS}} is optimal.
			* Look out for the '''LOOP''' timer for each electronic minimization step, and the overall '''LOOP+''' timer that includes all electronic minimization steps plus post-processing like forces.
	* For running on GPUs:		* For running on GPUs:
	** MPI ranks = no of GPUs		** MPI ranks = no of GPUs
	** {{TAGO\|KPAR\|no of GPUs}} if memory allows		** {{TAGO\|KPAR\|no of GPUs}} if memory allows
	** [[Combining_MPI_and_OpenMP\|OpenMP threading]] can be quite important for the parts that still run on the host, because usually no of GPUs is rather small.		** [[Combining_MPI_and_OpenMP\|OpenMP threading]] can be quite important for the parts that still run on the host, because usually no of GPUs is rather small.
			** small systems might perform very poorly in standard DFT (this is different for hybrid, GW, or BSE calculations). GPUs show their benefit when there are many bands or ions over which the code is parallelized.
	* Finally, use the {{TAG\|IMAGES}} tag to split several VASP runs into separate calculations. The limit is dictated by the number of desired calculations.		* Finally, use the {{TAG\|IMAGES}} tag to split several VASP runs into separate calculations. The limit is dictated by the number of desired calculations.

Hampel: /* Understanding the hardware */

2025-10-17T12:03:11Z

Understanding the hardware

← Older revision		Revision as of 12:03, 17 October 2025
Line 70:		Line 70:


	Another point that immediately follows from the concept of NUMA domains is that MPI ranks should not jump between NUMA domains; they would loose access to their previous cache. To avoid this one uses ''pinning'' or ''binding'' of MPI ranks. This becomes also important for combining MPI with OpenMP threading. Threads should live close to their parent rank and also not move to share the same caches, i.e. live in the same NUMA domain. Binding ranks and threads to specific cores or regions depends on the software setup. For example in the popular SLURM jobscript submission system this is done via the flag <code>--cpu-bind=cores</code> , for openmpi this is done via <code>--bind-to=core</code>, and for intel MPI via <code>-genv I_MPI_PIN=ON</code>. To bind threads use the environment variable <code>OMP_PROC_BIND</code> and set it to <code>true</code> or <code>close</code>. For threads it is also important to tell them where they should run via <code>OMP_PLACES</code>, which best is set to <code>cores</code>.		Another point that immediately follows from the concept of NUMA domains is that MPI ranks should not jump between NUMA domains; they would loose access to their previous cache. To avoid this one uses ''pinning'' or ''binding'' of MPI ranks. This becomes also important for combining MPI with OpenMP threading. Threads should live close to their parent rank and also not move to share the same caches, i.e. live in the same NUMA domain. Binding ranks and threads to specific cores or regions depends on the software setup. For example in the popular SLURM jobscript submission system this is done via the flag <code>--cpu-bind=cores</code> , for openmpi this is done via <code>--bind-to=core</code>, and for intel MPI via <code>-genv I_MPI_PIN=ON</code>. To bind threads use the environment variable <code>OMP_PROC_BIND</code> and set it to <code>true</code> or <code>close</code>. 
For threads it is also important to tell them where they should run via <code>OMP_PLACES</code>, which best is set to <code>cores</code>.
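The binding settings named above can be collected in a small launch helper. The following is a minimal Python sketch for a SLURM system, not an official recipe: the function names, the rank and thread counts, and the executable name <code>vasp_std</code> are assumptions for illustration, while <code>--cpu-bind=cores</code>, <code>OMP_PROC_BIND</code>, and <code>OMP_PLACES</code> are the settings mentioned in the text.

```python
import os


def openmp_binding_env(threads_per_rank: int) -> dict[str, str]:
    """Copy the current environment and add OpenMP thread-binding variables."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(threads_per_rank)
    env["OMP_PROC_BIND"] = "close"   # keep threads near their parent rank
    env["OMP_PLACES"] = "cores"      # one place per physical core
    return env


def slurm_command(n_ranks: int, threads_per_rank: int,
                  executable: str = "vasp_std") -> list[str]:
    """Build an srun command line that pins each MPI rank to cores."""
    return [
        "srun",
        f"--ntasks={n_ranks}",
        f"--cpus-per-task={threads_per_rank}",
        "--cpu-bind=cores",
        executable,
    ]


# Hypothetical layout: 16 ranks with 2 OpenMP threads each.
env = openmp_binding_env(threads_per_rank=2)
cmd = slurm_command(n_ranks=16, threads_per_rank=2)
print(" ".join(cmd))
# To launch for real, one could call: subprocess.run(cmd, env=env, check=True)
```

The sketch only assembles the command and the environment; whether these exact options are appropriate depends on the MPI library and the scheduler configuration of the cluster.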

			Now let's look at our example processor in light of the parallelization flags discussed in the section above. To make the best use of this hardware, groups of 8 cores should work as independently as possible. Hence, we should choose {{TAG\|NBANDS}} and the number of '''k''' points in the IBZ such that they are divisible by 8. A good starting point is then {{TAGO\|NCORE\|8}} and {{TAGO\|KPAR\|4}} (or 4 times the number of compute nodes with this processor).

			For benchmarking, start with these settings, then vary the parallelization flags and observe how the performance changes.
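A benchmark series around this starting point can be enumerated programmatically. The sketch below is a hypothetical Python helper, not part of VASP; it assumes one MPI rank per physical core, restricts {{TAG\|NCORE}} to divisors of the cores per NUMA domain, and restricts {{TAG\|KPAR}} to divisors of the total number of NUMA domains.

```python
def divisors(n: int) -> list[int]:
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def benchmark_grid(cores_per_domain: int, domains_per_node: int,
                   n_nodes: int = 1) -> list[tuple[int, int]]:
    """List (NCORE, KPAR) pairs to try in a benchmark.

    Assumptions of this sketch: one MPI rank per physical core, NCORE never
    exceeds one NUMA domain, and KPAR never exceeds the number of domains.
    """
    total_ranks = cores_per_domain * domains_per_node * n_nodes
    ncore_values = divisors(cores_per_domain)
    kpar_values = divisors(domains_per_node * n_nodes)
    return [(ncore, kpar)
            for ncore in ncore_values
            for kpar in kpar_values
            if total_ranks % (ncore * kpar) == 0]


# Example processor from the text: 4 NUMA domains with 8 cores each.
grid = benchmark_grid(cores_per_domain=8, domains_per_node=4)
for ncore, kpar in grid:
    print(f"NCORE = {ncore:2d}   KPAR = {kpar:2d}")
```

For the example processor the list contains the suggested starting point <code>NCORE = 8</code>, <code>KPAR = 4</code> together with smaller values of both tags, which gives a compact set of runs to compare via the '''LOOP''' and '''LOOP+''' timers.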

	==Related tags an articles==		==Related tags an articles==

Hampel at 11:48, 17 October 2025

2025-10-17T11:48:03Z

@@ Line 3: / Line 3: @@
 __TOC__
 ==Optimizing the parallelization==
@@ Line 70: / Line 41: @@
 ** [[Combining_MPI_and_OpenMP|OpenMP threading]] can be quite important for the parts that still run on the host, because usually no of GPUs is rather small.
 * Finally, use the {{TAG|IMAGES}} tag to split several VASP runs into separate calculations. The limit is dictated by the number of desired calculations.
 ==Related tags an articles==

Hampel: /* Optimizing the parallelization */

2025-10-17T10:31:20Z

Optimizing the parallelization

← Older revision		Revision as of 10:31, 17 October 2025
Line 62:		Line 62:
	For the common case of [[:Category: Electronic minimization\|electronic minimization]] calculations, the following rules of thumb apply:		For the common case of [[:Category: Electronic minimization\|electronic minimization]] calculations, the following rules of thumb apply:
	* Aim to set the number of ranks to the default value of {{TAG\|NBANDS}} divided by a small integer. Note that VASP will increase {{TAG\|NBANDS}} to accommodate the number of ranks.		* Aim to set the number of ranks to the default value of {{TAG\|NBANDS}} divided by a small integer. Note that VASP will increase {{TAG\|NBANDS}} to accommodate the number of ranks.
	* Choose {{TAG\|NCORE}} as a factor of the cores per node to avoid communicating between nodes for the FFTs. Mind that {{TAG\|NCORE}} cannot be set with [[Combining MPI and OpenMP\|OpenMP]] threading and/or the [[OpenACC GPU port of VASP\|OpenACC GPU port]].		* Choose {{TAG\|NCORE}} as a factor of the cores per node to avoid communicating between nodes for the FFTs. Mind that {{TAG\|NCORE}} cannot be set with [[Combining MPI and OpenMP\|OpenMP]] threading and/or the [[OpenACC GPU port of VASP\|OpenACC GPU port]]. Use the number of OpenMP threads in this case to for fine grained control over parallelization.
	* The '''k'''-point parallelization is efficient but requires additional [[:Category:Memory\|memory]]. Given sufficient [[:Category:Memory\|memory]], increase {{TAG\|KPAR}} up to the number of irreducible '''k''' points. Keep in mind that {{TAG\|KPAR}} should factorize the number of '''k''' points. This is especially important to reduce MPI communication between [[Optimizing_the_parallelization#Understanding_the_hardware\|NUMA domains]] and compute nodes. Read also {{TAG\|KPAR}} for more information.		* The '''k'''-point parallelization is efficient but requires additional [[:Category:Memory\|memory]]. Given sufficient [[:Category:Memory\|memory]], increase {{TAG\|KPAR}} up to the number of irreducible '''k''' points. Keep in mind that {{TAG\|KPAR}} should factorize the number of '''k''' points. This is especially important to reduce MPI communication between [[Optimizing_the_parallelization#Understanding_the_hardware\|NUMA domains]] and compute nodes. Read also {{TAG\|KPAR}} for more information.
	* For bulk systems with small unit cells ({{TAG\|NBANDS}} is small, NKPTS=no of k points is large), {{TAGO\|NCORE\|1}} and {{TAGO\|KPAR\|NKPTS}} is optimal.		* For bulk systems with small unit cells ({{TAG\|NBANDS}} is small, NKPTS=no of k points is large), {{TAGO\|NCORE\|1}} and {{TAGO\|KPAR\|NKPTS}} is optimal.
	* For running on GPUs~~, you may try~~		* For running on GPUs:
	** MPI ranks = no of GPUs		** MPI ranks = no of GPUs
	** {{TAGO\|KPAR\|no of GPUs}} if memory allows		** {{TAGO\|KPAR\|no of GPUs}} if memory allows

← Older revision		Revision as of 07:58, 24 October 2025
Line 38:		Line 38:
	* The '''k'''-point parallelization is efficient but requires additional [[:Category:Memory\|memory]]. Given sufficient [[:Category:Memory\|memory]], increase {{TAG\|KPAR}} up to the number of irreducible '''k''' points. Keep in mind that {{TAG\|KPAR}} should factorize the number of '''k''' points. This is especially important to reduce MPI communication between [[Optimizing_the_parallelization#Understanding_the_hardware\|NUMA domains]] and compute nodes. Read also {{TAG\|KPAR}} for more information.		* The '''k'''-point parallelization is efficient but requires additional [[:Category:Memory\|memory]]. Given sufficient [[:Category:Memory\|memory]], increase {{TAG\|KPAR}} up to the number of irreducible '''k''' points. Keep in mind that {{TAG\|KPAR}} should factorize the number of '''k''' points. This is especially important to reduce MPI communication between [[Optimizing_the_parallelization#Understanding_the_hardware\|NUMA domains]] and compute nodes. Read also {{TAG\|KPAR}} for more information.
	{{NB\|tip\|If the number of '''k''' points is a prime number (or does not factorize well), copy the {{FILE\|IBZKPT}} file to {{FILE\|KPOINTS}} and add zero-weigthed points.}}		{{NB\|tip\|If the number of '''k''' points is a prime number (or does not factorize well), copy the {{FILE\|IBZKPT}} file to {{FILE\|KPOINTS}} and add zero-weigthed points.}}
	* For bulk systems with small unit cells ({{TAG\|NBANDS}} is small, NKPTS=no of k points is large), {{~~TAGO~~\|NCORE\|1}} and {{~~TAGO~~\|KPAR\|NKPTS}} is optimal.		* For bulk systems with small unit cells ({{TAG\|NBANDS}} is small, NKPTS=no of k points is large), {{TAG\|NCORE\|1}} and {{TAG\|KPAR\|NKPTS}} is optimal.
	* Look out for the '''LOOP''' timer for each electronic minimization step, and the overall '''LOOP+''' timer that includes all electronic minimization steps plus post-processing like forces.		* Look out for the '''LOOP''' timer for each electronic minimization step, and the overall '''LOOP+''' timer that includes all electronic minimization steps plus post-processing like forces.
	* For running on GPUs:		* For running on GPUs:
	** MPI ranks = no of GPUs		** MPI ranks = no of GPUs
	** {{~~TAGO~~\|KPAR\|no of GPUs}} if memory allows		** {{TAG\|KPAR\|no of GPUs}} if memory allows
	** [[Combining_MPI_and_OpenMP\|OpenMP threading]] can be quite important for the parts that still run on the host, because usually no of GPUs is rather small.		** [[Combining_MPI_and_OpenMP\|OpenMP threading]] can be quite important for the parts that still run on the host, because usually no of GPUs is rather small.
	** {{TAG\|NSIM}} controls how many orbitals are worked on in parallel. For GPUs the default value of 4 is way too small. Increase the value. Values as large as {{~~TAGO~~\|NSIM\|32}} might be beneficial.		** {{TAG\|NSIM}} controls how many orbitals are worked on in parallel. For GPUs the default value of 4 is way too small. Increase the value. Values as large as {{TAG\|NSIM\|32}} might be beneficial.
	** Small systems might perform very poorly in standard DFT (this is different for hybrid, GW, or BSE calculations). GPUs show their benefit when there are many bands or ions over which the code is parallelized.		** Small systems might perform very poorly in standard DFT (this is different for hybrid, GW, or BSE calculations). GPUs show their benefit when there are many bands or ions over which the code is parallelized.
	* Finally, use the {{TAG\|IMAGES}} tag to split several VASP runs into separate calculations. The limit is dictated by the number of desired calculations.		* Finally, use the {{TAG\|IMAGES}} tag to split several VASP runs into separate calculations. The limit is dictated by the number of desired calculations.
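The {{TAG\|KPAR}} rule of thumb above can be explored with a small helper. This is only a minimal sketch: the function names are hypothetical, and it assumes that {{TAG\|KPAR}} should divide both the number of '''k''' points and the total number of MPI ranks. It lists candidate values and, for a badly factorizing '''k'''-point count, reports how many zero-weighted points would be needed to reach the next multiple of a chosen {{TAG\|KPAR}}.

```python
def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def kpar_candidates(nkpts, n_ranks):
    """KPAR values that divide the k-point count and the rank count.

    Assumption: KPAR should divide both numbers evenly.
    """
    return [d for d in divisors(nkpts) if n_ranks % d == 0]


def zero_weight_padding(nkpts, kpar):
    """Number of zero-weighted k points needed so that kpar divides the total."""
    return (-nkpts) % kpar


if __name__ == "__main__":
    # Hypothetical example: 12 irreducible k points on 64 ranks.
    print(kpar_candidates(12, 64))
    # A prime k-point count only allows KPAR=1 unless points are padded.
    print(kpar_candidates(13, 64), zero_weight_padding(13, 4))
```

Whether a larger {{TAG\|KPAR}} actually pays off still depends on the available memory, so the candidates are a starting point for benchmarking rather than a recommendation.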
Line 78:		Line 78:
	Another point that immediately follows from the concept of NUMA domains is that MPI ranks should not jump between NUMA domains. They would lose access to their previous cache. To avoid this, one uses ''pinning'' or ''binding'' of MPI ranks. This also becomes important when combining MPI with OpenMP threading. Threads should stay close to their parent rank and should not migrate, so that they share the same caches, i.e., live in the same NUMA domain. Binding ranks and threads to specific cores or regions depends on the software setup. For example, in the popular SLURM job submission system this is done via the flag <code>--cpu-bind=cores</code>, for Open MPI via <code>--bind-to=core</code>, and for Intel MPI via <code>-genv I_MPI_PIN=ON</code>. To bind threads, use the environment variable <code>OMP_PROC_BIND</code> and set it to <code>true</code> or <code>close</code>. For threads it is also important to tell them where they should run via <code>OMP_PLACES</code>, which is best set to <code>cores</code>.		Another point that immediately follows from the concept of NUMA domains is that MPI ranks should not jump between NUMA domains. They would lose access to their previous cache. To avoid this, one uses ''pinning'' or ''binding'' of MPI ranks. This also becomes important when combining MPI with OpenMP threading. Threads should stay close to their parent rank and should not migrate, so that they share the same caches, i.e., live in the same NUMA domain. Binding ranks and threads to specific cores or regions depends on the software setup. For example, in the popular SLURM job submission system this is done via the flag <code>--cpu-bind=cores</code>, for Open MPI via <code>--bind-to=core</code>, and for Intel MPI via <code>-genv I_MPI_PIN=ON</code>. To bind threads, use the environment variable <code>OMP_PROC_BIND</code> and set it to <code>true</code> or <code>close</code>. For threads it is also important to tell them where they should run via <code>OMP_PLACES</code>, which is best set to <code>cores</code>.

	In summary, to leverage the potential of this hardware in the best way, we want to have groups of 8 cores work as independently as possible. Hence, we should choose {{TAG\|NBANDS}} and the number of '''k''' points in the IBZ such that they are divisible by 8. Then, we will find a good setting with {{~~TAGO~~\|NCORE\|8}} and {{~~TAGO~~\|KPAR\|4}} (or 4 times the number of compute nodes with this processor that you have).		In summary, to leverage the potential of this hardware in the best way, we want to have groups of 8 cores work as independently as possible. Hence, we should choose {{TAG\|NBANDS}} and the number of '''k''' points in the IBZ such that they are divisible by 8. Then, we will find a good setting with {{TAG\|NCORE\|8}} and {{TAG\|KPAR\|4}} (or 4 times the number of compute nodes with this processor that you have).
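The arithmetic behind this summary can be written down as a short sketch. The function and parameter names are hypothetical, and the layout of 4 groups of 8 cores per node is taken from the example above; other processors will need different numbers.

```python
def numa_layout(cores_per_group, groups_per_node, n_nodes):
    """Suggest NCORE and KPAR so that each core group works on its own k points.

    Assumption: one k-point group per core group, as in the summary above.
    """
    return {"NCORE": cores_per_group, "KPAR": groups_per_node * n_nodes}


def fits_groups(nbands, nkpts, cores_per_group):
    """Check that NBANDS and the k-point count are divisible by the group size."""
    return nbands % cores_per_group == 0 and nkpts % cores_per_group == 0


if __name__ == "__main__":
    # One node with 4 groups of 8 cores gives NCORE=8 and KPAR=4.
    print(numa_layout(8, 4, 1))
    # Two such nodes give KPAR=8.
    print(numa_layout(8, 4, 2))
    # Hypothetical system: 64 bands and 16 irreducible k points.
    print(fits_groups(64, 16, 8))
```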

	For benchmarking, start with these settings. Then, try to change the parallelization tags and observe how performance changes.		For benchmarking, start with these settings. Then, try to change the parallelization tags and observe how performance changes.
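To compare benchmark runs, one option is to collect the '''LOOP+''' timers from each run and rank the settings by their mean wall time. The sketch below is an illustration only: it assumes the timer lines have the form <code>LOOP+:  cpu time  X: real time  Y</code> in the output text, and the function names and run labels are hypothetical.

```python
import re

# Assumed line format: "LOOP+:  cpu time   10.0000: real time   10.5000"
LOOP_PLUS = re.compile(r"LOOP\+:\s*cpu time\s*([0-9.]+):\s*real time\s*([0-9.]+)")


def loop_plus_real_times(output_text):
    """Extract the real (wall) time of every LOOP+ line in the given text."""
    return [float(m.group(2)) for m in LOOP_PLUS.finditer(output_text)]


def rank_runs(runs):
    """Sort runs by mean LOOP+ real time, fastest first.

    runs maps a label (e.g. the tag settings used) to the output text.
    Runs without any LOOP+ line are skipped.
    """
    means = []
    for label, text in runs.items():
        times = loop_plus_real_times(text)
        if times:
            means.append((label, sum(times) / len(times)))
    return sorted(means, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical output snippets from two benchmark runs.
    run_a = "LOOP:  cpu time 1.0: real time 1.1\nLOOP+:  cpu time 10.0: real time 10.5\n"
    run_b = "LOOP+:  cpu time 7.9: real time 8.0\n"
    for label, mean in rank_runs({"NCORE=8 KPAR=4": run_a, "NCORE=4 KPAR=8": run_b}):
        print(label, mean)
```

In practice the text would be read from the output file of each run, and changing one tag at a time makes it easier to attribute a change in timing to a specific setting.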