VASP 6.3.0 ACC OMP memory leak

Dankomaister
Newbie
Posts: 16
Joined: Sat Feb 13, 2016 4:39 pm
License Nr.: 20-0400 5-1605

#1 Post by Dankomaister » Tue May 03, 2022 1:36 am

Hi,

I have compiled the ACC+OMP version of VASP 6.3.0 with Intel MKL (makefile.include.nvhpc_ompi_mkl_omp_acc), using these compilers / libraries:

CUDA/11.4.4
NVHPC/22.3
OpenMPI/4.1.3
imkl/2021.4.0
HDF5/1.12.1

It runs, but unfortunately there is a huge memory leak (on the host side), as seen in the attached picture.
Any ideas what could cause this? I have tried playing around with different versions of the compilers / libraries but can't seem to solve it :/ Perhaps there is some bug?

I attached the makefile.include and one of the test systems for which I noticed the memory leak.

/Daniel

andreas.singraber
Global Moderator
Posts: 99
Joined: Mon Apr 26, 2021 7:40 am

Re: VASP 6.3.0 ACC OMP memory leak

#2 Post by andreas.singraber » Mon May 09, 2022 2:13 pm

Hello!

Thanks for reporting this memory leak. I have already tried to reproduce it on our machines. Unfortunately, I was not able to run exactly the same job because we do not have enough GPUs available. I also cannot use the OpenMP parallelization together with OpenACC at the moment. To test at least a similar job, I modified the INCAR file (standard IALGO, smaller ENCUT, vdW-DF functionals disabled). Even with this strongly modified setup I did get a (smaller) memory leak, although it is of course not clear whether it has the same origin as in your case. The leak I observed originates in libnvf.so, which indicates that it resides not in our code but somewhere in the NVIDIA libraries. We are now getting in contact with NVIDIA for further support.

Can you please tell me which GPUs you were using? Did you observe a memory leak also without additional OpenMP parallelization? I am a bit confused about the memory graph you posted. It would be good if I could estimate the amount of leaked memory from the graph but I am not sure it really belongs to the output files you prepared. The start/end time does not match and the increase of memory in the graph lasts over 3 hours while the runtime in OUTCAR indicates about an hour of execution time. Is it just from another (longer) run with identical settings?
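
A growth rate, unlike a total, can be compared between runs of different length, such as the one-hour and three-hour runs mentioned above. The following is a minimal sketch of estimating that rate with an ordinary least-squares slope. It assumes memory samples are available as (time in seconds, memory in KiB) pairs; `leak_rate` is a hypothetical helper name.

```python
def leak_rate(samples):
    """Least-squares slope of memory versus time.

    samples: list of (time_s, memory_kib) pairs.
    Returns the fitted growth rate in KiB per second.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    # Slope = covariance(time, memory) / variance(time)
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("all samples share the same timestamp")
    return cov / var
```

Multiplying the returned rate by the wall time of a run gives a rough estimate of the total leaked memory, provided the growth is close to linear.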

Thank you!

All the best,
Andreas Singraber

Re: VASP 6.3.0 ACC OMP memory leak

#3 Post by andreas.singraber » Mon May 09, 2022 3:37 pm

Hello again!

It seems our attempt to reproduce the memory leak was not yet successful. The memory increase in the modified setup I described in my last post turned out to stabilize after a while, so we probably simplified your example too much. We will continue our efforts, stay tuned...
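
To tell an increase that stabilizes from one that keeps growing, one simple option is to look only at the final part of a run. This is a sketch under assumptions: the samples are time-ordered (time in seconds, memory in KiB) pairs, and `tail_growth` and its `tail_fraction` parameter are names invented for this example.

```python
def tail_growth(samples, tail_fraction=0.5):
    """Memory change (KiB) between the first and last sample of the tail.

    samples: time-ordered list of (time_s, memory_kib) pairs.
    tail_fraction: share of the samples, counted from the end, to inspect.
    """
    if not 0 < tail_fraction <= 1:
        raise ValueError("tail_fraction must be in (0, 1]")
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # Keep at least two samples in the tail so a difference can be taken.
    start = min(int(len(samples) * (1 - tail_fraction)), len(samples) - 2)
    tail = samples[start:]
    return tail[-1][1] - tail[0][1]
```

A value near zero suggests the usage has settled, while a value comparable to the growth early in the run suggests a continuing leak. What counts as "near zero" depends on the job and is left to the reader.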

Best,
Andreas

Re: VASP 6.3.0 ACC OMP memory leak

#4 Post by Dankomaister » Tue May 10, 2022 5:12 am

Hi Andreas!

Thanks a lot for having a look at this!!!

I can confirm that the graph in my previous post is the same system with just longer runtime as you suggested.
I have also done more tests on my side using the same system but a simplified makefile.include (attached) which includes only the necessary changes to compile on our system, which is a small local HPC with 20 GPU nodes each with two Nvidia K80 GPUs. This is the system I'm trying to compile the VASP ACC version for. But I have also seen this memory leak on a different system with Nvidia V100 GPUs running a different (bigger) calculation. So my guess is that this is not related to our specific GPUs or my specific calculation.

However, I did find out that the memory leak is related to the use of OpenMP parallelization.
When setting OMP_NUM_THREADS=1 the memory usage is stable compared to OMP_NUM_THREADS=8 as seen in the figure.
This and other problems I have experienced with NVHPC+MKL+OpenMP parallelization make me think it's related to OpenMP.
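
The comparison between thread counts can be scripted so that both runs use the same command. The sketch below is only an illustration: `run_with_threads` is a hypothetical helper, and the command shown is a stand-in that echoes the variable rather than a real VASP launch line.

```python
import os
import subprocess
import sys


def run_with_threads(command, num_threads):
    """Run command with OMP_NUM_THREADS set and return the completed process."""
    env = dict(os.environ)  # copy so the parent environment stays unchanged
    env["OMP_NUM_THREADS"] = str(num_threads)
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


# Stand-in command that only prints the variable; replace it with the
# launch line actually used on your cluster.
echo_cmd = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
for threads in (1, 8):
    result = run_with_threads(echo_cmd, threads)
    print(threads, "->", result.stdout.strip())
```

When launching through Open MPI, the variable may also need to be forwarded to the ranks (for example with `mpirun -x OMP_NUM_THREADS`), depending on how the job is started.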

For example, linking MKL as follows

-L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_openmpi_lp64 -liomp5 -lpthread -lm -ldl

results in a segmentation fault. But linking with the following

-L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_pgi_thread -lmkl_core -lmkl_blacs_openmpi_lp64 -pgf90libs -mp -lpthread -lm -ldl

or

-Mmkl -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64

works fine (except for the memory leak).
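
One possible (unconfirmed) explanation for the segmentation fault is that the first link line puts two OpenMP runtimes into the same executable: Intel's libiomp5 from `-liomp5` and the NVHPC runtime that comes with compiling under `-mp`. A quick way to check is to scan the `ldd` output of the binary. The sketch below does this under stated assumptions: the library name fragments in `known` are guesses at common runtime names, and `openmp_runtimes` is a hypothetical helper.

```python
def openmp_runtimes(ldd_output):
    """Return the sorted OpenMP runtime names found in `ldd` output text."""
    # Assumed names: libiomp5 (Intel), libnvomp (NVHPC), libgomp (GNU),
    # libomp (LLVM). Adjust the tuple for your toolchain.
    known = ("libiomp5", "libnvomp", "libgomp", "libomp")
    found = set()
    for line in ldd_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # First field is the library name, sometimes given as a full path.
        name = fields[0].rsplit("/", 1)[-1]
        for prefix in known:
            if name.startswith(prefix + "."):
                found.add(prefix)
    return sorted(found)
```

If the function returns more than one name for the output of `ldd` on the VASP executable, mixed runtimes are worth ruling out before looking elsewhere.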

It would be very nice if we can solve this problem since running without OpenMP parallelization is very detrimental.
Performance is around 50% higher using OMP_NUM_THREADS=8 compared to OMP_NUM_THREADS=1.

/Daniel

Re: VASP 6.3.0 ACC OMP memory leak

#5 Post by Dankomaister » Tue May 24, 2022 8:27 am

Any updates on fixing this?

/Daniel

Re: VASP 6.3.0 ACC OMP memory leak

#6 Post by andreas.singraber » Tue May 24, 2022 11:49 am

Hello Daniel!

I am sorry, but there is not really anything we can say yet... we are waiting for NVIDIA to have a look at it. In the meantime I tried on 2 GPUs with 8 threads per rank. I could see some leakage but not as massive as you reported. The origin of this is again __fort_gmalloc_without_abort in libnvf.so but I am not so sure about this reporting...
[Attached image: memleak.png]
Please stay tuned...

Best,
Andreas

Re: VASP 6.3.0 ACC OMP memory leak

#7 Post by Dankomaister » Fri Aug 05, 2022 2:53 pm

No updates on this?
Just wanted to say we still have this problem affecting our users :(

/Daniel
