
AIMD Calculations terminate after a short time

Posted: Fri Oct 21, 2022 12:06 pm
by xiliang_lian
Hello,

I have been using the AIMD module to simulate NaF with a 64-atom cell. However, the calculations terminate after some time (around 1-3 ps, with a timestep of 1 fs), and I have been unable to find a solution. Could you help me with this?

The calculation terminates much more quickly when I use a larger supercell, so it might be a memory issue. I have found two types of error reports:
---an error due to memory
---an error without a specific message
I put the reports in the attachment.

I think the problem comes from using the GPU version of VASP. I used the CPU version of VASP before, and this was not a problem even with a larger system (96 atoms, many more electrons). Therefore, the INCAR, KPOINTS, and POTCAR files should be fine.

I have done the following tests but they have all failed. I will upload one example in the attachment.
1. Increase the number of nodes
2. Play with ntasks-per-node since this will affect the number of cores assigned to each k-point
3. Decrease KPAR. NCORE doesn't work for me because I use OpenMP
4. Use ALGO=Fast or VeryFast instead of ALGO = All
5. Use GPU nodes with 16 GB and 32 GB of memory

For the cluster I use, each GPU accelerated node consists of 40 CPUs (Intel Cascade Lake 6248 processors, usable memory 160 GB) and 4 Nvidia Tesla V100 SXM2 GPUs (either 16GB or 32GB, both have been tested). The VASP version is the newest 6.3.2 and was compiled by a supercomputer technician. In the attachment, you will also find one submission script I use.

I hope the information above and the attachment will help clarify my problem and give you sufficient information. If you need additional information, please let me know. Thanks a lot in advance.
error.zip
Best regards,
Xiliang

Re: AIMD Calculations terminate after a short time

Posted: Fri Oct 21, 2022 12:46 pm
by fabien_tran1
Hi,

According to your job.sh file, it seems that you are combining GPUs and MPI. The page OpenACC_GPU_port_of_VASP mentions that "Due to the use of NCCL, the OpenACC version of VASP may only be executed using a single MPI-rank per available GPU". Could this be the cause of the crash? Could you please try without MPI (#SBATCH --ntasks-per-node=1)?
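
As a rough illustration of the quoted rule, here is a minimal sketch that compares the number of MPI ranks per node with the number of GPUs per node. The function name check_rank_gpu_layout and the example layout are made up for illustration, and the check reflects only one reading of the quoted sentence (exactly one rank per GPU); it is not part of VASP or Slurm.

```shell
# Compare MPI ranks per node with GPUs per node.
# Usage: check_rank_gpu_layout NODES TASKS_PER_NODE GPUS_PER_NODE
check_rank_gpu_layout() {
    nodes=$1
    tasks_per_node=$2
    gpus_per_node=$3
    ranks=$(( nodes * tasks_per_node ))
    gpus=$(( nodes * gpus_per_node ))
    if [ "$tasks_per_node" -eq "$gpus_per_node" ]; then
        echo "ok: $ranks ranks on $gpus GPUs"
    else
        echo "mismatch: $ranks ranks on $gpus GPUs"
        return 1
    fi
}

# Assumed layout: 3 nodes with 4 ranks and 4 GPUs each.
check_rank_gpu_layout 3 4 4
```

With 40 ranks per node on a node with 4 GPUs, the same function would report a mismatch and return a non-zero status.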

Re: AIMD Calculations terminate after a short time

Posted: Fri Oct 21, 2022 1:26 pm
by xiliang_lian
Hello,

Thanks a lot for your answer.

I also checked the page you shared and ran a test accordingly. I set ntasks-per-node to 4 because each node has 4 GPUs available. The calculation reports the following:
running 12 mpi-ranks, with 10 threads/rank
distrk: each k-point on 3 cores, 4 groups
distr: one band on 1 cores, 3 groups
OpenACC runtime initialized ... 12 GPUs detected
vasp.6.3.2 27Jun22 (build Sep 29 2022 16:11:41) gamma-only
With this, I no longer get the warning that appears in the standard output I attached. However, this doesn't work either and results in the same error. Do you have further suggestions?

Best wishes,
Xiliang

Re: AIMD Calculations terminate after a short time

Posted: Fri Oct 21, 2022 1:47 pm
by fabien_tran1
OK, but have you nevertheless tried "#SBATCH --ntasks-per-node=1"? I think that "#SBATCH --gres=gpu:4" should be enough to have all 4 GPUs on each node used.

Re: AIMD Calculations terminate after a short time

Posted: Mon Oct 24, 2022 7:20 am
by xiliang_lian
Hello,

Thanks for the explanation. Yes, in the meantime, I submitted another test. It got better but didn't solve the problem. I was able to get a trajectory of more than 4 ps (timestep 1 fs), but memory issues led to the termination of the calculation again. I attached the standard output file. Can you please suggest other solutions?

In addition, the reason I didn't do the test with ntasks-per-node=1 is that it is very inefficient. I did a comparison earlier and found that this setting is much slower than ntasks-per-node=40; the latter is almost three times faster. If you could also comment on this, it would be very helpful for me.

Thank you very much for taking the time to look into my problem.

Best wishes,
Xiliang
run.o345202.zip

Re: AIMD Calculations terminate after a short time

Posted: Mon Oct 24, 2022 7:37 am
by fabien_tran1
Hi,

For how many minutes (or hours) did the calculation with ntasks-per-node=40 run before it crashed? Did the calculation with ntasks-per-node=1 also crash, or did you stop it?

Re: AIMD Calculations terminate after a short time

Posted: Mon Oct 24, 2022 7:43 am
by xiliang_lian
Hi,

Sorry, I didn't keep track of the time. For ntasks-per-node=40, it normally crashes after around 3 hours. The error file I sent you in the last message comes from the run with ntasks-per-node=1, which crashed after around 4-6 hours. I didn't stop it. It stopped because of memory problems.

Best,
Xiliang

Re: AIMD Calculations terminate after a short time

Posted: Mon Oct 24, 2022 2:54 pm
by fabien_tran1
I will run the calculation myself to have a closer look. Meanwhile, could you try one calculation with
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=10
as well as setting the OMP_NUM_THREADS environment variable to 10, and commenting out (or deleting) this line:
#SBATCH --hint=nomultithread
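
Put together, the suggested settings might look like the sketch below, which writes them to a file that could then be submitted with sbatch. The file name job_sketch.sh and the launch line "srun vasp_gam" are assumptions for illustration; the "--gres=gpu:4" line is taken from earlier in this thread, and the module and environment setup from the original job.sh is not shown.

```shell
# Write a job script with the suggested layout: 1 node, 4 MPI ranks,
# 10 OpenMP threads per rank.
cat > job_sketch.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:4

# One OpenMP thread per CPU reserved for each rank.
export OMP_NUM_THREADS=10

# Assumed launch command; use whatever the original job.sh uses.
srun vasp_gam
EOF

# Submit on the cluster with: sbatch job_sketch.sh
```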

Re: AIMD Calculations terminate after a short time

Posted: Mon Oct 24, 2022 3:06 pm
by xiliang_lian
Okay, thank you very much. I will post the result as soon as I get it.

Re: AIMD Calculations terminate after a short time

Posted: Mon Oct 24, 2022 3:35 pm
by fabien_tran1
Which version of the NVIDIA compiler was used?

Re: AIMD Calculations terminate after a short time

Posted: Mon Oct 24, 2022 5:46 pm
by xiliang_lian
Hi, I use the following dependencies:

 nvidia-compilers/21.9 cuda/11.2 openmpi/4.0.5-cuda intel-mkl/2020.4

Re: AIMD Calculations terminate after a short time

Posted: Tue Oct 25, 2022 6:46 am
by xiliang_lian
Hello,

I have finished the test. Again memory problems led to the termination after around 4 ps. I modified the calculation by changing the pseudopotential of Na to include p electrons (ZVAL=7 instead of ZVAL=1) but it should not matter.

In case you have doubts about the setup, I have attached the results. Thanks again for your kind help.

Best regards,
Xiliang
error1.zip

Re: AIMD Calculations terminate after a short time

Posted: Tue Oct 25, 2022 7:52 am
by fabien_tran1
Hi,

My calculation using your input files finished properly (the output files are attached). I used vasp_gam of VASP-6.3.2, two GPUs (16 GB each) and OMP_NUM_THREADS=10. Thus, I could not reproduce your problem.

Could you upload the makefile.include that was used to compile the GPU version of VASP-6.3.2 that you used?

Another thing you could do, if possible, is to run the calculation again (for just a few minutes) and execute the command nvidia-smi once while it is running to see the memory usage of the GPUs (and show us what is displayed).
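
For a more compact snapshot than the default nvidia-smi table, something like the following sketch could be run on the compute node while the job is active. The helper name gpu_mem_percent is made up for illustration; the nvidia-smi query options are standard, but the output on your cluster may differ.

```shell
# Turn "index, used, total" CSV lines (values in MiB) into a per-GPU summary.
gpu_mem_percent() {
    awk -F', *' '{ printf "GPU %s: %d%% of %s MiB used\n", $1, 100 * $2 / $3, $3 }'
}

# One snapshot, taken while the job is running on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,memory.used,memory.total \
               --format=csv,noheader,nounits | gpu_mem_percent
else
    echo "nvidia-smi not found on this machine" >&2
fi
```

For example, an input line such as "0, 8000, 16000" (made-up values) would be summarised as "GPU 0: 50% of 16000 MiB used".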

Also, was the test suite run on the GPU after the installation of VASP?

PS: In your last calculation, the ordering of the atoms in the POTCAR is not correct.

Re: AIMD Calculations terminate after a short time

Posted: Tue Oct 25, 2022 8:37 am
by xiliang_lian
Hi,

Thanks a lot for your work and also for your comments on my last mistake.

I will ask the technician of the cluster to send me the makefile.include file. I will also check with our technician whether they have run the test suites. This might take some time but I will come back to you ASAP. I will also check the memory use and send it together next time.

Thanks again.

Best,
Xiliang

Re: AIMD Calculations terminate after a short time

Posted: Wed Oct 26, 2022 10:38 am
by xiliang_lian
Hello,

I got the "makefile.include" file, which you will find attached. The output of nvidia-smi is also included in the zip file. I was not able to confirm whether our technician has run the test suite because he is not available. Can you please have a look and let me know whether there is a problem with how the code was compiled, or any other issue? Thanks in advance.

Best regards,
Xiliang
error2.zip