MD-on the fly-out of memory

Posted: Sat Nov 26, 2022 8:46 am
by gniding
Dear everyone:
I'm running an MD simulation using the on-the-fly method, but it runs out of memory after a few steps. The error messages are shown below:

slurmstepd: error: Detected 2 oom-kill event(s) in StepId=272349.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: cpu04: task 0: Out Of Memory
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=272349.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=272349.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=272349.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=272349.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=272349.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: First task exited 30s ago
srun: StepId=272349.0 tasks 1-38,40-90,92-112,114-156,158-182,184-191: running
srun: StepId=272349.0 tasks 0,39,91,113,157,183: exited abnormally
srun: launch/slurm: _step_signal: Terminating StepId=272349.0
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 272349.0 ON cpu04 CANCELLED AT 2022-11-26T13:18:22 ***

If you have any ideas, please let me know.
Thank you!

Re: MD-on the fly-out of memory

Posted: Mon Nov 28, 2022 7:27 am
by merzuk.kaltak
Please upload an error report (INCAR, POSCAR, POTCAR, KPOINTS, OUTCAR, OSZICAR).

Re: MD-on the fly-out of memory

Posted: Mon Nov 28, 2022 12:57 pm
by gniding
Thank you

Re: MD-on the fly-out of memory

Posted: Mon Nov 28, 2022 1:29 pm
by ferenc_karsai
Please also post your ML_LOGFILE and your ML_AB file.

Have you compiled with shared memory MPI (-Duse_shmem)?

Re: MD-on the fly-out of memory

Posted: Tue Nov 29, 2022 2:17 am
by gniding
Good morning!
No, the VASP 6 build I am using was not compiled with -Duse_shmem.

Re: MD-on the fly-out of memory

Posted: Fri Dec 02, 2022 12:28 pm
by ferenc_karsai
I think you are simply running out of memory.

The beginning of the ML_LOGFILE shows the estimated memory required per core. The estimate is computed before allocation, so it is printed even if your calculation crashes due to insufficient memory.
In your case it is 12 GB per core. Do you really have that much memory available?
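
To make the question concrete, here is a rough back-of-the-envelope sketch in Python. The 12 GB per core comes from the ML_LOGFILE estimate above, and the 192 MPI ranks are read off the srun output (tasks 0-191). The number of ranks per node and the memory per node are not given in this thread, so the values below are purely hypothetical placeholders; replace them with the numbers for your cluster.

```python
def memory_per_node_gb(per_rank_gb, ranks_per_node):
    """Memory needed on one node if every rank needs per_rank_gb."""
    return per_rank_gb * ranks_per_node


def fits_on_node(per_rank_gb, ranks_per_node, node_memory_gb):
    """True if the estimated demand does not exceed the node's memory."""
    return memory_per_node_gb(per_rank_gb, ranks_per_node) <= node_memory_gb


per_rank_gb = 12.0       # estimate reported at the top of ML_LOGFILE
total_ranks = 192        # tasks 0-191 in the srun output

# ASSUMPTIONS: not stated in the thread, adjust to your machine.
ranks_per_node = 48      # hypothetical
node_memory_gb = 256.0   # hypothetical

total_gb = per_rank_gb * total_ranks
per_node_gb = memory_per_node_gb(per_rank_gb, ranks_per_node)

print(f"Estimated demand over all ranks: {total_gb:.0f} GB")
print(f"Estimated demand per node:       {per_node_gb:.0f} GB")
print("Fits on one node:", fits_on_node(per_rank_gb, ranks_per_node, node_memory_gb))
```

If the per-node demand exceeds what the node (or the Slurm cgroup limit) provides, the oom-kill events above are the expected outcome.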

How to bring the memory consumption down:
-) First of all, compile with shared-memory MPI (-Duse_shmem). This will hugely reduce the required memory.
-) You can also use more cores. For many of the large arrays, the per-core share shrinks almost linearly with the number of cores.
-) You have set ML_MB=13000. That is a really high number; I have never run a calculation with that size. You can set a lower maximum value for ML_MB and then set "ML_LBASIS_DISCARD=.TRUE.". You will probably have to retrain from scratch and apply the same settings in the previous calculations. Please also read our best practices page, where ML_LBASIS_DISCARD is explained further:
https://www.vasp.at/wiki/index.php/Best ... rce_fields
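
Before resubmitting, it can help to double-check what the INCAR actually contains. The following is a minimal Python sketch, not an official VASP tool: the parser handles only simple `TAG = value` lines with `#` or `!` comments and `;` separators, and the sample INCAR text is hypothetical apart from the ML_MB value quoted in this thread.

```python
def parse_incar(text):
    """Return a dict mapping upper-cased INCAR tags to raw string values."""
    tags = {}
    for line in text.splitlines():
        # Drop comments introduced by '#' or '!'.
        for marker in ("#", "!"):
            line = line.split(marker, 1)[0]
        # Several assignments may share a line, separated by ';'.
        for part in line.split(";"):
            if "=" in part:
                key, value = part.split("=", 1)
                tags[key.strip().upper()] = value.strip()
    return tags


def is_true(value):
    """Interpret Fortran-style logicals such as .TRUE. or T."""
    return value.strip().upper().lstrip(".").startswith("T")


# Hypothetical sample; only ML_MB = 13000 is taken from the thread.
sample = """
ML_LMLFF = .TRUE.
ML_MB = 13000   ! value quoted in this thread
"""

tags = parse_incar(sample)
ml_mb = int(tags["ML_MB"])
# Only checks whether the tag is set explicitly; the default when it is
# absent depends on the VASP version, so consult the wiki for your build.
discard_set = "ML_LBASIS_DISCARD" in tags and is_true(tags["ML_LBASIS_DISCARD"])

print(f"ML_MB = {ml_mb}")
print(f"ML_LBASIS_DISCARD explicitly enabled: {discard_set}")
```

To check a real input file, read it with `open("INCAR").read()` and pass the text to `parse_incar` in place of `sample`.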