Dear all,
I am running a molecular dynamics (MD) simulation using the machine-learning force field (MLFF) module in VASP. The initial MLFF training and MD run worked perfectly on 192 cores, producing the ML_AB file as expected. However, when I attempt to restart the calculation to continue the MD run using the previously trained force field, I encounter an out-of-memory (OOM) error during initialization.
My restart input flags are:
Code:
ML_LMLFF = T
ML_ISTART = 1
ML_WTSIF = 2

Upon checking the ML_LOGFILE, I see the following line:
Code:
Total memory consumption : 3411.2

Given that my node has 336 GiB of total memory, multiplying ~3411 MiB per rank by 192 MPI ranks gives roughly 640 GiB, which far exceeds the available memory.
What I find confusing, though, is that the initial MLFF run (which trained the model and produced this ML_AB file) ran without any memory issue using the same number of cores and nodes.
My questions are:
* Am I misunderstanding how memory allocation works when restarting with ML_LMLFF = T?
* Does VASP allocate the full MLFF memory on each MPI rank during restart, whereas it may have distributed it differently during the initial training phase?
* Is there a recommended way to reduce memory usage when reloading an existing MLFF (e.g. fewer ranks, different flags)?
As a possible workaround, I also tried switching from pure MPI to hybrid MPI/OpenMP parallelization to reduce the number of MPI ranks. While this alleviates the memory issue, I observe a significant (factor 2-3) drop in computational efficiency. I would therefore prefer to avoid the hybrid parallelization scheme if possible.
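For context, the hybrid setup I tried looked roughly like the sketch below. The SLURM directives, rank/thread split, and binding variables are assumptions that will depend on the specific cluster and build; the idea is simply to cut the MPI rank count (and with it the per-rank MLFF allocations) while keeping all 192 cores busy with OpenMP threads:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48   # fewer MPI ranks than the 192 used before
#SBATCH --cpus-per-task=4      # 4 OpenMP threads per rank, 192 cores in total

# Pin OpenMP threads to the cores reserved for each rank
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores
export OMP_PROC_BIND=close

srun vasp_std
```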
Any insights or clarifications would be greatly appreciated!
Best regards,
Ivo