Dear all,
I am running a molecular dynamics (MD) simulation using the machine-learning force field (MLFF) module in VASP. The initial MLFF training and MD run worked perfectly on 192 cores, producing the ML_AB file as expected. However, when I attempt to restart the calculation to continue the MD run using the previously trained force field, I encounter an out-of-memory (OOM) error during initialization.
My restart input flags are:
Code:
ML_LMLFF = T
ML_ISTART = 1
ML_WTSIF = 2

Upon checking the ML_LOGFILE, I see the following line:
Code:
Total memory consumption : 3411.2

Given that my node has 336 GiB of total memory, multiplying ~3411 MiB per rank by 192 MPI ranks gives roughly 640 GiB, which far exceeds the available memory.
What I find confusing, though, is that the initial MLFF run (which trained the model and produced this ML_AB file) ran without any memory issue using the same number of cores and nodes.
My questions are:
* Am I misunderstanding how memory allocation works when restarting with ML_LMLFF = T?
* Does VASP allocate the full MLFF memory on each MPI rank during restart, whereas it may have distributed it differently during the initial training phase?
* Is there a recommended way to reduce memory usage when reloading an existing MLFF (e.g. fewer ranks, different flags)?
As a possible workaround, I also tried switching from pure MPI to hybrid MPI/OpenMP parallelization to reduce the number of MPI ranks. While this alleviates the memory issue, I observe a significant (factor 2-3) drop in computational efficiency. I would therefore prefer to avoid the hybrid parallelization scheme if possible.
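For context, the hybrid setup I tried looked roughly like the sketch below. The SLURM directives, rank/thread split, and binding variables are assumptions that will depend on the specific cluster and build; the idea is simply to cut the MPI rank count (and with it the per-rank MLFF allocations) while keeping all 192 cores busy with OpenMP threads:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48   # fewer MPI ranks than the 192 used before
#SBATCH --cpus-per-task=4      # 4 OpenMP threads per rank, 192 cores in total

# Pin OpenMP threads to the cores reserved for each rank
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores
export OMP_PROC_BIND=close

srun vasp_std
```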
Any insights or clarifications would be greatly appreciated!
Best regards,
Ivo