ML_MB and MCONF getting too large, exceeding memory

#1 Post by **Poonam_Chauhan** » Mon Jun 22, 2026 7:28 am

Hi all,
I'm trying to train a force field for an interface between two solids at different temperatures. I first trained the force field separately on the bulk of each of the two solids, and then on the interface between them. Training on the bulk systems seemed to run fine at the different temperatures, but whenever I train on the interface, the Bayesian error is very high. This makes ML_MB grow so large that it eventually exceeds my system memory. I'm using an HPC cluster with up to 700 GB of RAM and 48 cores per node. I want to solve this memory issue without compromising accuracy too much.
My input file is as follows:

SYSTEM=interface
#Start parameter
ISTART = 0
ICHARG = 2
ISMEAR = 0
ISYM = 0
SIGMA = 0.04
ENCUT = 500
PREC =Normal
LREAL = Auto
ALGO= Fast
EDIFF = 1E-6
IVDW = 12
LASPH = .TRUE.
#MD SETTINGS
IBRION = 0
ISIF = 3
NSW = 20000
POTIM = 2
NCORE = 4
#THERMOSTAT
MDALGO = 3
TEBEG = 300
TEEND = 300
LANGEVIN_GAMMA = 10 10 10 10

#MACHINE LEARNING
ML_ISTART = 1
ML_LMLFF = .TRUE.
ML_MODE = TRAIN
ML_RCUT1 = 6.0
ML_RCUT2 = 5.0
ML_EPS_LOW = 1E-7
ML_ICRITERIA = 1
ML_MCONF = 6000
#ML_CX = -0.1
ML_MB = 4000

#2 Post by **michael_wolloch** » Mon Jun 22, 2026 9:37 am

Dear Poonam Chauhan,

Please provide more input and output files to create a minimal reproducible example.
Especially the ML_LOGFILE, but also all input files (including ML_AB files from the bulk trainings if they are not too large). You can upload them to the forum as a compressed tarball (*.tar.gz).

It would also be very helpful if you could provide the exact steps you took to train the separate bulk systems and how you combined them for the interface.

Thanks, Michael

#3 Post by **Poonam_Chauhan** » Wed Jun 24, 2026 5:54 am

These are my input and output files. To generate the ML_AB file for the two bulk systems, my workflow was as follows:
1. First, I trained the MLFF for one bulk system at different temperatures, using the ML_ABN file from the training at a lower temperature as the input ML_AB file for the next higher temperature, and so on.
2. After fully training the MLFF for that bulk system, I took the generated ML_ABN file as the input ML_AB for the other bulk system.

ML_AB_for_bulk.tar

Interface_file.tar

There are two sulfurs with different oxidation states in the interface so I also renamed them to differentiate between the two.

NOTE: The ML_AB file for the bulk is the one produced as a minimal working example.

#4 Post by **michael_wolloch** » Thu Jun 25, 2026 10:35 am

Hello,

Thanks for uploading the files.

I noticed that the POTCAR in your Interface directory does not match the POSCAR.
It contains:
Na, P, S, and S pseudopotentials (matching your first bulk system, which has two sulfurs with different oxidation states).
However, your POSCAR has:
Na, P, S, Si (this should probably be Na, P, S, S8, Si, to also match the ML_AB file you are using).

So I would append the Si POTCAR, and make sure that the POSCAR you are using for the interface is set up to also split the sulfur atoms with respect to oxidation state!

When I run your calculation with the input you gave me, it reports in the stdout:

WARNING: type information on POSCAR and POTCAR are incompatible
POTCAR overwrites the type information in POSCAR
typ   4 type information:  Si S

So this fix is important for getting correct results, but it is not the source of the crash: even after fixing it, I run out of memory immediately when using more than 16 MPI ranks on a node with 500 GB of RAM.
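A quick consistency check along these lines can catch such a mismatch before a run is started. The following is only a minimal sketch in Python: it assumes a VASP 5 style POSCAR with the species names on line 6 and reads the element symbols from the TITEL lines of the POTCAR. The function names and the example strings (including the placeholder dates) are made up for illustration. Note that renamed labels such as S8 will also be flagged and need to be reviewed by hand.

```python
import re
from itertools import zip_longest


def poscar_species(poscar_text):
    """Return the species labels from line 6 of a VASP 5 style POSCAR."""
    return poscar_text.splitlines()[5].split()


def potcar_species(potcar_text):
    """Return the element symbols found in the TITEL lines of a POTCAR."""
    species = []
    for line in potcar_text.splitlines():
        match = re.match(r"\s*TITEL\s*=\s*\S+\s+(\S+)", line)
        if match:
            # Strip suffixes such as "_pv" or "_sv": "Na_pv" -> "Na"
            species.append(match.group(1).split("_")[0])
    return species


def species_mismatches(poscar_text, potcar_text):
    """List (type index, POSCAR label, POTCAR label) where the two differ."""
    pairs = zip_longest(poscar_species(poscar_text), potcar_species(potcar_text))
    return [(i, a, b) for i, (a, b) in enumerate(pairs, start=1) if a != b]


# Illustrative inputs only; they mimic the ordering discussed in this thread.
POSCAR_EXAMPLE = """interface (illustrative)
1.0
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
Na P S Si
3 1 4 8
Direct
"""

POTCAR_EXAMPLE = """   TITEL  = PAW_PBE Na_pv 01Jan2000
   TITEL  = PAW_PBE P 01Jan2000
   TITEL  = PAW_PBE S 01Jan2000
   TITEL  = PAW_PBE S 01Jan2000
"""

if __name__ == "__main__":
    for index, pos, pot in species_mismatches(POSCAR_EXAMPLE, POTCAR_EXAMPLE):
        print(f"type {index}: POSCAR has {pos}, POTCAR has {pot}")
```

For real files, the two strings would be read from the POSCAR and POTCAR in the run directory instead of the hard-coded examples.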

A couple of other notes:

ML_ISTART is deprecated and has been replaced by ML_MODE. You are using both, which is unnecessary and a bit confusing. Get rid of ML_ISTART!
ENCUT=500 seems a bit high when the highest ENMAX in your POTCAR is 258. 350 should be enough. Not a mistake, but probably inefficient.
Your smearing width SIGMA is very small. This can be bad for electronic convergence. I would increase it to 0.1 unless you have a specific reason for keeping it so low, although the tests I was running converged reasonably well with your setting.
You set ML_MCONF = 6000 in your INCAR file. This increases memory consumption and is part of your problem. Without it, the default would be the number of configurations in the ML_AB file plus 1500, i.e. 3703 + 1500 = 5203. Since the memory demand grows linearly with ML_MCONF, this should be avoided unless you have very specific reasons for increasing the parameter. Since you are running at the same temperature, it is doubtful that you will need so many additional configurations.
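The arithmetic in the last note can be written out as a small Python sketch. It only restates the numbers given above (3703 configurations in the ML_AB file, a headroom of 1500, and the requested value of 6000). The function names are hypothetical, and the linear relation applies only to the arrays that scale with ML_MCONF, not to the total memory of the job.

```python
def default_ml_mconf(n_conf_in_ml_ab, headroom=1500):
    """Default ML_MCONF as described above: configurations in ML_AB plus 1500."""
    return n_conf_in_ml_ab + headroom


def relative_increase(requested, default):
    """Fractional growth of ML_MCONF-sized arrays, assuming linear scaling."""
    return requested / default - 1.0


if __name__ == "__main__":
    default = default_ml_mconf(3703)
    extra = relative_increase(6000, default)
    print(f"default ML_MCONF: {default}")
    print(f"ML_MCONF = 6000 is about {extra:.1%} larger than the default")
```

Removing the ML_MCONF line from the INCAR file lets VASP fall back to the default on its own.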

Now for the main issue, the unexpectedly large memory consumption:

Your ML_LOGFILE gives you an estimate for total memory consumption of about 10.5 GB, so it should fit comfortably on your machine with 700 GB RAM. However, this number is only a rough estimate, AND it is only for the master MPI rank (rank 0). Since VASP is parallelized mainly over MPI, a lot of the arrays are allocated separately for each MPI rank, and memory consumption scales with the number of MPI ranks.

This can be circumvented in part by compiling the code with shared memory support. In your case, the relevant precompiler option is

-Duse_shmem

which reduces the memory footprint for MLFF and GW calculations.

If you have not already, it also helps to use the _gam version of VASP, since you are only dealing with the Gamma point in your system.

I was able to run your system for several hundred steps without issues on 16 MPI ranks with the standard executable, using a maximum of 490 GB of RAM. Recompiling with

-Duse_shmem
-Dsysv

and using the gamma-only version, I was able to use 64 MPI ranks and only use 441 GB of RAM. So I am confident that you will be able to get your machine to run it easily with 48 MPI ranks once you enable shared memory support.
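To see why the number of MPI ranks matters so much for the standard executable, the measurement above (490 GB on 16 ranks) can be extrapolated with a crude Python sketch. It assumes that memory grows strictly linearly with the number of ranks, which is a simplification rather than how VASP actually allocates memory, and the function names are hypothetical. The same extrapolation is not meaningful for the shared-memory build, where part of the memory is shared between ranks.

```python
def per_rank_gb(total_gb, n_ranks):
    """Average memory per MPI rank, assuming an even split across ranks."""
    return total_gb / n_ranks


def max_ranks(node_ram_gb, gb_per_rank):
    """Largest number of ranks that fits into the node under linear scaling."""
    return int(node_ram_gb // gb_per_rank)


if __name__ == "__main__":
    # Standard executable, as measured above: 16 ranks used at most 490 GB.
    per_rank = per_rank_gb(490, 16)
    print(f"per rank: {per_rank:.3f} GB")
    print(f"48 ranks would need about {per_rank * 48:.0f} GB")
    print(f"ranks fitting into 700 GB: {max_ranks(700, per_rank)}")
```

Under this assumption, 48 ranks without shared memory would need well over the 700 GB available, which is consistent with the crash, while the shared-memory run on 64 ranks stayed at 441 GB.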

Let me know if this resolves your problems.
cheers, Michael

#5 Post by **Poonam_Chauhan** » Thu Jun 25, 2026 2:52 pm

Dear Michael,
Sorry for the confusion. The Si in the POSCAR is actually supposed to be S8, since it is also sulfur, but in a different oxidation state from the sulfur in Na3PS4. The S8 in the POSCAR was changed to Si just for visualization purposes. My VASP is already compiled with -Duse_shmem. These are the compiler options for my VASP build:
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
-DMPI -DMPI_BLOCK=8000 -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Dtbdyn \
-Dfock_dblbuf\
-Duse_shmem

My main issue is the high Bayesian error and the high RMSE force error while training the force field, together with the continuous increase of ML_MB and ML_MCONF for the interface system, which do not seem to converge at all.
The provided ML_AB file is not the actual one trained for the bulk, as that file was too large to upload.

Corrected_Interface_file.tar

Kindly help me with the high RMSE force and RMSE energy errors.
