Segmentation Fault when Using an ML_FF in LAMMPS

Posted: Tue Nov 11, 2025 12:32 am
by nuwan_dewapriya

Dear VASP Developers,

We encountered a segmentation fault when using an ML_FF in LAMMPS. After some testing, we found a possible workaround: maintaining at least ~150 atoms per rank and ensuring that the total number of ranks is a multiple of 32 seems to prevent the error.

We wanted to bring this behavior to your attention and would greatly appreciate your insights on the following questions:

- How does the VASPML library distribute data across MPI ranks?

- Why might some ranks end up with an empty list of atoms?

- Is the code resilient to cases where a rank receives an empty descriptor list?

- Is there a deterministic rule governing the minimum or maximum number of MPI ranks that the VASPML library can handle?

Thank you for your time and assistance.

Regards,
Nuwan


Re: Segmentation Fault when Using an ML_FF in LAMMPS

Posted: Tue Nov 11, 2025 9:43 am
by andreas.singraber

Hello Nuwan!

Thank you for reporting this issue! The problem you describe hints at a bug we discovered a while ago (unfortunately shortly after the VASP 6.5.1 release). The force field prediction did not work correctly (and could segfault) if an individual MPI rank at a given time step did not own any local atoms (in LAMMPS language, these are the atoms which an MPI rank treats as the "central" atoms, i.e., for which the energy contribution \(E_i\) and all corresponding force contributions are computed). This usually does not happen in a bulk simulation, unless the number of MPI ranks is close to, or even higher than, the total number of atoms in the system. Other possible cases would be slabs or systems with a gas phase where no load balancing is applied, and hence the domain decomposition scheme assigns vacuum regions to some MPI ranks. Perhaps you had similar conditions in your simulation? The good news is that we have already fixed this bug in our current development branch, so the bugfix will be part of the upcoming release. I am glad you found a workaround in the meantime, and I am sorry for the inconvenience this error caused!

To be really sure that our fix covers your issue, it would be very helpful if you could provide a minimal reproducer containing all relevant input files needed to run the simulation (LAMMPS input script, ML_FF, starting configuration, submit script, ...). Please also attach relevant output files, such as the LAMMPS log file.

Some additional comments to answer your specific questions: When coupling with LAMMPS, the VASPml library does not distribute the atoms across MPI ranks itself but instead relies on the domain decomposition scheme that LAMMPS applies. Hence, how many atoms each MPI rank receives for the evaluation of the force field is out of the library's control. Nevertheless, the library should of course handle any number, including zero. In theory, VASPml should be able to handle any number of MPI ranks; however, we highly recommend benchmarking the parallel efficiency before starting any production runs. The number of cores that can be used effectively in an MD simulation with a machine-learned force field is typically much lower than the number of cores used for running the corresponding ab initio calculation (with a comparable number of atoms). Typically, we expect decent parallel efficiency [1] if there are around 100-1000 atoms available per MPI rank. The actual numbers vary a lot between systems and force field settings (e.g., cutoff radius, number of radial basis functions, etc.) and hence should be tested for each application case individually.
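As a quick sanity check before launching a run, the atoms-per-rank rule of thumb above can be sketched in a few lines. This is a hypothetical helper, not part of VASPml or LAMMPS; the 100-atom threshold is simply the lower end of the range mentioned above:

```python
def atoms_per_rank(total_atoms: int, ranks: int) -> int:
    """Average number of local atoms each MPI rank owns."""
    return total_atoms // ranks

def check_decomposition(total_atoms: int, ranks: int, minimum: int = 100) -> bool:
    """True if the run stays at or above `minimum` atoms per rank on average."""
    return atoms_per_rank(total_atoms, ranks) >= minimum

# Example: a 9300-atom system on various rank counts.
for ranks in (36, 72, 108):
    print(ranks, atoms_per_rank(9300, ranks), check_decomposition(9300, ranks))
```

Note that this only checks the average; with vacuum regions and no load balancing, individual ranks can still end up far below the average, which is exactly the situation that triggered the bug.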

[1] Parallel efficiency is often measured by dividing the actual speed-up with respect to serial execution by the number of cores used (the naively expected maximum speed-up) and expressing the result as a percentage. For example, if your code runs 26 times faster on 32 cores than on a single core, the parallel efficiency is roughly 81%. It is important to keep an eye on the parallel efficiency, because low numbers mean increased (energy) costs with practically identical (or even higher) simulation runtimes.
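The definition in this footnote can be written out as a tiny illustrative helper (not part of any library; the timings are whatever your own benchmark produces):

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Parallel efficiency in percent: measured speed-up divided by the
    ideal speed-up, which equals the number of cores."""
    return 100.0 * speedup / cores

# The example from the footnote: 26x faster on 32 cores.
print(parallel_efficiency(26, 32))  # 81.25
```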

All the best,
Andreas Singraber


Re: Segmentation Fault when Using an ML_FF in LAMMPS

Posted: Tue Nov 11, 2025 2:54 pm
by nuwan_dewapriya

Hello Andreas,

Thank you very much for your detailed and helpful response.

As you pointed out, the issue was related to the vacuum region. We were able to eliminate the problem entirely by removing the vacuum layer and using a shrink-wrap boundary condition in LAMMPS instead.
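For readers who run into the same problem: a shrink-wrapped boundary is selected in LAMMPS via the `boundary` command. A minimal illustrative fragment, assuming the vacuum lies along the z axis, would be:

```
# Illustrative LAMMPS input fragment (assumes vacuum along z).
# 'p' = periodic, 's' = shrink-wrapped: the box face follows the
# outermost atoms, so no MPI rank is assigned a pure-vacuum subdomain.
boundary p p s
```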

I also examined the parallel efficiency (see the table below), and it appears that maintaining around 100 atoms per rank provides reasonable performance. For reference, this test was performed in LAMMPS with 9,300 atoms and 5 atom types.

 cores   atoms/core   timesteps/s   efficiency
     1         9300         0.025         100%
    36          258         0.706          78%
    72          129         1.412          78%
   108           86         1.976          73%
   144           65         2.514          70%
   180           52         3.004          67%

Thanks again for your help and clarification.

Best regards,
Nuwan


Re: Segmentation Fault when Using an ML_FF in LAMMPS

Posted: Wed Nov 12, 2025 9:36 am
by andreas.singraber

Hello Nuwan,

Thanks for the update! It now seems pretty clear that the problem was indeed caused by the bug I described. Great that you observed the expected behavior of the parallel efficiency; this is very valuable feedback, thank you!

All the best,
Andreas Singraber