Hello Nuwan!
Thank you for reporting this issue! The problem you described hints at a bug we discovered a while ago (unfortunately shortly after the VASP 6.5.1 release). The force field prediction did not work correctly (and could segfault) if an individual MPI rank did not own any local atoms at a given time step (in LAMMPS language these are the atoms which an MPI rank treats as the "central" atoms, i.e., for which the energy contribution \(E_i\) and all corresponding force contributions are computed). This usually does not happen in a bulk simulation unless the number of MPI ranks is close to or even higher than the total number of atoms in the system. Other possible cases are slabs or systems with a gas phase where no load balancing is applied and the domain decomposition scheme hence assigns vacuum regions to some MPI ranks. Maybe you had similar conditions during your simulation? The good news is that we already fixed this bug in our current development branch, so the bugfix will be part of the upcoming release. I am glad you found a workaround in the meantime and I am sorry for the inconvenience this error caused!
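Just to illustrate the slab/vacuum case with a toy example (this is only my own back-of-the-envelope sketch, not anything from VASPml or LAMMPS): with a uniform spatial decomposition and no load balancing, the slices of the box that contain only vacuum end up as MPI ranks without any local atoms.

```python
# Toy illustration (hypothetical, not part of VASPml or LAMMPS): count how many
# equal z-slices of a simulation box contain no atoms when the atoms only fill
# a slab at the bottom of the box and no load balancing is applied.
def empty_ranks_slab(box_z, slab_z, nz_ranks):
    slice_height = box_z / nz_ranks
    occupied = sum(1 for i in range(nz_ranks) if i * slice_height < slab_z)
    return nz_ranks - occupied

# Example: a 10 A thick slab in a 40 A box cut into 8 z-slices leaves
# 6 ranks with zero local atoms.
print(empty_ranks_slab(box_z=40.0, slab_z=10.0, nz_ranks=8))  # 6
```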
To be really sure that our fix already covers your issue, it would be very helpful if you could provide a minimal reproducer containing all relevant input files needed to run the simulation (LAMMPS input script, ML_FF, start configuration, submit script, ...). Please also add relevant output files like the LAMMPS log file.
Some additional comments to answer your specific questions: When coupling with LAMMPS, the VASPml library does not distribute the atoms across MPI ranks itself but instead relies on the domain decomposition scheme LAMMPS applies. Hence, it is out of the library's control how many atoms each MPI rank receives for the evaluation of the force field. Nevertheless, the library should of course deal with any number, including zero. Theoretically, VASPml should be able to handle any number of MPI ranks; however, we highly recommend benchmarking the parallel efficiency before starting any production runs. The number of cores that can be used effectively in an MD simulation with a machine-learned force field is typically much lower than the number of cores used for the corresponding ab initio calculation (with a comparable number of atoms). Typically, we expect decent parallel efficiency[1] as long as there are around 100-1000 atoms per MPI rank. The actual numbers vary a lot between systems and force field settings (e.g., cutoff radius, number of radial basis functions, etc.) and hence should be tested individually for each application case.
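If it helps as a starting point for choosing the core count, here is that rule of thumb written out as a tiny sketch (my own illustration with a made-up example system, not a VASPml utility):

```python
# Rough starting point for the number of MPI ranks based on the
# 100-1000 atoms-per-rank rule of thumb mentioned above (hypothetical helper).
def suggested_rank_range(n_atoms, atoms_per_rank=(100, 1000)):
    lo, hi = atoms_per_rank
    return max(1, n_atoms // hi), max(1, n_atoms // lo)

# Example: a 5000-atom system -> try roughly 5 to 50 MPI ranks and then
# confirm the choice with an actual parallel-efficiency benchmark.
print(suggested_rank_range(5000))  # (5, 50)
```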
[1] Parallel efficiency is often measured by dividing the actual speed-up with respect to serial execution by the number of cores used (the naively expected maximum speed-up) and expressing the result as a percentage. For example, if your code runs 26 times faster on 32 cores than on a single core, the parallel efficiency is roughly 81%. It is important to keep an eye on the parallel efficiency because low numbers mean increased (energy) costs for practically identical (or even longer) simulation runtimes.
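In case it is useful, the same bookkeeping as a tiny script (the timings are made up):

```python
# Parallel efficiency = speed-up relative to serial execution divided by the
# number of cores, expressed as a percentage (example numbers are made up).
def parallel_efficiency(t_serial, t_parallel, n_cores):
    speedup = t_serial / t_parallel
    return 100.0 * speedup / n_cores

# Example from the footnote: 26x faster on 32 cores -> about 81 %.
print(parallel_efficiency(t_serial=26.0, t_parallel=1.0, n_cores=32))  # 81.25
```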
All the best,
Andreas Singraber