Errors in ML_FF Training
Posted: Wed Oct 15, 2025 12:34 pm
by qingyu_wang
Hello everyone,
When using VASP 6.5.1, I encountered the following issue: AIMD simulations with machine learning enabled run normally when submitted with 128 cores. However, when the same simulations are submitted with 64 or 48 cores, convergence problems arise, and the jobs ultimately terminate because they fail to meet the convergence criterion within the number of electronic steps set by NELM. The result figures are attached. Furthermore, when the job is run with 32 cores, it reports a segmentation fault.
For the compilation of VASP, four settings in the makefile.include.intel file were modified: the Fortran compiler FC was changed from mpiifort to mpiifx, the linker FCL from mpiifort to mpiifx, the C compiler for the library CC_LIB from icc to mpiicx, and the C++ parser CXX_PARS from icpc to icpx.
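For concreteness, the four edited lines in makefile.include.intel then read roughly as follows (a sketch of just the changed variables; all other settings were left untouched):

  FC       = mpiifx
  FCL      = mpiifx
  CC_LIB   = mpiicx
  CXX_PARS = icpx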
The computational hardware employed is a single node equipped with two CPUs, featuring 128 cores and 256 GB of memory in total.
How can I resolve this convergence issue under the 64-core and 48-core configurations?
Best wishes!
Re: Errors in ML_FF Training
Posted: Thu Oct 16, 2025 6:13 am
by martin.schlipf
It looks strange that the density changes and is not close to an integer number. That may indicate an issue with the setup or with the determination of the Fermi occupancies. Please provide the input and output files for all the different setups so that we can inspect them. If you did not remove the CHGCAR and WAVECAR between runs, please also tell us in which order you ran the calculations.
One thing I noticed is that you started from makefile.include.intel. We provide a makefile.include.oneapi, which should be a better starting point for you.
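For reference, rebuilding from the shipped template usually looks something like this (a minimal sketch; adjust the build targets to your needs):

  cp arch/makefile.include.oneapi makefile.include
  make veryclean
  make std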
Re: Errors in ML_FF Training
Posted: Thu Oct 16, 2025 7:09 am
by qingyu_wang
Dear professor,
Thank you very much for your reply! I have attached my INCAR, POSCAR, and POTCAR; the only difference between the runs is the number of cores used when submitting the jobs.
I will also recompile, starting from the makefile.include.oneapi you recommended.
Best wishes!
Re: Errors in ML_FF Training
Posted: Thu Oct 16, 2025 9:16 am
by qingyu_wang
Dear professor,
I have compiled VASP 6.4.3 using the makefile.include.oneapi; unfortunately, the issues described above still persist. Furthermore, the same job runs successfully on another person's machine, which has VASP 6.4.3 and 6.5.0 installed.
Best wishes!
Re: Errors in ML_FF Training
Posted: Fri Oct 17, 2025 6:25 am
by martin.schlipf
I see that you are running an MD simulation to train a force field. After how many steps does this error occur? Is it right at the start, in the first few iterations, or after several hundred steps? In the latter case, the MD simulation may be somewhat unstable, and you can then run into issues depending on the random seed of the calculation. It would be good to see the OUTCAR files in that case.
If it happens early, can you tell me how many steps I need to run to observe the issue?
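As an aside, if the behaviour turns out to depend on the seed, you can make a run reproducible by fixing it in the INCAR, e.g. (sketch; the seed value itself is arbitrary):

  RANDOM_SEED = 688344966 0 0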
Re: Errors in ML_FF Training
Posted: Fri Oct 17, 2025 6:27 am
by qingyu_wang
Dear Professor,
Thank you very much for your reply. The error I encountered occurred right at the first step.
Best wishes!
Re: Errors in ML_FF Training
Posted: Fri Oct 17, 2025 1:10 pm
by martin.schlipf
I can reproduce the issue and need to investigate it further to understand where it is coming from.
Re: Errors in ML_FF Training
Posted: Mon Oct 20, 2025 9:11 pm
by martin.schlipf
Preliminary results indicate that something is wrong in constructing the Hamiltonian. If you look at the eigenvalues during the electronic SCF (set NWRITE = 3), you will see that they are not stable. Often this can be a sign of a ghost state when the PAW potential is not appropriate for the system. In those cases, you need to include more states in the valence window (Ag_pv or Ag_sv). However, in your case that does not seem to help. Also, it would be strange if that issue depended on the number of cores used.
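To check this yourself, one quick way to scan the per-iteration eigenvalues (a rough sketch; the exact OUTCAR layout can vary between versions) is:

  # in INCAR
  NWRITE = 3

  # after the run
  grep -A 8 "band No." OUTCAR | less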
Our best hypothesis so far is that there is an issue with the particular toolchain you use. We have experienced similar issues in the past when running with Intel MPI on an AMD EPYC machine. In that case, switching to an AOCL toolchain solved the issue, and that seems to work for your case as well. You may compile VASP with a different toolchain and try for yourself. I will still try to determine why Intel MPI produces this issue.
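If you want to try that route, the procedure mirrors the oneAPI case above (sketch; makefile.include.aocc_ompi_aocl is one of the AMD-oriented templates, but the file names shipped with your release may differ):

  cp arch/makefile.include.aocc_ompi_aocl makefile.include
  make veryclean
  make std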
Re: Errors in ML_FF Training
Posted: Tue Oct 21, 2025 11:30 am
by qingyu_wang
Dear Professor,
Thank you very much for your assistance. My problem has been resolved: the cause was that my CPU is an AMD one, so I needed to compile VASP with the AMD-appropriate toolchain. Thank you again for your help!
Best wishes!