Queries about input and output files, running specific calculations, etc.
-
yingma
- Newbie

- Posts: 8
- Joined: Sat Nov 16, 2019 8:10 pm
#1
Post
by yingma » Wed Oct 08, 2025 7:09 pm
Hello all,
I'm getting an error when running an MLFF refit after a successful training run. The error message is listed below, and all input files (except ML_AB) are attached. It looks like it is related to memory allocation, but I'm not sure what exactly went wrong. Interestingly, of about 10 similar calculations I performed (e.g., the same input except at a different temperature), 7 had the same error while the other 3 completed without issue. If I switch from refit to select, there is no error at all.
Any ideas on how to fix the problem are greatly appreciated.
[lm02:183001:0:183001] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffff57644b090)
[lm02:183009:0:183009] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffff57644b090)
==== backtrace (tid: 182955) ====
0 0x0000000000055799 ucs_debug_print_backtrace() ???:0
1 0x0000000000012c20 .annobin_sigaction.c() sigaction.c:0
2 0x0000000001a15dc1 __intel_avx_rep_memset() ???:0
3 0x0000000001968fac for_array_initialize() ???:0
4 0x000000000196933a for_alloc_and_init() ???:0
5 0x0000000000588d54 regression_mp_blea_mb_() ???:0
6 0x000000000065a345 force_field_mp_gen_ff_mb_() ???:0
7 0x0000000000659e01 force_field_mp_gen_ff_() ???:0
8 0x00000000006b7c30 ml_main_subroutines_mp_istart_0_1_2_4_driver_() ???:0
9 0x00000000006ad0ea ml_main_subroutines_mp_machine_learning_init_lib_() ???:0
10 0x0000000000e6f830 ml_interface_mp_machine_learning_init_() ???:0
11 0x00000000018f1390 MAIN__() ???:0
12 0x000000000040c2fd main() ???:0
13 0x0000000000023493 __libc_start_main() ???:0
14 0x000000000040c21e _start() ???:0
-
martin.schlipf
- Global Moderator

- Posts: 639
- Joined: Fri Nov 08, 2019 7:18 am
#2
Post
by martin.schlipf » Thu Oct 09, 2025 6:57 am
Thank you for this report. I will discuss this with our machine-learning experts and get back to you.
One thing I notice at a glance is that you need 3 GB/core (see the MEMORY INFORMATION section of the ML_LOGFILE). If you run on 64 cores, you need 192 GB of memory on the node. Does your node have that much memory?
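As a quick sanity check, the estimate above can be reproduced in a shell; the 3 GB/core and 64-core figures are the ones quoted in this thread, not general values:

```shell
# Per-node requirement = per-core estimate (from ML_LOGFILE) x MPI ranks per node.
GB_PER_CORE=3
CORES=64
echo "Required: $((GB_PER_CORE * CORES)) GB per node"   # prints "Required: 192 GB per node"
# Compare against the node's physical memory (Linux):
grep MemTotal /proc/meminfo
```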
Martin Schlipf
VASP developer
-
martin.schlipf
- Global Moderator

- Posts: 639
- Joined: Fri Nov 08, 2019 7:18 am
#3
Post
by martin.schlipf » Thu Oct 09, 2025 7:19 am
I received some feedback from our ML experts. They recommend that you trim down your INCAR file and use only the minimal tags needed for refitting.
By setting ML_MB=4000 and ML_MCONF=3000 you request that the code allocate memory for that many basis functions and configurations, even though you only have 1874 training structures. By not setting them in the INCAR file, you give VASP the freedom to choose the minimal required values. While the rest of the INCAR tags should not directly affect the refitting of the force field, we still recommend not setting them in the INCAR file. VASP will try to set reasonable defaults, e.g., it will not allocate some arrays if you only want to use the MLFF code branch. User choices, however, overwrite these defaults, which may increase the memory footprint.
If the INCAR file with only the two tags above does not solve your issue, then please also provide the ML_AB file so that we can run the calculation on our machines.
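The post does not show the trimmed-down INCAR itself; as a sketch, a minimal refit input would usually contain only the tags that switch the MLFF machinery on and select the mode (treat the exact pair below as an assumption based on the discussion above):

```
ML_LMLFF = .TRUE.   ! switch on the machine-learned force-field code
ML_MODE  = refit    ! refit the force field from the existing ML_AB
```

With everything else left at its defaults, VASP should size quantities such as ML_MB and ML_MCONF from the contents of ML_AB.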
Martin Schlipf
VASP developer
-
yingma
- Newbie

- Posts: 8
- Joined: Sat Nov 16, 2019 8:10 pm
#4
Post
by yingma » Thu Oct 09, 2025 1:52 pm
Thank you for the response.
I removed ML_MCONF and ML_MB and the same error persisted. My system does have sufficient memory (~1 TB); I had intentionally used larger ML_MCONF and ML_MB values to see whether the error would go away. Also, I'm using Intel oneAPI (compiler, MKL, and MPI), and both the 2021.3.0 and 2025.2 versions failed with the same error. I'm running on an AMD EPYC 7452.
Link to the ML_AB file below. Thank you!
https://universityofwieauclaire-my.shar ... A?e=Wiprhf
-
martin.schlipf
- Global Moderator

- Posts: 639
- Joined: Fri Nov 08, 2019 7:18 am
#5
Post
by martin.schlipf » Fri Oct 10, 2025 7:28 am
I tried to get as close as possible to your setup and ran with the Intel 2025.2 compiler and Intel MPI 2021.16 on an EPYC 7713 with 64 cores. I used an INCAR file that contains only the tags recommended above and nothing else. That did not show any issue.
Please make sure you try exactly this setup and check whether the error persists. If that does not work, you could try adding ML_IALGO_LINREG = 1 to this INCAR file. As another alternative, you could try setting ML_MODE = refitbayesian. That should not be used for production but can help us discover the source of the error.
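For reference, the two debugging variants suggested above could look like this in an otherwise minimal INCAR (ML_LMLFF is assumed here; only one variant should be active at a time):

```
ML_LMLFF = .TRUE.
ML_MODE  = refit
ML_IALGO_LINREG = 1        ! variant 1: alternative regression algorithm
! ML_MODE = refitbayesian  ! variant 2: use instead of the two lines above; not for production
```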
Finally, perhaps you can also send us the exact setup that you used, i.e., the version of the compiler and the makefile.include used to generate the executable.
Martin Schlipf
VASP developer
-
yingma
- Newbie

- Posts: 8
- Joined: Sat Nov 16, 2019 8:10 pm
#6
Post
by yingma » Fri Oct 10, 2025 5:05 pm
Thank you. I tried three test runs with only the two MLFF tags as recommended: a refit, a refitbayesian, and one using ML_IALGO_LINREG = 3 (ML_IALGO_LINREG = 1 does not seem to be compatible with refit). Only refitbayesian completed without error.
I used ifx 2025.2.1 and Intel MPI 2021.16.
All the input/output files, together with the makefile.include, can be downloaded using the link below.
https://universityofwieauclaire-my.shar ... w?e=J7qxhx
-
martin.schlipf
- Global Moderator

- Posts: 639
- Joined: Fri Nov 08, 2019 7:18 am
#7
Post
by martin.schlipf » Fri Oct 10, 2025 7:28 pm
Thank you for the makefile.include. It looks like you used a makefile.include from a previous VASP version and then manually adjusted it for the current version. Can you try compiling the code with a more recent one? For your setup, arch/makefile.include.oneapi should be a good starting point.
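Starting from the shipped template, the rebuild could look like the sketch below (run from the top of the VASP source tree; the directory name is a placeholder, and the library paths inside the template may still need adjusting for your MKL/MPI install):

```shell
cd vasp.6.x.y                                     # placeholder for your VASP source directory
cp arch/makefile.include.oneapi makefile.include  # use the shipped oneAPI template
make veryclean                                    # discard objects built with the old makefile
make std                                          # rebuild the standard executable
```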
Martin Schlipf
VASP developer
-
yingma
- Newbie

- Posts: 8
- Joined: Sat Nov 16, 2019 8:10 pm
#8
Post
by yingma » Sun Oct 12, 2025 12:55 am
Updating the makefile.include did resolve the problem. Thank you very much for the help!