
ML_ISTART = 1 doesn't work with different element types - v6.4.1

Posted: Wed Jun 28, 2023 2:26 pm
by john_martirez1
The ML_AB file contains H and O, and I'm training a system with H, O and C. The run just terminates early; not even an SCF step gets done.
It works fine with ML_ISTART = 0.

Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1

Posted: Wed Jun 28, 2023 2:35 pm
by ferenc_karsai
Please send all files necessary to run and check the calculation.
This means POSCAR, POTCAR, KPOINTS, INCAR, OUTCAR, ML_AB, ML_LOGFILE and stdout.

Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1

Posted: Wed Jun 28, 2023 2:53 pm
by john_martirez1
Thanks for the quick reply. See the attached files.

Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1

Posted: Wed Jun 28, 2023 3:29 pm
by john_martirez1
I found the reason: there is a significant jump in memory requirements from ML_ISTART = 0 to ML_ISTART = 1.
I increased the memory per CPU to 9200 MB, and then it worked.
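
For reference, on a SLURM-managed cluster (an assumption for illustration; the scheduler is not stated in this thread) that setting would look roughly like this in the job script:

#!/bin/bash
#SBATCH --ntasks=64            # hypothetical core count, just for illustration
#SBATCH --mem-per-cpu=9200M    # the per-CPU memory that let ML_ISTART = 1 run

srun vasp_std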

Could the memory allocation for ML_ISTART = 1 be improved in the future, or am I missing something?

Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1

Posted: Mon Jul 03, 2023 9:25 am
by ferenc_karsai
It's hard to do anything about the memory allocation.
Here are some explanations:
At the moment we have to statically allocate the memory at the beginning; this is mainly due to the use of shared-memory MPI. We have seen several times that one gets problems if shared-memory MPI needs to be reallocated. I don't know if this problem will ever be solved for all compilers.

So how can the memory grow so much in your case?
1) New element types entered the calculation. We use multidimensional allocatable arrays in Fortran, so the local-reference dimension is allocated with the same maximum for all element types. Ideally one wants to have the same number of local reference configurations for all element types. Of course, this is often hard to achieve for dopants, where we are limited to a few atoms as local reference candidates from the training structures. In this case we waste some memory, and your case might belong to that (see the sketch after this list).
2) It's a continuation run, and if you don't specify anything, then on top of the already available data min(1500, NSW) is added. Please see the documentation of ML_MB and ML_MCONF (https://www.vasp.at/wiki/index.php/ML_MB and https://www.vasp.at/wiki/index.php/ML_MCONF). This default has worked quite nicely until now, but if it turns out to be problematic for the majority of users, we will change it.
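
To illustrate point 1), here is a minimal, self-contained Fortran sketch; the array names and sizes are made up for illustration and are not VASP's actual data structures. One rectangular array is allocated with the maximum number of local reference configurations over all element types, so a type with only a few candidates (e.g. a dopant) still reserves the full maximum:

program local_ref_sketch
  implicit none
  ! Hypothetical numbers: three element types with very different numbers of
  ! collected local reference configurations (e.g. many for H and O, few for C).
  integer, parameter   :: ntypes = 3
  integer              :: nref(ntypes) = [1500, 1500, 40]
  integer              :: ndesc, nmax
  real(8), allocatable :: local_ref(:,:,:)

  ndesc = 100                               ! made-up length of one descriptor vector
  nmax  = maxval(nref)                      ! one maximum is used for ALL element types ...
  allocate(local_ref(ndesc, nmax, ntypes))  ! ... so the array is rectangular

  print '(a,f8.1,a)', 'allocated:     ', 8.0d0*ndesc*nmax*ntypes/2.0d0**20, ' MiB'
  print '(a,f8.1,a)', 'actually used: ', 8.0d0*ndesc*sum(nref)/2.0d0**20, ' MiB'
end program local_ref_sketch

The real arrays are of course much larger, but the wasted fraction behaves the same way when one element type has far fewer local reference candidates than the others.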

What can you do?
1) Check whether you compiled with shared-memory MPI ("-Duse_shmem").
2) Adjust ML_MB and ML_MCONF (see the example below).
3) Go to a larger number of compute nodes, since the design matrix, which needs the most memory, is distributed linearly over the number of cores.
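
For point 1), "-Duse_shmem" is a precompiler flag, so it should appear among the precompiler options (usually CPP_OPTIONS) of the makefile.include you built with.

For point 2), a minimal sketch of what the machine-learning part of such a continuation INCAR could look like; the numerical values are placeholders, not recommendations, and have to be chosen to fit your ML_AB and the memory available per core:

ML_LMLFF  = .TRUE.   ! machine-learned force field on
ML_ISTART = 1        ! continuation run from the existing ML_AB
ML_MB     = 2000     ! upper bound on local reference configurations (placeholder)
ML_MCONF  = 1200     ! upper bound on training structures (placeholder)

Lowering these bounds reduces the statically allocated arrays; if they still do not fit, point 3) is the remaining option.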