Freezing ML_FF select calculations


#1 Post by **julien_steffen** » Wed Sep 20, 2023 2:44 pm

We wanted to refit ML-FFs for ternary liquid metal systems (GaOPt), where our generated ML-FFs have stability problems at the surface (atoms leave the surface into the gas phase, leading to unphysical behavior).
Our goal was thus to increase the maximum number of basis functions per atom type (ML_MB) from 3500, the value used during learning, to 8000, since the ML_AB file already contains 8862 reference structures in total, far more than the preset ML_MB maximum.

For this refit, we tried the ML_MODE = select option (see the INCAR file for details) with VASP version 6.4.1. Unfortunately, the calculations were unable to finish on our compute clusters.
They got through the initialization phase (which rules out insufficient memory as the cause) but froze quite arbitrarily after processing different numbers of reference configurations (between 2 and 2000) each time the calculation was restarted.
When the calculations stopped, VASP did not exit but remained in a zombie-like state, still active in the cluster queue. This behavior reminded me of an issue I reported last year concerning ML-FF calculations with atoms that have no neighboring atoms inside the cutoff radius, which was resolved in the next VASP version. I therefore suspect a bug in the selection mechanism and am reporting it as such.
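A job that hangs without exiting can be spotted by checking whether its output has stopped growing. The following is a minimal sketch of such a check, not part of VASP: the function names are hypothetical, and it assumes that a file such as ML_LOGFILE is rewritten regularly while the selection is making progress (output buffering may delay this in practice).

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(path, max_idle_seconds, now=None):
    """Return True if the file has not been modified for longer than the limit.

    `max_idle_seconds` is a user-chosen threshold; it should be larger than
    the longest time a single healthy step is expected to take.
    """
    return seconds_since_update(path, now) > max_idle_seconds
```

A job script could call `is_stalled("ML_LOGFILE", 3600)` periodically and cancel the job when it returns `True`, instead of leaving the frozen run in the queue.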

I have attached the input and output files of a calculation that froze after only 2 processed configurations (since the ML_AB file is larger than 250 MB, I attached only the first 1% of the file).

Thank you for your help!

#2 Post by **ferenc_karsai** » Thu Sep 21, 2023 7:10 am

We would need all the input files, especially the ML_AB file, to reproduce the calculations.
Can you please make them available?

#3 Post by **julien_steffen** » Thu Sep 21, 2023 9:18 am

I have now uploaded all input files and the important output files of the calculation to my university's cloud storage:

https://faubox.rrze.uni-erlangen.de/get ... ANkUyW3Fc/

#4 Post by **ferenc_karsai** » Thu Sep 21, 2023 9:53 am

Thanks, I got the calculation. It's really huge in terms of memory: in your case it needs 20-21 GB of memory per core. So on the 42 ranks that you ran on (I saw that in the OUTCAR file), you need around 850-900 GB of memory in total. Do you have that available?
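The total above is the per-rank requirement multiplied by the number of MPI ranks. A minimal sketch of that arithmetic, using the numbers from this post (the function name is hypothetical, and the result is a lower bound that ignores any overhead outside the per-rank estimate):

```python
def total_memory_gb(per_rank_gb, n_ranks):
    """Total memory needed when every rank allocates `per_rank_gb`."""
    return per_rank_gb * n_ranks


# 20-21 GB per rank on 42 ranks, as quoted in the post.
low = total_memory_gb(20.0, 42)
high = total_memory_gb(21.0, 42)
print(f"Estimated total: {low:.0f}-{high:.0f} GB")
```

The product comes out at 840-882 GB, consistent with the rounded figure of around 850-900 GB.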

#5 Post by **julien_steffen** » Fri Sep 22, 2023 12:06 pm

Yes, I must admit it is quite huge; the calculation did indeed lead to out-of-memory issues on other clusters.
However, the output I sent you (and the other runs that got stuck and froze) came from a calculation on a cluster with 1536 GB of memory in total (the largest we have), and no memory issues were reported there, so I assume that insufficient memory should not be the problem.

#6 Post by **ferenc_karsai** » Fri Sep 22, 2023 6:40 pm

Yes, that should be enough. I just asked because I have sometimes seen weird behaviour where a calculation that runs out of memory during allocation gets stuck instead of aborting. There is also lazy allocation, which can lead to out-of-memory issues later on (this can be very annoying).

I will try to run the calculation, but I fear that our best machine has slightly too little memory.
In that case I will try with a smaller ML_MB. Let's hope the problem is reproducible with different parameters.

#7 Post by **ferenc_karsai** » Fri Oct 06, 2023 7:01 pm

Unfortunately, I have tried the calculation with ML_MB=3000 and it runs fine. I have also tried ML_MB=5000 and 8000, but I run out of memory for those calculations.

Do you also get the hang with ML_MB=3000? It would be really important to reduce the problem to something that runs on our machines, otherwise debugging is impossible.

Please also try with a clean INCAR file like this:
#### INCAR begin
SYSTEM = GaAgPt_test

ML_MB = 3000

ML_LMLFF = .TRUE. #Activation of ML

ML_ICRITERIA = 1
ML_MODE = select

ML_CDOUB=4.0
#### INCAR end

Don't set ML_MCONF, because it will then be set automatically to the smallest value required, which minimizes the memory needed.
Also, setting ML_MCONF_NEW to 12-16 will increase the required memory substantially. Compare the following two lines in the ML_LOGFILE:
Maximum number of local reference configurations of the ML force field : 3000 (I) ML_MB
Maximum number of local reference configurations in memory (max. buffer size before sparsification) : 4350

The larger number will be used for the allocation of the design matrix (in this case 4350*number of species in one dimension).
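The comparison of the two log lines can be scripted. The sketch below is only an illustration: it assumes the ML_LOGFILE wording is exactly as in the two lines quoted above, the function names are hypothetical, and the species count of 3 is an assumption based on the ternary system discussed in this thread.

```python
import re

# The two ML_LOGFILE lines quoted in the post, used here as sample input.
SAMPLE = """\
Maximum number of local reference configurations of the ML force field : 3000 (I) ML_MB
Maximum number of local reference configurations in memory (max. buffer size before sparsification) : 4350
"""


def read_limits(log_text):
    """Return (ML_MB, in-memory buffer size) parsed from the log text."""
    ml_mb = None
    buffer_size = None
    for line in log_text.splitlines():
        match = re.search(r":\s*(\d+)", line)
        if match is None:
            continue
        if "of the ML force field" in line:
            ml_mb = int(match.group(1))
        elif "in memory" in line:
            buffer_size = int(match.group(1))
    return ml_mb, buffer_size


def design_matrix_dim(buffer_size, n_species):
    """One dimension of the design matrix: buffer size times species count."""
    return buffer_size * n_species


ml_mb, buffer_size = read_limits(SAMPLE)
print(ml_mb, buffer_size, design_matrix_dim(buffer_size, 3))
```

With these numbers the allocation is governed by 4350 rather than 3000, giving 4350*3 = 13050 along that dimension of the design matrix.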

#8 Post by **julien_steffen** » Thu Oct 19, 2023 9:09 am

OK, thank you for the suggestions! I will now start the calculations with the proposed settings and keep you updated on whether the freezing still occurs.

#9 Post by **julien_steffen** » Wed Oct 25, 2023 2:41 pm

I have now tried the calculation with your settings several times, and the freezing still occurs after 1000-3000 configurations.
Input and output files are attached to this post; the ML_AB file is the same as in the initial post.

#10 Post by **ferenc_karsai** » Wed Nov 22, 2023 8:52 am

So I've tried the calculation again.

It runs through without problems.

One thing to note: I monitored the memory usage with Grafana.
I've attached the graph of the memory usage for this job.

As you can see, the memory usage is around 576 GB most of the time. The predicted value from the ML_LOGFILE is 9977.9 MB per core. On the 64 cores that I'm running on, this adds up to roughly 640 GB. So, as expected, the predicted value is slightly higher than the actual value.
But there are peaks that appear every few hours. During these peaks the memory usage can go up to 741 GB for a short time. They come from an all-to-one reduction of the covariance matrix and descriptors using scaLAPACK. This is currently not tracked in the memory estimate, since it is additional memory that is allocated on only one core (so it would not fit the per-core memory definition). Usually this overhead is not that large (it's within the difference between real and predicted memory usage), but here it is definitely noticeable.

I'm writing this so that when you run the calculation, you make sure you don't just barely have enough memory but leave sizeable headroom for potential spikes. So please try to make the calculation much smaller so that memory issues can be excluded.
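The headroom argument can be made concrete with the numbers from this post. This is a rough sketch, not a VASP feature: the function name is hypothetical, MB are converted to GB with a factor of 1000 to match the "roughly 640 GB" figure, and the 1536 GB of available memory is an illustrative value taken from the largest cluster mentioned earlier in the thread.

```python
def memory_headroom(predicted_mb_per_core, n_cores, observed_peak_gb, available_gb):
    """Compare the predicted total with the observed peak and the machine size.

    Returns (predicted_total_gb, untracked_gb, headroom_gb), where
    untracked_gb is the part of the peak not covered by the prediction.
    """
    predicted_total_gb = predicted_mb_per_core * n_cores / 1000.0
    untracked_gb = max(0.0, observed_peak_gb - predicted_total_gb)
    headroom_gb = available_gb - observed_peak_gb
    return predicted_total_gb, untracked_gb, headroom_gb


# 9977.9 MB per core on 64 cores, with an observed peak of 741 GB.
predicted, untracked, headroom = memory_headroom(9977.9, 64, 741.0, 1536.0)
print(f"predicted {predicted:.1f} GB, untracked spike {untracked:.1f} GB, "
      f"headroom {headroom:.1f} GB")
```

Here the spikes exceed the predicted total by roughly 100 GB, so a machine sized only to the ML_LOGFILE prediction would not have been enough.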

Grafana_ML_job.png
