Using metadynamics to train a machine-learned force field
It can be tricky to model transition states using static methods methods. Sometimes, this is insufficient and more time-consuming dynamic methods must be used. By using advance MD methods in combination with machine-learned force fields, the cost can be significantly reduced. Metadynamics is one such method, applying a biased potential on selected geometric parameters, to describe rare events. This tutorial is based on an older metadynamics how-to. The POSCAR, POTCAR, KPOINTS, ICONST, and PENALTYPOT files can be found there (or downloaded in a zip file at the end of this tutorial).
Input
INCAR.mlff
PREC=Normal EDIFF=1e-6 LWAVE=.FALSE. LCHARG=.FALSE. NELECT=22 NELMIN=4 LREAL=.FALSE. ALGO=VeryFast ISMEAR=0 SIGMA=0.05 POMASS = 12.011 4.0 35.453 ############################# MD setting ##################################### IBRION=0 # MD simulation NSW=40000 # number of steps POTIM=1 # integration step TEBEG=600 # simulation temperature MDALGO=1 # metaDynamics with Andersen thermostat ANDERSEN_PROB=0.05 # collision probability HILLS_BIN=50 # update the time-dependent bias # potential every 50 steps HILLS_H=0.0005 # height of the Gaussian HILLS_W=0.05 # width of the Gaussian ############################################################################## RANDOM_SEED = 311137787 0 0 # MLFF tags ML_LMLFF = .TRUE. ML_MODE = TRAIN ISIF=2
INCAR.refit
# MLFF tags ML_LMLFF = .TRUE. ML_MODE = REFIT ISIF=2
INCAR.md
POMASS = 12.011 4.0 35.453 ############################# MD setting ##################################### IBRION=0 # MD simulation NSW=100000 # number of steps POTIM=1 # integration step TEBEG=600 # simulation temperature MDALGO=1 # metaDynamics with Andersen thermostat ANDERSEN_PROB=0.05 # collision probability HILLS_BIN=50 # update the time-dependent bias # potential every 50 steps HILLS_H=0.005 # height of the Gaussian HILLS_W=0.05 # width of the Gaussian ############################################################################## RANDOM_SEED = 311137787 0 0 # MLFF tags ML_LMLFF = .TRUE. ML_MODE = RUN ISIF=2 ML_ESTBLOCK = 20 ML_OUTBLOCK = 20
INCAR.interactive
PREC=Normal EDIFF=1e-6 LWAVE=.FALSE. LCHARG=.FALSE. NELECT=22 NELMIN=4 LREAL=.FALSE. ALGO=VeryFast ISMEAR=0 SIGMA=0.05 POMASS = 12.011 4.0 35.453 ############################# MD setting ##################################### IBRION=0 # MD simulation NSW=185 # number of steps POTIM=1 # integration step TEBEG=600 # simulation temperature MDALGO=1 # metaDynamics with Andersen thermostat ANDERSEN_PROB=0.05 # collision probability HILLS_BIN=50 # update the time-dependent bias # potential every 50 steps HILLS_H=0.0005 # height of the Gaussian HILLS_W=0.05 # width of the Gaussian ############################################################################## RANDOM_SEED = 311137787 0 0 # MLFF tags ML_LMLFF = .TRUE. ML_MODE = TRAIN ML_ICRITERIA = 0 ML_NMDINT = 1 ML_MCONF = 2000 ISIF = 2 INTERACTIVE = .TRUE.
Step-by-step instructions
Step 0: Training the MLFF (optional)
Required files: POSCAR, POTCAR, INCAR.mlff, KPOINTS, ICONST, PENALTYPOT
The first step is to generate the machine-learned force fields (MLFFs). Run enough steps to have explored a large portion of the configuration space of the reaction. This step will not sample many points around the transition state, which you can account for later. Use the INCAR.mlff file INCAR file for this step. We have included an ML_AB file in the zip file below, so you can skip this step.
| Important: Make sure that there is the ICONST and PENALTYPOT files or the MD will not be metadynamics. |
cd e00_training vasp_std
Step 1: Refitting the MLFF
Required files: POSCAR, POTCAR, INCAR.refit, KPOINTS
Copy the ML_ABN from step 0 to ML_AB (or from the zip file) and refit. Use the INCAR.refit file for this.
cd e01_refit cp ../e00_training/ML_ABN ML_AB vasp_std
Step 2: Running the MD simulation
Required files: POSCAR, POTCAR, INCAR.md, KPOINTS, ICONST, PENALTYPOT
First, link to the MLFF that you have just generated: ln -s ../e01_refit/ML_FFN ML_FF. Then, run the molecular dynamics (MD) simulation using INCAR.md. Make sure that the PENALTYPOT file is present, as this is vital for containing the molecule and modeling the reaction using metadynamics. You will track the two C-Cl distances as collective variables using the ICONST file.
cd e02_MD ln -s ../e01_refit/ML_FFN ML_FF vasp_std
MD analysis
Once the metadynamics MD simulation is completed, plot the time evolution of the collective variables (cf. ICONST) using the timeEv.sh script and gnuplot:
bash ./timeEv.sh gnuplot -e "set terminal jpeg; set xlabel 'timestep'; set ylabel 'Collective variable (Ang)'; set style data lines; plot 'timeEvol1.dat', 'timeEvol2.dat'" > timeEvol.jpg
The obtained time evolution of the collective variables looks like the following:

ML_OUTBLOCK = 20 so there are 20 structures between each point), the two collective variables switch as the transition state is crossed and the system inverts.The Cl- ion is initially more distant from the C atom. After a few thousand steps, the chloromethane molecule inverts as the chloride attacks and the opposing chlorine atom is expelled as a chloride. After a thousand more steps, the MLFF breaks down, and a non-physical structure is formed at which the MD simulation gets stuck.
| Important: If the simulation does not fail, repeat the calculation, try increasing the number of ionic steps (NSW), until you run a simulation where the MLFF breaks down. Alternatively, decrease the number of NSW used for training the MLFF in the first place. This will train an MLFF with fewer structures, i.e., a worse force field that will break down sooner. We only recommend this in this exercise so that you can learn how to fix an MLFF. |
Spilling factor
This is when the simulation leaves the configuration space of the MLFF training set. You can see this by checking the spilling factor in the ML_LOGFILE:
grep SFF ML_LOGFILE
# SFF ######################################################################################################## # SFF This line shows the spilling factor, # SFF # SFF nstep ............ MD time step or input structure counter # SFF sfmax ............ Maximal spilling factor among atoms # SFF sfmin ............ Minimal spilling factor among atoms # SFF sfmean ........... Mean value of spilling factor # SFF sfvar ............ Variance of spilling factor # SFF threshold ........ Value of threshold criterion for learning # SFF ######################################################################################################## # SFF nstep sfmax sfmin sfmean sfvar threshold # SFF 2 3 4 # SFF ######################################################################################################## SFF 100 9.29675803E-06 2.13512863E-07 2.26596576E-06 1.01225022E-11 2.00000000E-03 SFF 200 1.11874507E-06 4.26678995E-07 7.93844208E-07 5.02375989E-14 2.00000000E-03 SFF 300 1.69448509E-06 2.46957156E-07 9.74621854E-07 2.15210308E-13 2.00000000E-03 ... SFF 99960 9.99309893E-01 3.36556878E-03 3.40857829E-01 2.16239137E-01 2.00000000E-03 SFF 99980 9.99464421E-01 5.99966976E-03 3.44462977E-01 2.14118191E-01 2.00000000E-03 SFF 100000 9.99581963E-01 5.29236322E-03 3.41837281E-01 2.15977938E-01 2.00000000E-03
We have set ML_OUTBLOCK = 20 so that only every 20th structure is included. Notice how sfmax goes to 1. This indicates that it is completely outside of the training set. You can plot this to visualize what happens and compare to the plot above. First, grep for the spilling factor SFF in the ML_LOGFILE:
grep '^SFF[[:space:]]*[0-9]' ML_LOGFILE | awk '{print $2 " " $3}' > sff.dat
Then plot sff.dat using Python:
import py4vasp
import numpy as np
step, spilling_factor = np.loadtxt("./e02_MD/sff.dat", unpack=True)
py4vasp.plot(step, spilling_factor, xlabel="MD step", ylabel="Spilling factor", label="Spilling factor")

Extracting new training structures
The structures after the spilling factor reaches 1 are not meaningful. However, those immediately before are very useful. These are where the MLFF begins to break down. You can extract these structures using the sff_grep.py script:
./sff_grep.py ML_LOGFILE XDATCAR 5E-4 0.8
This will extract from XDATCAR all of the structures in the configurations ML_LOGFILE that have a spilling factor between 5E-4 and 0.8 and output them to a POSCAR.interactive file. POSCAR.interactive can then be used in the next step to continue training the MLFF.
Important: POSCAR.interactive is a file containing the position of the ions for each configuration in direct coordinates. The POSCAR must still be included.
|
Step 4: Retrain the MLFF
Required files: POSCAR, POTCAR, INCAR.interactive, KPOINTS, ICONST, PENALTYPOT from old HILLSPOT
Take the POSCAR.interactive that you have generated in the preceding step, take the ML_AB file that you have already trained in step 0 (or refitted in step 1), along with the PENALTYPOT from step 2, and the INCAR.interactive file - making sure to update the NSW to the number of structures in POSCAR.interactive. Then run a calculation with the same POSCAR, POTCAR, and KPOINTS files from step 1, ensuring to input the POSCAR.interactive file into the VASP executable, enabled by setting INTERACTIVE = .TRUE., e.g.:
cd e03_interactive cp ../e00_training/ML_ABN ML_AB cp ../e00_training/HILLSPOT PENALTYPOT vasp_std < POSCAR.interactive
Important: Ensure that you update NSW in the INCAR to be equal to the number of configurations in your POSCAR.interactive file.
|
Step 5: Refit the MLFF
Required files: POSCAR, POTCAR, INCAR.refit, KPOINTS
Refit the MLFF as before, copying the ML_ABN to ML_AB and refitting with INCAR.refit.
cd e04_new_refit cp ../e03_interactive/ML_ABN ML_AB vasp_std
Step 6: Repeat the MD simulation
Required files: POSCAR, POTCAR, INCAR.md, KPOINTS, ICONST, PENALTYPOT from old HILLSPOT
Copy your old biased potential (HILLSPOT file) from your first MD run to the PENALTYPOT in your new directory. This saves you a lot of time by enabling you to begin from where the previous calculation failed. Then, run the calculation as normal:
cd e05_new_MD cp ../e02_MD/HILLSPOT PENALTYPOT ln -s ../e04_new_refit/ML_FFN ML_FF vasp_std
MD analysis
Once the calculation is finished, take a look at the time evolution of the collective variables using timeEv.sh script and gnuplot:
bash ./timeEv.sh gnuplot -e "set terminal jpeg; set xlabel 'timestep'; set ylabel 'Collective variable (Ang)'; set style data lines; plot 'timeEvol1.dat', 'timeEvol2.dat'" > timeEvol.jpg
The obtained time evolution of the collective variables should look like the following:

You can clearly see how often the green and purple lines switch, i.e., the inversion reaction occurs. This indicates that we are accurately modeling the transition state.
Spilling factor
You can confirm the stability of your MLFF by taking a look at the spilling factor:
grep '^SFF[[:space:]]*[0-9]' ML_LOGFILE | awk '{print $2 " " $3}' > sff.dat
Then plot sff.dat using Python:
import py4vasp
import numpy as np
step, spilling_factor = np.loadtxt("./e05_new_MD/sff.dat", unpack=True)
py4vasp.plot(step, spilling_factor, xlabel="MD step", ylabel="Spilling factor", label="Spilling factor")

Plotting the biased potential
As a final step, you can visualize the biased potential that you have generated. First, take the HILLSPOT file and, using the gaussians.py script, generate a file containing the sum of the gaussians used for the biased potential gaussian_sum_output.txt, removing the first 9 lines that contain the initial biased potential that restrains the reaction in place:
./gaussians.py HILLSPOT 9
Then, plot the biased potential with the contourplot.py script:
./contourplot.py
This will generate the following interactive plot:

You can set a cutoff for the z-axis, e.g., 0.05, to get a better look at the transition state region:
./contourplot.py 0.05
See how much more densely populated the transition state region is than from the previous MD:

We generated these scripts using an AI, so we recommend trying to do this yourself. Generating these scripts took around 5 minutes with only a few prompts.
Practical hints
- Make sure to set the mass of hydrogen to above 1, e.g.,
POMASS = 12.011 4.0 35.453, rather thanPOMASS = 12.011 1.0 35.453. This enables larger time steps for the MD simulation. - We found that reducing the ANDERSEN_PROB to 0.05 from 0.1 (see previous exercise).
- It is crucial to reduce the hill height (HILLS_H) by an order of magnitude or more (e.g., 0.0005 instead of 0.005). This avoids being driven out of the reliably fitted energy surface. It keeps the MLFF in the trained region for longer, stabilising the spilling factor.
- For long MLFF runs, it may be worth increasing HILLS_BIN to 100 in order to stay within the trained region for longer. This is only important for longer metadynamics simulations. You should also increase
ML_ESTBLOCK = 100andML_OUTBLOCK = 100in this case. - Once you have a stable ML_FF, to improve the stability for longer MLFF runs, you should collect the positions for structures that are outside of the reliably fitted energy surface. These are those that have larger spilling factors. While continuing training (see step 4 above),
ML_NMDINT = 1should be used and the Bayesian threshold set to the final value from the previous training runsML_ICRITERIA = 0.