Machine Learned Interatomic Potentials (MLIP) with ESPResSo¶

Part 3: MD simulation of water using a MLIP¶

In this part of the tutorial, we will perform a MD simulation of the water box generated in part two using a MLIP within ESPResSo. All physical parameters follow the procedure described therein, thus, we start reading in the water.xyz file from Part 2.

In [ ]:

import zntrack
import ase.io
import warnings

box = ase.io.read("water.xyz")

Next, we also need to define the model used in the MD simulation.

Exercise:

Define the model using the command zntrack.from.rev(String) selecting the appropriate string according to the models trained in Part 2 of this tutorial.

Hint:

To investigate the dependence of the results on the $r_{\mathrm{max}}$, the argument passed to zntrack.from.rev can be choosen from the following options: r_max_2_Apax, r_max_3_Apax, r_max_4_Apax, r_max_5_Apax, r_max_5_5_Apax, r_max_6_Apax.

In [ ]:

# SOLUTION CELL
with warnings.catch_warnings(action="ignore"):
    model = zntrack.from_rev("r_max_5_5_Apax")

By using the generated model.get_calculater(), we can obtain the trained interaction calculator calc.

In [ ]:

with warnings.catch_warnings(action="ignore"):
    calc = model.get_calculator()
box.calc = calc

Here, we import the necessary ESPResSo features and external modules to combine ESPResSo with MLIP

In [ ]:

import espressomd
import espressomd.observables
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pint
from espressomd.plugins.ase import ASEInterface
import tqdm.auto as tqdm

Unit¶

In the majority of studies employing machine-learned interatomic potentials, energies are expressed in electronvolts (eV), distances in angstroms, and masses in atomic units (since there are typically also the units of the training data). Although machine-learning algorithms are agnostic to the choice of units, they must be trained within a consistent unit system; thus, it is essential to identify the units being used. In atomistic simulations, temporal scales are commonly expressed in femtoseconds, lengths in Angstrom and masses in atomic mass units. Consistent with this practice, ASE adopts the same unit system as its default. In the present tutorial, we employ pint to manage units.

Exercise:

Create an instance of the pint.UnitRegistry to enable unit conversions throughout the tutorial. Assign the instance to the variable ureg.

Hint:

You may find it useful to consult the Tutorial section of pint which provides examples of initializing a unit registry.

In [ ]:

# SOLUTION CELL
ureg = pint.UnitRegistry()

Setup System for ESPResSo¶

All parameters, i.e, system size, the number of particles, the particle positions, etc., follow from the box object of ASE. For simplicity, the type of each particle is defined via their respective atomic number.

In [ ]:

system = espressomd.System(box_l=box.get_cell().diagonal())
system.periodicity = [True, True, True]
system.time_step = (1.0 * ureg.fs).m_as(
    ((1 * ureg.u * ureg.angstrom**2) / ureg.electron_volt) ** 0.5
)
system.cell_system.skin = 1  # Angstrom
system.part.clear()
for atom in box:
    system.part.add(pos=atom.position, type=atom.number, mass=atom.mass)

To combine ESPResSo and MLIP provided by ASE, we use espressomd.plugins.ase.ASEInterface. Since we know the particle types will not change during the simulation, we set assume_constant_types=True to avoid unnecessary data transfer between ESPResSo and ASE.

Note that the interface provides no mapping between ESPResSo particle types and ASE atom types. ESPResSo particle types will be directly interpreted as atomic numbers in ASE.

In [ ]:

system_ase = espressomd.plugins.ase.ASEInterface(
    system=system,
    particle_slice=system.part.all(),
    assume_constant_types=True
)

Steepest Descent¶

We use steepest descent to minimize the energy of the system.

In [ ]:

system.integrator.set_steepest_descent(
    f_max=(0.1 * ureg.eV / ureg.angstrom).magnitude,
    max_displacement=(0.001 * ureg.angstrom).magnitude,
    gamma=4.,
)

MD integration can be performed following these steps for each time step:

Update ASE with current particle data: atoms = system.ase.get()
Set the calculater of ASE as trained interaction calculater box.calc
Get forces from the calculator
Set external forces on particles: system.part.all().ext_force = atoms.get_forces()
Run one integration step

Here, we emphasize that this procedure assigns the force provided by ASE to particle.ext_force. In other words, particle.ext_force is overwritten at each step. Apart from this, all ESPResSo interactions can be used with MLIPs.

Exercise:

Integrate the sytem for 2500 steps in total.

Hint:

In this tutorial, the ASE calc should be set to the trained interaction calculator box.calc.

It is important to note that the ase_interface is designed to be flexible: it can work with any MLIP provided through ASE, and is not limited to the model trained in this tutorial.

In [ ]:

# SOLUTION CELL
tbar = tqdm.trange(250)
for _ in tbar:
    atoms = system_ase.integrate(10, box.calc)
    tbar.set_description(f"E_pot: {system_ase.atoms.get_potential_energy() / 1000.:.4f} keV")

Equilibration¶

Then, we turn on the Langevin thermostat and run a warmup integration with a symplectic Euler algorithm. Here we set the temperature as 298.15 K.

In [ ]:

system.thermostat.set_langevin(
    kT=(298.15 * ureg.K * ureg.boltzmann_constant).m_as("eV"), gamma=1., seed=42
)
system.integrator.set_symplectic_euler()

In [ ]:

tbar = tqdm.trange(100)
count_uncertainty = 0
for idx in tbar:
    system_ase.integrate(1, box.calc)
    uncertainty = box.calc.results["forces_uncertainty"].max()
    if uncertainty > 1.:
        count_uncertainty += 1
    tbar.set_description(f"Uncertainty: {uncertainty:.3f}")
if count_uncertainty > 0:
    print("Model unavailable - SELECT A DIFFERENT MODEL")

In [ ]:

tbar = tqdm.trange(500)
count_uncertainty = 0
for idx in tbar:
    system_ase.integrate(1, box.calc)
    tbar.set_description(f"E_pot: {system_ase.atoms.get_potential_energy() / 1000.:.3f} keV")

Sampling of radial distribution function¶

We now accumulate the radial distribution function (RDF) of oxygen-oxgen pairs.

In [ ]:

rdfs = []
rdf_samples = 100
rdf_bins = 80
r_min = 0.
r_max = 8.  # angstrom
o_ids = system.part.select(type=8).id
rdf_obs = espressomd.observables.RDF(
    ids1=o_ids, ids2=o_ids, n_r_bins=rdf_bins, min_r=r_min, max_r=r_max
)

sampling_period = 100
tbar = tqdm.trange(rdf_samples)
for idx in tbar:
    system_ase.integrate(sampling_period, box.calc)
    rdfs.append(rdf_obs.calculate())
rdf = np.mean(rdfs, axis=0)
rdf_err = np.std(rdfs, axis=0) / np.sqrt(len(rdfs) - 1)
bin_centers = rdf_obs.bin_centers()

Plot RDF¶

Finally, we compare both the classical forcefield simulation (from Part 1) and the one using our MLIP against the experimental RDF.

Again we expect the first peak in the RDF appears at around 3 Å, while the second peak occurs at approximately 4.5–4.8 Å, which is characteristic of the tetrahedral structure.

In [ ]:

data_neutron = pd.read_csv("soper2013tab5.csv")
data_tip4p = np.load("rdf_tip4p.npz")

fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(data_neutron["rs"], data_neutron["rdf"], label="neutron scattering")
ax.plot(bin_centers, rdf, label=r"MLIP: $r_{\mathrm{max}}=5.5\ \mathrm{\AA}$")
ax.plot(10. * data_tip4p["rs"], data_tip4p["rdf"], label="TIP4P/2025")
plt.legend()
plt.xlabel(r"r / $\mathrm{\AA}$")
plt.ylabel("oxygen-oxygen RDF g(r)")
plt.show()

We observe that the first peak of the MLIP is much more faithful to the experimental data than the TIP4P/2005 classical forcefield. Subsequent peaks are also correctly positioned, although their magnitude have small discrepancies. To investigate these discrepancies, we can repeat the simulation with MLIPs trained with different cutoff values.

To save time, these RDFs were already prepared for $r_{\mathrm{max}} \in \{2, 3, 4, 5, 5.5, 6\}$ Å and stored in NumPy datasets whose filepaths follow the pattern data/rdf_rmax_{2,3,4,5,5_5,6}.npz.

Exercise:

Plot the RDFs for the different values of $r_{\mathrm{max}}$ and compare their features.
Analyze the origin of the differences observed between the RDFs.

Hint:

The .npz files can be loaded using np.load() like so:

data = {key: np.load(f"data/rdf_rmax_{key}.npz") for key in "2 3 4 5 5_5 6".split()}

In [ ]:

# SOLUTION CELL
data = {key: np.load(f"data/rdf_rmax_{key}.npz") for key in "2 3 4 5 5_5 6".split()}
fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(data_neutron["rs"], data_neutron["rdf"], label="neutron scattering")
for key in ["6", "2"]:
    ax.plot(data[key]["rs"], data[key]["rdf"], label=rf"MLIP: $r_{{\mathrm{{max}}}}={key.replace('_', '.')}\ \mathrm{{\AA}}$")
ax.set_ylim([-0.2, 3.5])
plt.legend()
plt.xlabel(r"r / $\mathrm{\AA}$")
plt.ylabel("oxygen-oxygen RDF g(r)")
plt.show()

The differences between the RDFs can be attributed to the predictive capabilities of the MLIP. While models with the same cutoff but otherwise different parameters tend to yield similar RDFs, their discrepancies can be explained by errors in the energy and force predictions. In contrast, differences between RDFs from MLIPs with different cutoffs arise from their inherently different predictive abilities, particularly due to the short-sightedness of models with small cutoffs. If a model can only predict based on a cutoff ending after the first coordination shell, long-range interactions must be inferred from short-range structures. Although this can produce reasonable results—and is one of the reasons why MLIPs often perform so well—a very short cutoff will lead to incorrect structural properties.

If the cutoff is too small, the model can't capture the necessary interactions which can cause the structural differences observed in the RDFs.