Training

Training: Overview

During training the parameters of a potential are fitted to reproduce the target properties provided by a dataset. These are typically energies and forces, but can also be other properties provided by a dataset (e.g., dipole moments).

The properties a given potential can be trained on are determined by the potential itself and by the loss function used to train the potential. Each of the models implemented in modelforge have flexible numbers of output heads, each of which can be fitted against a different scalar/tensor property.

Modelforge uses Pytorch Lightning to train models. The training process is controlled by a PotentialTrainer object, which is responsible for managing the training loop, the optimizer, the learning rate scheduler, and the early stopping criteria. The training process is controlled by a configuration file, training.toml, which specifies the number of epochs, the learning rate, the loss function, and the splitting strategy for the dataset. The training process can be started by

Training Configuration

The training process is controlled by a configuration file, typically written in TOML format. This file includes various sections for configuring different aspects of the training, such as logging, learning rate scheduling, loss functions, dataset splitting strategies, and early stopping criteria.

Training Configuration

[training]
number_of_epochs = 2
remove_self_energies = true
shift_center_of_mass_to_origin = false
batch_size = 128
lr = 5e-4
monitor = "val/per_system_energy/rmse" # Common monitor key
plot_frequency = 1
gradient_clip_val = 5.0
# ------------------------------------------------------------ #
[training.experiment_logger]
logger_name = "tensorboard" # this will set which logger to use
[training.experiment_logger.tensorboard_configuration]
save_dir = "logs"
# ------------------------------------------------------------ #
[training.experiment_logger.wandb_configuration]
save_dir = "logs"
project = "tests"
group = "exp00"
log_model = true
job_type = "testing"
tags = ["v_0.1.0"]
notes = "testing training"
# ------------------------------------------------------------ #
# Learning Rate Scheduler Configuration
[training.lr_scheduler]
scheduler_name = "ReduceLROnPlateau"
frequency = 1
interval = "epoch"
monitor = "val/per_system_energy/rmse"
mode = "min"
factor = 0.1
patience = 10
threshold = 0.1
threshold_mode = "abs"
cooldown = 5
min_lr = 1e-8
eps = 1e-8                             # Optional, default is 1e-8
# ------------------------------------------------------------ #
[training.loss_parameter]
loss_components = ['per_system_energy'] #, 'per_atom_force']
# ------------------------------------------------------------ #
[training.loss_parameter.weight]
per_system_energy = 1.0
# ------------------------------------------------------------ #
[training.early_stopping]
verbose = true
min_delta = 0.001
patience = 50
# ------------------------------------------------------------ #
[training.splitting_strategy]
name = "random_record_splitting_strategy"
data_split = [0.8, 0.1, 0.1]
seed = 42
# ------------------------------------------------------------ #

The TOML files are split into four categories: potential, training, runtime and dataset. While it is possible to read in a single TOML file defining fields for all three categories, often it is useful to read them in separately (a common use case where this is useful is when a potential is trained on different datasets — instead of repeating the training and potential sections in each TOML file, only the dataset.toml file needs to be changed).

Note, as discussed in the Dataset Module documentation, the dataset can have the self-energy removed (remove_self_energies option set to true’). The dataset can also have the energy shifted (after self-energy removal, if performed) using the `shift_energies option; this can be set to [‘min’, ‘max’, or ‘mean’] to shift the energies accordingly (min shifts the lowest energy to 0, max shifts the highest energy to 0, and mean shifts the mean energy to 0). shift_center_of_mass_to_origin can also be set to true to center the coordinates of each system at the origin, necessary for computing the dipole moments and quadrupole moments correctly.

Learning rate scheduler

The learning rate scheduler is responsible for adjusting the learning rate during training. Modelforge uses the REduceLROnPlateau scheduler, which reduces the learning rate by a factor when the RMSE of the energy prediction on the validation set does not improve for a given number of epochs. The scheduler is controlled by the parameters in the [training.lr_scheduler] section of the training.toml file.

Loss function

The loss function quantifies the discrepancy between the model’s predictions and the target properties, providing a scalar value that guides the optimizer in updating the model’s parameters. This function is configured in the [training.loss] section of the training TOML file.

Depending on the specified loss_components` section, the loss function can combine various individual loss functions. Modelforge always includes the mean squared error (MSE) for energy prediction, and may also incorporate MSE for force prediction, dipole moment prediction (computed from predicted partial charges), quadrupole moment prediction (computed from predicted partial charges), total_charge, and partial charge prediction.

The following keys can be used in the loss_components section. Note, to use these losses, corresponding known values need to be set within modelforge (see the dataset documentation for more details regarding defining the PropertyNames). Each key is listed with corresponding properties that need to be defined in the dataset toml file:

per_system_energy’ – corresponds to `E in the property names defitinion.
per_atom_force – corresponds to F in the property names definition.
per_system_total_charge – corresponds to total_charge in the property names definition.
per_system_dipole_moment – corresponds to dipole_moment in the property names definition.
per_system_quadrupole_moment – corresponds to quadrupole_moment in the property names definition.
per_atom_charge – corresponds to partial_charges in the property names definition.

The design of the loss function is intrinsically linked to the structure of the energy function. For instance, if the energy function aggregates atomic energies, then loss_components should include per_system_energy and optionally, per_atom_force.

For each component in the loss, a weight must be specified in the loss_weights section of the training TOML file. The weights determine the relative importance of each component in the overall loss function. The loss function is then computed as a weighted sum of the individual components.

The weights for the losses can also be changed during the training run by providing a target weight (i.e., the final weight) and a mixing step (i.e., the step size for the linear interpolation between the current and target weights). This is useful for gradually increasing/decreasing the importance of certain loss components during training, which can help stabilize the training process. Note, that the target weights and mixing steps are optional, and if not provided, the loss weights will remain constant throughout the training process. In the snippet below, only per_system_dipole_moment has a target weight and a mixing step defined, meaning that the weight for this component will be gradually increased from 0.1 to 1 by increments of 0.01 over the course of training, while the other components will remain at their initial weights.

[training.loss_parameter]
loss_components = ['per_system_energy', 'per_system_dipole_moment', "per_atom_charge"]

[training.loss_parameter.weight]
per_system_energy = 0.00001
per_system_dipole_moment = 0.1
per_atom_charge = 1

[training.loss_parameter.target_weight]
per_system_dipole_moment = 1

[training.loss_parameter.mixing_steps]
per_system_dipole_moment= 0.01

Predicting Short-Range Atomic Energies

If the total atomic energy is calculated with a short cutoff radius, we can directly match the sum of the atomic energies E_i (as predicted by the potential model architecture) to the total energy of the molecule (provided by the dataset).

In that case the total energy is calculated as

\[E = \sum_i^N E_i\]

The loss function can then be formulated as:

\[L(E) = (E - E^{pred})^2\]

Alternatively, if the dataset includes per atom forces, the loss function can be expressed as:

\[L(E,F) = w_E * (E - E^{pred})^2 + w_F \frac{1}{3N} \sum_i^N \sum_j^3 (F_{ij} - F_{ij}^{pred})^2\]

where w_E and w_F are the weights for the energy and force components, respectively.

Note, the ZBL potential can also be included, providing a repulsive interaction at extremely short distances (i.e., atomic overlap), as discussed in the Potentials documentation. As with electrostatic or vdw interactions, these contributions must be summed into the total per_system_energy using postprocessing operations. An example code block is included below that shows how to set up postprocessing operations to include the ZBL interactions, as well as how to sum these into the total per_system_energy.

[potential.postprocessing_parameter]
properties_to_process = ['per_atom_energy', 'per_system_zbl_energy',  'sum_per_system_energy']

[potential.postprocessing_parameter.per_atom_energy]
normalize = true
from_atom_to_system_reduction = true
keep_per_atom_property = true

[potential.postprocessing_parameter.per_system_zbl_energy]

# this will be added to the `per_system_energy` that results from the `per_atom_energy` reduction operation
[potential.postprocessing_parameter.sum_per_system_energy]
contributions = ['per_system_zbl_energy']

Predicting Longer-Range Interactions

In scenarios where longer-range interactions are considered, additional terms are incorporated into the loss function. There are two longer range interactions that are of interest: dispersion interactions and electrostatics.

To calculate electrostatics the first moment of the charge density (partial charges) is predicted by the machine learning potential. The atomic charges are then used to calculate the long range electrostatics. The expression for the total energy is then:

\[E = \sum_i^N E_i + k_c \sum_i^N \sum_{j>i}^N \frac{q_i q_j}{r_{ij}}\]

where k_c is the Coulomb constant, q_i and q_j are the atomic charges. To avoid singularities at r_{ij} = 0, the Coulomb interaction is modified following the approach taken by PhysNet (J. Chem. Theory Comput. 2019, 15, 3678-3693, DOI:10.1021/acs.jctc.9b00181, see eqn: 12 and 13 in the reference). Specifically,

\[E_coulomb = k_c \sum_i^N \sum_{j>i}^N q_i q_j\chi(r_{ij})\]

where:

\[\chi(r_{ij}) = \phi(2r_{ij})\frac{1}/{\sqrt{r_{ij}^2 + 1}} + (1-\phi(2r_{ij}))\frac{1}{r_{ij}}.\]

and phi(r) is represented by the PhysNet cutoff function:

\[\begin{split}\phi(r) = \begin{cases} 1-6(\frac{r}{r_{cut}})^5 + 15(\frac{r}{r_{cut}})^4 - 10(\frac{r}{r_{cut}})^3 & r \leq r_{cut} \\ 0 & r > r_{cut} \end{cases}\end{split}\]

The loss function to optimize charges is represented as, where we note that this includes both the partial charges and the dipole moment; these can be toggled on and off separately:

\[L(E,F,Q) = L(E,F) + w_Q (\sum_i^N q_i - Q_i^{pred})^2 + \frac{w_p}{3} \sum_j^3 (\sum_i^N q_i r_i,j - p_j^{ref})^2\]

where w_Q is the weight for the charge component, w_p the weight for the dipole moment component, and p_j^{ref} is the reference dipole moment.

van Der Waals interactions can also be calculated; this is done using the tad-DFTD3 package. These interactions are calculated outside of the machine learning potential, and added to the total per_system_energy.

As also discussed in the context of AIMNET2 in the Potentials documentation, to include electrostic and/or dispersion interactions, we must sum these into the total per_system_energy. This is not done automatically, and must be specificied in the provided .toml file, specifically as post processing operations. An example code block is included below that shows how to set up postprocessing operations to include electrostatics and van der Waals interactions, as well as how to sum these into the total per_system_energy.

[potential.postprocessing_parameter]
properties_to_process = ['per_atom_energy', 'per_system_electrostatic_energy', 'per_system_vdw_energy', 'sum_per_system_energy']

[potential.postprocessing_parameter.per_atom_energy]
normalize = true
from_atom_to_system_reduction = true
keep_per_atom_property = true


[potential.postprocessing_parameter.per_system_electrostatic_energy]
electrostatic_strategy = "coulomb"
maximum_interaction_radius = "15.0 angstrom"


[potential.postprocessing_parameter.per_system_vdw_energy]
maximum_interaction_radius = "10.0 angstrom"

# this will be added to the per_system_energy that results from the `per_atom_energy` reduction operation
[potential.postprocessing_parameter.sum_per_system_energy]
contributions = ['per_system_electrostatic_energy', 'per_system_vdw_energy']

Splitting Strategies

The dataset splitting strategy is crucial for ensuring that a potential generalizes well to unseen data. The recommended approach in modelforge is to randomly split the dataset into 80% training, 10% validation, and 10% test sets, based on molecules rather than individual conformations. This ensures that different conformations of the same molecule are consistently assigned to the same split, preventing data leakage and ensuring robust model evaluation.

Modelforge also provides other splitting strategies, including:

FirstComeFirstServeStrategy: Splits the dataset based on the order of records (molecules).
RandomSplittingStrategy: Splits the dataset randomly based on configurations.
RandomRecordSplittingStrategy: Splits the dataset randomly based on records (i.e., systems, keeping all configurations of a system in the same split).

To use a different data split ratio, you can specify a custom split list in the splitting strategy. The most effective way to pass this information to the training process is by defining the appropriate fields in the TOML file providing the training parameters, see the TOML file above.

Note, The RandomSplittingStrategy and RandomRecordSplittingStrategy classes accept a seed parameter, which can be set to ensure reproducibility of the data splits across different training runs. To provide further control, the RandomRecordSplittingStrategy class also accepts a test_seed parameter, which allows for independent control over the randomness of the test set split, separate from the training and validation splits. For a given dataset, if test_seed is not changed, the test set will remain the same even if the seed for the training and validation sets is altered (training and validation will change if seed is changed in this case). This feature is particularly useful for benchmarking and comparing model performance across different training configurations while keeping the test set consistent.

To specify in the toml file, use the following keywords for the strategy argument: - “first_come_first_serve_strategy” - “random_splitting_strategy” - “random_record_splitting_strategy”

For example, the following snippet would specify the RandomRecordSplittingStrategy with a 70/15/15 split, a seed of 42 for the training and validation sets, and a test_seed of 7 for the test set:

[dataset.splitting_strategy]
strategy = "random_record_splitting_strategy"
split = [0.7, 0.15, 0.15]
seed = 42
test_seed = 7

Train a Model

The best way to get started is to train a model. We provide a script to train one of the implemented models on a dataset using default configurations. The script, along with a default configuration TOML file, can be found in the scripts` directory. The TOML file provides parameters to train a TensorNet potential on the QM9 dataset on a single GPU.

The recommended method for training models is through the perform_training.py` script. This script automates the training process by reading the TOML configuration files, which define the potential (model), dataset, training routine, and runtime environment. The script then initializes a Trainer class from PyTorch Lightning, which handles the training loop, gradient updates, and evaluation metrics. During training, the model is optimized on the specified dataset according to the defined potential and training routine. After training completes, the script outputs performance metrics on the validation and test sets, providing insights into the model’s accuracy and generalization.

Additionally, the script saves the trained model with checkpoint files in the directory specified by the [save_dir]` field in the runtime section of the TOML file, enabling you to resume training or use the potential model for inference later.

To initiate the training process, execute the following command in your terminal (inside the scripts directory):

python perform_training.py
    --condensed_config_path="config.toml"
    --accelerator="gpu"
    --device=1

This command specifies the path to the configuration file, selects GPU acceleration, and designates the device index to be used during training. Adjust these parameters as necessary to fit your computational setup and training needs.