modelforge.curate

The curate module provides an API for creating modelforge-compatible HDF5 datasets from raw data. The module is built on a hierarchy of classes that define the structure of a dataset.

Basic Usage

At the top level is SourceDataset. Each SourceDataset instance contains records, which are instances of the Record class. Each Record instance contains properties, which are defined using pydantic models that inherit from the PropertyBaseClass.

Using pydantic models allows extensive validation of the input properties. The API is designed to emphasize validation when a dataset is constructed rather than when it is loaded. This includes checking that units are compatible and that the number of atoms and the number of configurations are consistent within a record.
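The consistency check described above can be illustrated with a small standard-library sketch. It does not use modelforge or pydantic: ToyRecord and shape_of are hypothetical names invented for this illustration, and the array shapes (atoms by 1, configurations by atoms by 3, configurations by 1) are assumptions taken from the fictitious data used in this section.

```python
from dataclasses import dataclass, field


def shape_of(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


@dataclass
class ToyRecord:
    """Hypothetical stand-in for a record; NOT the modelforge Record class."""

    name: str
    atomic_numbers: list = field(default_factory=list)  # (n_atoms, 1)
    positions: list = field(default_factory=list)  # (n_configs, n_atoms, 3)
    energies: list = field(default_factory=list)  # (n_configs, 1)

    def validate(self):
        """Raise ValueError if atom or configuration counts disagree."""
        n_atoms = shape_of(self.atomic_numbers)[0]
        n_configs, n_atoms_in_positions, n_dims = shape_of(self.positions)
        if n_dims != 3:
            raise ValueError(f"{self.name}: positions must have 3 coordinates")
        if n_atoms_in_positions != n_atoms:
            raise ValueError(
                f"{self.name}: positions describe {n_atoms_in_positions} atoms, "
                f"atomic_numbers describe {n_atoms}"
            )
        if shape_of(self.energies)[0] != n_configs:
            raise ValueError(f"{self.name}: energies and positions disagree "
                             "on the number of configurations")
        return True


# Same fictitious values as the curate example in this section.
record = ToyRecord(
    name="mol1",
    atomic_numbers=[[1], [6]],
    positions=[[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]],
    energies=[[0.1]],
)
is_valid = record.validate()
```

Failing early like this means a malformed record is reported while the dataset is being built, when the raw data is still at hand to correct it.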

The following example demonstrates how to use the curate API to create a dataset, using fictitious data.

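from modelforge.curate import SourceDataset, Record
from modelforge.curate.properties import AtomicNumbers, Positions, Energies, Forces, MetaData

from openff.units import unit
import numpy as np

# Create an empty dataset and a single record
new_dataset = SourceDataset(name="test_dataset")

record_mol1 = Record(name='mol1')

# Two atoms (H and C), one row per atom
atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))

# One configuration of two atoms; units may be given as a string
positions = Positions(
    value=np.array([[[1.0, 1.0, 1.0],
                    [2.0, 2.0, 2.0]]]),
    units="nanometer"
)

# One energy for the single configuration; units may also be a unit object
energies = Energies(
    name="total_energies",
    value=np.array([[0.1]]),
    units=unit.hartree
)

smiles = MetaData(name='smiles', value='[CH]')

record_mol1.add_properties([atomic_numbers, positions, energies, smiles])

# Attach the record to the dataset (add_record is assumed here;
# without this step the dataset would contain no records)
new_dataset.add_record(record_mol1)

new_dataset.validate_records()

checksum = new_dataset.to_hdf5(file_path="./", file_name="test_dataset.hdf5")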

The above code creates a dataset with a single record. The record contains the atomic numbers, positions, energies, and a metadata field for the SMILES string. The dataset is then validated and written to an HDF5 file, and the checksum of that file is returned. Jupyter notebooks in the examples folder demonstrate more comprehensive usage.
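A returned checksum is typically stored alongside the file so that the file can be verified later. The sketch below shows that idea with the standard library only. The helper name file_checksum is hypothetical, and the hash algorithm is an assumption: this section does not say which algorithm to_hdf5 uses, so pass whichever one matches the recorded value.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a temporary stand-in file, not a real dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.hdf5")
    with open(demo_path, "wb") as handle:
        handle.write(b"placeholder bytes, not real HDF5 content")

    # Record the checksum when the file is written.
    recorded = file_checksum(demo_path)

    # Recomputing on an untouched file gives the same value.
    unchanged = file_checksum(demo_path) == recorded

    # Any modification to the file changes the checksum.
    with open(demo_path, "ab") as handle:
        handle.write(b"!")
    changed = file_checksum(demo_path) != recorded
```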

Note that this module also includes classes for curating individual datasets. These classes inherit from the DatasetCuration base class defined in curation_baseclass.py. The base class provides a common interface for curating datasets; in particular, it provides a wrapper that makes it easier to select subsets of a dataset and to impose other restrictions when the dataset is written to an HDF5 file.
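As a rough illustration of what selecting a subset at write time can look like, the sketch below filters a toy collection of records before it would be written. It is not the DatasetCuration interface: select_records, max_records, and allowed_elements are hypothetical names chosen for this example, and the records are plain lists of atomic numbers.

```python
def select_records(records, max_records=None, allowed_elements=None):
    """Return the records that pass the restrictions, in original order.

    records maps a record name to its list of atomic numbers.
    """
    allowed = None if allowed_elements is None else set(allowed_elements)
    selected = {}
    for name, atomic_numbers in records.items():
        # Stop once the requested number of records has been gathered.
        if max_records is not None and len(selected) >= max_records:
            break
        # Skip records containing elements outside the allowed set.
        if allowed is not None and not set(atomic_numbers) <= allowed:
            continue
        selected[name] = atomic_numbers
    return selected


# Fictitious records: name -> atomic numbers.
records = {
    "mol1": [1, 6],
    "mol2": [1, 8, 1],
    "mol3": [6, 6, 1, 1],
    "mol4": [16, 1, 1],
}

# Keep at most two records made up only of hydrogen and carbon.
subset = select_records(records, max_records=2, allowed_elements=[1, 6])
```

Applying restrictions at write time keeps the full curated dataset intact in memory, so several differently restricted files can be produced from the same source.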

Examples