modelforge.curate : Record and SourceDataset

This notebook focuses on functionality within the Record and SourceDataset classes.

[1]:
from modelforge.curate import Record, SourceDataset
from modelforge.utils.units import GlobalUnitSystem
from modelforge.curate import AtomicNumbers, Positions, Energies, Forces, MetaData

from openff.units import unit

import numpy as np

Initializing records and datasets

To start, we will create a new instance of the SourceDataset class to store the dataset. We will populate this with 10 records, each with 3 configurations.

[2]:
new_dataset = SourceDataset(name="test_dataset")

for i in range(0,10):
    record = Record(f"mol_{i}")

    atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))

    positions = Positions(
        value=np.array([[[i, 1.0, 1.0], [2.0, 2.0, 2.0]],
                        [[i, 2.0, 1.0], [2.0, 2.0, 2.0]],
                        [[i, 3.0, 1.0], [2.0, 2.0, 2.0]]]),
        units="nanometer"
    )

    total_energies = Energies(
        name="total_energies",
        value=np.array([[i],
                        [i+0.1],
                        [i+0.2]]),
        units=unit.hartree
    )
    forces = Forces(
        name="forces",
        value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
                        [[10.0, 2.0, 1.0], [2.0, 2.0, 2.0]],
                        [[20.0, 3.0, 1.0], [2.0, 2.0, 2.0]]]),
        units = unit.kilocalorie_per_mole/unit.nanometer,
    )
    record.add_properties([atomic_numbers, positions, total_energies, forces])
    new_dataset.add_record(record)
2025-08-25 12:06:03.587 | WARNING  | modelforge.curate.sourcedataset:__init__:66 - Database file test_dataset.sqlite already exists in ./. Removing it.
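
The arrays passed in above follow a shape pattern that can be checked with plain NumPy before any property objects are built. The sketch below assumes the convention implied by the example: atomic numbers as (n_atoms, 1), per-atom properties as (n_configs, n_atoms, 3), and per-system properties as (n_configs, 1). The helper `check_shapes` is hypothetical and not part of modelforge.

```python
import numpy as np


def check_shapes(atomic_numbers, positions, energies):
    """Return (n_atoms, n_configs) if the arrays are mutually consistent.

    Hypothetical helper, not part of modelforge; it only mirrors the
    shape convention implied by the example above.
    """
    n_atoms = atomic_numbers.shape[0]
    n_configs = positions.shape[0]
    if atomic_numbers.shape != (n_atoms, 1):
        raise ValueError("atomic_numbers should have shape (n_atoms, 1)")
    if positions.shape != (n_configs, n_atoms, 3):
        raise ValueError("positions should have shape (n_configs, n_atoms, 3)")
    if energies.shape != (n_configs, 1):
        raise ValueError("energies should have shape (n_configs, 1)")
    return n_atoms, n_configs


atomic_numbers = np.array([[1], [6]])        # 2 atoms
positions = np.zeros((3, 2, 3))              # 3 configurations, 2 atoms, xyz
energies = np.array([[0.0], [0.1], [0.2]])   # one energy per configuration
shape_info = check_shapes(atomic_numbers, positions, energies)
print(shape_info)
```

Running a check like this first can make shape mismatches easier to diagnose than an error raised deep inside property construction.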

Examining the dataset

Let us examine the dataset:

[3]:
print("total configurations: ", new_dataset.total_configs())
print("total records: ", new_dataset.total_records())

import pprint
print("dataset summary:")
pprint.pprint(new_dataset.generate_dataset_summary())

total configurations:  30
total records:  10
dataset summary:
{'name': 'test_dataset',
 'properties': {'atomic_numbers': {'classification': 'atomic_numbers'},
                'forces': {'classification': 'per_atom',
                           'units': 'kilojoule_per_mole / nanometer'},
                'positions': {'classification': 'per_atom',
                              'units': 'nanometer'},
                'total_energies': {'classification': 'per_system',
                                   'units': 'kilojoule_per_mole'}},
 'total_configurations': 30,
 'total_records': 10}
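
The summary lists energies in kilojoule_per_mole and forces in kilojoule_per_mole / nanometer, although the properties were supplied in hartree and kilocalorie_per_mole / nanometer. The notebook does not say where this conversion takes place, so the sketch below only shows the arithmetic for comparing values across these units by hand. It uses standard conversion factors; the constant and function names are illustrative and not taken from modelforge.

```python
# Standard conversion factors (not taken from modelforge).
HARTREE_TO_KJ_PER_MOL = 2625.4996  # 1 hartree expressed in kJ/mol
KCAL_TO_KJ = 4.184                 # 1 thermochemical kcal expressed in kJ


def hartree_to_kj_per_mol(value):
    """Convert an energy from hartree to kJ/mol."""
    return value * HARTREE_TO_KJ_PER_MOL


def kcal_to_kj(value):
    """Convert kcal-based quantities (e.g. kcal/mol/nm) to kJ-based ones."""
    return value * KCAL_TO_KJ


energy_kj = hartree_to_kj_per_mol(0.1)  # second configuration of mol_0
force_kj = kcal_to_kj(10.0)             # one force component from the example
print(energy_kj, force_kj)
```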

Extracting/Updating records

Extract a copy of a record

We can extract a copy of any record using the get_record function.

[5]:
record_temp = new_dataset.get_record("mol_0")
print(record_temp)
name: mol_0
* n_atoms: 2
* n_configs: 3
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]],

       [[0., 2., 1.],
        [2., 2., 2.]],

       [[0., 3., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=3 n_atoms=2
 -  name='forces' value=array([[[ 1.,  1.,  1.],
        [ 2.,  2.,  2.]],

       [[10.,  2.,  1.],
        [ 2.,  2.,  2.]],

       [[20.,  3.,  1.],
        [ 2.,  2.,  2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=3 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0. ],
       [0.1],
       [0.2]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=3 n_atoms=None
* meta_data: ([])

Update a record in the dataset

Since get_record returns a copy, changes made to the returned record do not affect the dataset until update_record is called to write them back. Here we add metadata to the record and then update it in the dataset.

[6]:
smiles = MetaData(name='smiles', value='[CH]')

record_temp.add_property(smiles)

new_dataset.update_record(record_temp)

new_dataset.print_record("mol_0")
name: mol_0
* n_atoms: 2
* n_configs: 3
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]],

       [[0., 2., 1.],
        [2., 2., 2.]],

       [[0., 3., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=3 n_atoms=2
 -  name='forces' value=array([[[ 1.,  1.,  1.],
        [ 2.,  2.,  2.]],

       [[10.,  2.,  1.],
        [ 2.,  2.,  2.]],

       [[20.,  3.,  1.],
        [ 2.,  2.,  2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=3 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0. ],
       [0.1],
       [0.2]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=3 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None
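
The copy-then-update pattern can be illustrated without modelforge. The sketch below is a toy stand-in (the `ToyStore` class is hypothetical and uses plain dictionaries instead of Record objects); it only demonstrates why a change made to a returned copy is invisible to the store until it is explicitly written back.

```python
import copy


class ToyStore:
    """Hypothetical stand-in for a dataset that hands out copies."""

    def __init__(self):
        self._records = {}

    def add_record(self, name, record):
        self._records[name] = copy.deepcopy(record)

    def get_record(self, name):
        # Return a copy so callers cannot mutate stored state by accident.
        return copy.deepcopy(self._records[name])

    def update_record(self, name, record):
        # Explicitly write the modified copy back into the store.
        self._records[name] = copy.deepcopy(record)


store = ToyStore()
store.add_record("mol_0", {"meta_data": {}})

temp = store.get_record("mol_0")
temp["meta_data"]["smiles"] = "[CH]"

unchanged = store.get_record("mol_0")["meta_data"]  # store not yet modified
store.update_record("mol_0", temp)
updated = store.get_record("mol_0")["meta_data"]    # change is now stored
print(unchanged, updated)
```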

Removing a record from a dataset

We can remove a record using the remove_record function in the SourceDataset class.

[7]:
print("total_records: ", new_dataset.total_records())
new_dataset.remove_record("mol_9")
print("total_records: ", new_dataset.total_records())
total_records:  10
total_records:  9

Slicing a record

We can slice a record, returning a copy of the record that includes only a subset of its configurations. The slice is applied to all per-configuration properties within the record.

This can be done at the level of a record or called via a wrapping function in the dataset.

The code below returns only the first configuration (min=0, max=1) out of the 3 total.

[8]:
record_sliced = record_temp.slice_record(min=0, max=1)

print(record_sliced)
name: mol_0
* n_atoms: 2
* n_configs: 1
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
 -  name='forces' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=1 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0.]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None

[9]:
record_sliced = new_dataset.slice_record("mol_0", min=0, max=1)
print(record_sliced)
name: mol_0
* n_atoms: 2
* n_configs: 1
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
 -  name='forces' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=1 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0.]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None
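
The output above (min=0, max=1 giving a single configuration) suggests that slicing behaves like a half-open Python slice along the configuration axis. The NumPy sketch below works under that assumption; `slice_configs` is a hypothetical helper, not the modelforge implementation.

```python
import numpy as np


def slice_configs(per_config_arrays, start, stop):
    """Slice every array along its first (configuration) axis.

    Assumes half-open [start, stop) semantics, as in Python slicing.
    """
    return {name: arr[start:stop] for name, arr in per_config_arrays.items()}


props = {
    "positions": np.arange(18, dtype=float).reshape(3, 2, 3),
    "total_energies": np.array([[0.0], [0.1], [0.2]]),
}
sliced = slice_configs(props, 0, 1)
print({name: arr.shape for name, arr in sliced.items()})
```

Properties without a configuration axis, such as atomic numbers and metadata, are unchanged in the sliced record shown above, so they are left out of this sketch.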

Limiting to a subset of atomic numbers

We can query whether a record contains only atomic numbers within a specified set using the contains_atomic_numbers function in the Record class. This returns True if every atomic number in the record appears in the provided array, and False if any atomic number in the record is missing from it.

Note, this function will not typically need to be called directly, as the subset_dataset function in the SourceDataset provides a wrapper for this functionality on the entire dataset (discussed separately later).

[10]:
record_temp.contains_atomic_numbers(np.array([1,6]))
[10]:
True
[11]:
record_temp.contains_atomic_numbers(np.array([1,8]))
[11]:
False
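
The check described above amounts to a subset test. A minimal sketch using Python sets follows; the function name `all_in_allowed` is illustrative and this is not the modelforge implementation.

```python
def all_in_allowed(record_atomic_numbers, allowed):
    """True if every atomic number in the record is in the allowed set."""
    return set(record_atomic_numbers) <= set(allowed)


result_ok = all_in_allowed([1, 6], [1, 6])           # H and C both allowed
result_bad = all_in_allowed([1, 6], [1, 8])          # C is not allowed
result_superset = all_in_allowed([1, 6], [1, 6, 8])  # extra allowed elements are fine
print(result_ok, result_bad, result_superset)
```

Under this reading, the provided array may list more elements than the record contains; only elements present in the record but absent from the array cause a False result.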

Removing high force configurations

Often, we wish to remove configurations with very high forces. The remove_high_force_configs function in the Record class returns a copy of the record that excludes those configurations where the magnitude of the force exceeds the specified threshold. By default, this filters on the property named “forces” (i.e., it looks for a property with the name “forces” within the record); a different name can be passed if the force property is named differently.

Note, this function will not typically need to be called directly, as the subset_dataset function in the SourceDataset provides a wrapper for this functionality on the entire dataset (discussed separately later).

For example, below we filter out any configurations with a force greater than 15 kcal/mol/nm, which eliminates the last configuration of the record (see initialization above).

[12]:
record_max_force = record_temp.remove_high_force_configs(unit.Quantity(15, unit.kilocalorie_per_mole/unit.nanometer), "forces")

print(record_max_force)
name: mol_0
* n_atoms: 2
* n_configs: 2
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]],

       [[0., 2., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2
 -  name='forces' value=array([[[ 1.,  1.,  1.],
        [ 2.,  2.,  2.]],

       [[10.,  2.,  1.],
        [ 2.,  2.,  2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=2 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0. ],
       [0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None
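
The filtering step can be sketched with NumPy. The notebook does not define "magnitude of the force" precisely, so the sketch below assumes it means the largest per-atom force vector norm within a configuration; the actual implementation may use a different criterion (for example, the largest absolute component). Both readings give the same result for this data.

```python
import numpy as np

# Forces from the example record, shape (n_configs, n_atoms, 3).
forces = np.array([
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
    [[10.0, 2.0, 1.0], [2.0, 2.0, 2.0]],
    [[20.0, 3.0, 1.0], [2.0, 2.0, 2.0]],
])
threshold = 15.0  # same units as the forces

# Assumed criterion: largest per-atom vector norm in each configuration.
per_atom_norm = np.linalg.norm(forces, axis=-1)  # shape (n_configs, n_atoms)
max_per_config = per_atom_norm.max(axis=1)       # shape (n_configs,)

keep = max_per_config <= threshold               # boolean mask over configs
filtered = forces[keep]
print(keep, filtered.shape)
```

The same boolean mask would be applied to every per-configuration property (positions, energies) so that all arrays stay aligned.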

Subsetting a dataset

SourceDataset includes a function called subset_dataset which returns a copy of the dataset with various filters applied. The filters that can be applied include:

  • total_records: Maximum number of records to include in the subset.

  • total_configurations: Total number of conformers to include in the subset.

  • max_configurations_per_record: Maximum number of conformers to include per record. If None, all conformers in a record will be included.

  • atomic_numbers_to_limit: An array of atomic species to limit the dataset to. Any molecules that contain elements outside of this list will be ignored.

  • max_force: If set, configurations with forces greater than this value will be removed.

  • final_configuration_only: If True, only the final configuration of each record will be included in the subset.

  • max_configurations_per_record_order: Can be “start”, “end”, or “random”, which defines whether configurations are taken from the start of the underlying array, the end of the array, or chosen randomly, respectively. Note, users can also pass the “seed” used to initialize the random number generator that performs the random selection, to ensure reproducibility and/or generate unique subsets.

Note, total_records and total_configurations cannot be used together.

Below, we create a new dataset limited to at most 2 configurations per record and 10 configurations in total.

[13]:
dataset_subset = new_dataset.subset_dataset(new_dataset_name="dataset_subset", total_configurations=10, max_configurations_per_record=2)

print(dataset_subset.total_records())
print(dataset_subset.total_configs())
2025-08-25 12:09:41.588 | WARNING  | modelforge.curate.sourcedataset:__init__:66 - Database file dataset_subset.sqlite already exists in ./. Removing it.
5
10
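
The result above (5 records, 10 configurations) is consistent with a simple greedy selection: take up to max_configurations_per_record from each record in turn until total_configurations is reached. The sketch below illustrates that logic only; the iteration order and the exact algorithm inside subset_dataset are assumptions, and `plan_subset` is a hypothetical helper.

```python
def plan_subset(configs_per_record, total_configurations, max_per_record):
    """Decide how many configurations to take from each record.

    Greedy sketch: visit records in order and stop once the total is reached.
    """
    plan = {}
    remaining = total_configurations
    for name, n_configs in configs_per_record.items():
        if remaining <= 0:
            break
        take = min(n_configs, max_per_record, remaining)
        plan[name] = take
        remaining -= take
    return plan


# Nine records with 3 configurations each (mol_9 was removed earlier).
configs_per_record = {f"mol_{i}": 3 for i in range(9)}
plan = plan_subset(configs_per_record, total_configurations=10, max_per_record=2)
print(len(plan), sum(plan.values()))
```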

SourceDataset backend sqlite database

The SourceDataset class stores records within a sqlite database rather than in memory. The name and location of this database can be set when the dataset is instantiated. If these are not set, the default location is “./” and the database is named after the dataset (with any spaces replaced by underscores). The code below sets both values explicitly to what the defaults would be, so it produces the same database as omitting them.

[14]:
new_dataset2 = SourceDataset(name="new dataset2", local_db_dir="./", local_db_name="new_dataset2.sqlite")
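
The default naming rule described above can be written out as a small function. This is a sketch under stated assumptions: the ".sqlite" suffix is inferred from the file names shown in this notebook, and `default_db_path` is a hypothetical helper rather than a modelforge function.

```python
from pathlib import Path


def default_db_path(dataset_name, local_db_dir="./", local_db_name=None):
    """Build the database path, deriving the file name when none is given."""
    if local_db_name is None:
        # Replace spaces with underscores and append the assumed suffix.
        local_db_name = dataset_name.replace(" ", "_") + ".sqlite"
    return Path(local_db_dir) / local_db_name


db_path = default_db_path("new dataset2")
print(db_path.name)
```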

The use of a sqlite backend not only reduces the memory footprint, but also allows a dataset to be loaded from an existing database. Being able to load from the database allows us to avoid having to go through the processing of a dataset (i.e., setting up individual properties, Records, etc.).

The following code will load up the subsetted dataset generated in the prior cells:

[15]:
new_dataset2  = SourceDataset(name="new dataset2", local_db_dir="./", local_db_name="dataset_subset.sqlite", read_from_local_db=True)

print(new_dataset2.total_records())
print(new_dataset2.total_configs())
5
10
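
The benefit of a file-backed store can be shown with Python's built-in sqlite3 module. The sketch below is not modelforge's schema; the table layout, JSON payload, and file name are all hypothetical. It only demonstrates that data written in one session can be read back in another without being rebuilt.

```python
import json
import os
import sqlite3
import tempfile

# Use a temporary directory so the example leaves the working directory alone.
db_file = os.path.join(tempfile.mkdtemp(), "toy_dataset.sqlite")

# First "session": write records, then close the connection.
conn = sqlite3.connect(db_file)
conn.execute("CREATE TABLE records (name TEXT PRIMARY KEY, payload TEXT)")
for i in range(5):
    payload = json.dumps({"n_configs": 2})
    conn.execute("INSERT INTO records VALUES (?, ?)", (f"mol_{i}", payload))
conn.commit()
conn.close()

# Second "session": reopen the same file and read the records back.
conn = sqlite3.connect(db_file)
rows = conn.execute("SELECT payload FROM records").fetchall()
conn.close()

n_records = len(rows)
n_configs = sum(json.loads(payload)["n_configs"] for (payload,) in rows)
print(n_records, n_configs)
```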

When subsetting a dataset, we can also specify the name and location of the database that will be generated. Otherwise, the same default behavior is used (i.e., based on dataset name). This function will return an error if the new and old datasets have the same name.
