modelforge.curate : properties

This notebook will focus on a more thorough examination of defining properties.

[1]:

from modelforge.curate import Record, SourceDataset
from modelforge.utils.units import GlobalUnitSystem
from modelforge.curate.properties import AtomicNumbers, Positions, Energies, Forces, MetaData

from openff.units import unit

import numpy as np

Properties

Each property inherits from the PropertyBaseClass pydantic model and has the following fields:

name : str : unique identifier for the property
value : ndarray : array containing the values (note, the MetaData property allows this to be set to a str, int, float, and list in addition to a numpy array)
units : unit.Unit : unit of the value, from openff.units (the example below also passes the string "nanometer")
classification : PropertyClassification enum : specifies if the property is “atomic_numbers”, “per_atom”, “per_system”, or “meta_data”
property_type : PropertyType enum: specifies the type of property (e.g., length, energy, force, etc.) used for validating the specified units

classification and property_type are inherent to the property and do not need to be modified when a property is instantiated.

While a default value is set for the name field of each property (e.g., “energies” for the Energies property), this value should typically be set at the time of instantiation to a unique and appropriate key. Setting the name field is essential for records that contain multiple entries of the same type, e.g., multiple energies (total_energy, dispersion_energy, electronic_energy, etc.).

The following demonstrates defining a record with the properties “atomic_numbers”, “positions”, “total_energies”, “dispersion_energies”, and “smiles”:

[2]:

atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))

positions = Positions(
    value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]),
    units="nanometer"
)

total_energies = Energies(
    name="total_energies",
    value=np.array([[1]]),
    units=unit.hartree
)

dispersion_energies = Energies(
    name="dispersion_energies",
    value=np.array([[0.1]]),
    units=unit.hartree
)

smiles = MetaData(name='smiles', value='[CH]')

record_mol1 = Record(name='mol1')
record_mol1.add_properties([total_energies, dispersion_energies, atomic_numbers, positions, smiles])

print(record_mol1)

name: mol1
* n_atoms: 2
* n_configs: 1
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
 -  name='total_energies' value=array([[1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
 -  name='dispersion_energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None
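The n_atoms and n_configs values reported above can be read off the array shapes. The following is a minimal sketch using only NumPy, not modelforge itself. It assumes, based on the printed record, that atomic numbers are shaped (n_atoms, 1), per-atom properties (n_configs, n_atoms, 3), and per-system properties (n_configs, 1).

```python
import numpy as np

# Same arrays as in the cell above, without the property wrappers.
atomic_numbers = np.array([[1], [6]])                        # assumed (n_atoms, 1)
positions = np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]])   # assumed (n_configs, n_atoms, 3)
total_energies = np.array([[1]])                             # assumed (n_configs, 1)

n_atoms = atomic_numbers.shape[0]
n_configs = positions.shape[0]

# The per-atom array must agree with the atom count,
# and the per-system array with the configuration count.
assert positions.shape[1] == n_atoms
assert total_energies.shape[0] == n_configs

print(n_atoms, n_configs)
```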

As noted in the “basic_usage.ipynb” notebook, the name field is used as a unique key. An error will be raised if we try to add a property with the same key twice. E.g., the following will raise an error as we have already set the “total_energies”.

[3]:

record_mol1.add_property(total_energies)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 record_mol1.add_property(total_energies)

File ~/PycharmProjects/modelforge/modelforge-curate/modelforge/curate/record.py:433, in Record.add_property(self, property)
    429     error_msg = f"Property with name {property.name} already exists in the record {self.name}."
    430     error_msg += (
    431         f"Set append_property=True to append to the existing property."
    432     )
--> 433     raise ValueError(error_msg)
    435 assert (
    436     self.per_system[property.name].value.shape[1]
    437     == property.value.shape[1]
    438 )
    439 temp_array = property.value

ValueError: Property with name total_energies already exists in the record mol1.Set append_property=True to append to the existing property.
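The unique-key rule can be pictured as a dictionary keyed by the property name. The sketch below is a toy illustration only: ToyRecord is a hypothetical class written for this example and is not the modelforge Record implementation.

```python
class ToyRecord:
    """Hypothetical stand-in for a record; not the modelforge Record class."""

    def __init__(self, name):
        self.name = name
        self.properties = {}  # maps property name -> value

    def add_property(self, prop_name, value):
        # Reject a second property that reuses an existing name.
        if prop_name in self.properties:
            raise ValueError(
                f"Property with name {prop_name} already exists in the record {self.name}."
            )
        self.properties[prop_name] = value


toy = ToyRecord("mol1")
toy.add_property("total_energies", [[1]])
toy.add_property("dispersion_energies", [[0.1]])  # distinct name: accepted

try:
    toy.add_property("total_energies", [[1]])  # repeated name: rejected
    duplicate_rejected = False
except ValueError as err:
    duplicate_rejected = True
    print(err)
```

Giving each energy its own name, as in the cell that defines “total_energies” and “dispersion_energies”, is what lets both live in the same record.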

Appending properties

In some cases, we may not have data for all configurations available to use when instantiating a property. For example, the positions for different configurations may exist in different .xyz files. To handle these cases, the Record class can be instantiated with append_property set to True. In such cases, adding a property a second time will append the new data to the existing array.

For example, the following will initialize the same Record as above, but allow properties to be appended:

[4]:

record_mol1_append = Record(name='mol1', append_property=True)
record_mol1_append.add_properties([total_energies, dispersion_energies, atomic_numbers, positions, smiles])

Now, if we add “total_energies” a second time, this will not raise an error; instead, it will append the energy to the existing array.

[5]:

record_mol1_append.add_property(total_energies)

If we print the record, we will see that the “total_energies” property now contains value = [[1], [1]] and reports n_configs = 2.

[6]:

print(record_mol1_append)

2025-05-28 16:08:14.963 | WARNING  | modelforge.curate.record:_validate_n_configs:265 - Number of configurations for properties in record mol1 are not consistent.
2025-05-28 16:08:14.965 | WARNING  | modelforge.curate.record:_validate_n_configs:269 -  - positions : 1
2025-05-28 16:08:14.966 | WARNING  | modelforge.curate.record:_validate_n_configs:271 -  - total_energies : 2
2025-05-28 16:08:14.967 | WARNING  | modelforge.curate.record:_validate_n_configs:271 -  - dispersion_energies : 1

name: mol1
* n_atoms: 2
* n_configs: cannot be determined, see warnings log
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
 -  name='total_energies' value=array([[1],
       [1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
 -  name='dispersion_energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None

Note that this produces several warnings because the number of configurations in the record is no longer consistent. Printing the record calls the validate function of the class, which can also be called directly:

[7]:

record_mol1_append.validate()

2025-05-28 16:08:16.184 | WARNING  | modelforge.curate.record:_validate_n_configs:265 - Number of configurations for properties in record mol1 are not consistent.
2025-05-28 16:08:16.185 | WARNING  | modelforge.curate.record:_validate_n_configs:269 -  - positions : 1
2025-05-28 16:08:16.187 | WARNING  | modelforge.curate.record:_validate_n_configs:271 -  - total_energies : 2
2025-05-28 16:08:16.188 | WARNING  | modelforge.curate.record:_validate_n_configs:271 -  - dispersion_energies : 1

[7]:

False

To resolve this, we can simply add “positions” and “dispersion_energies” a second time as well:

[8]:

record_mol1_append.add_properties([dispersion_energies, positions])

[9]:

print(record_mol1_append)

name: mol1
* n_atoms: 2
* n_configs: 2
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]],

       [[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
 -  name='total_energies' value=array([[1],
       [1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
 -  name='dispersion_energies' value=array([[0.1],
       [0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None
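The consistency check behind the earlier warnings can be pictured as comparing the leading dimension of every per-atom and per-system array. The following NumPy-only sketch illustrates that idea; count_configs and is_consistent are hypothetical helpers written for this example, not modelforge functions.

```python
import numpy as np


def count_configs(arrays):
    """Map each property name to the size of its leading (configuration) axis."""
    return {name: np.asarray(arr).shape[0] for name, arr in arrays.items()}


def is_consistent(arrays):
    """True if every array reports the same number of configurations."""
    return len(set(count_configs(arrays).values())) == 1


# State after appending only total_energies a second time.
arrays = {
    "positions": np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]),
    "total_energies": np.array([[1], [1]]),
    "dispersion_energies": np.array([[0.1]]),
}
before = is_consistent(arrays)

# Append the remaining properties a second time as well.
arrays["positions"] = np.vstack([arrays["positions"], arrays["positions"]])
arrays["dispersion_energies"] = np.vstack(
    [arrays["dispersion_energies"], arrays["dispersion_energies"]]
)
after = is_consistent(arrays)

print(before, after)
```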

When appending to an existing property, the code will first check to see if the shapes of the array are compatible. For example, if we try to add positions for a molecule with a different number of atoms, this will produce an error, as the shapes of the arrays are not compatible.

[10]:

positions2 = Positions(value= [[[1,1,1], [2,2,2], [3,3,3]]], units=unit.nanometer)

record_mol1_append.add_property(positions2)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[10], line 3
      1 positions2 = Positions(value= [[[1,1,1], [2,2,2], [3,3,3]]], units=unit.nanometer)
----> 3 record_mol1_append.add_property(positions2)

File ~/PycharmProjects/modelforge/modelforge-curate/modelforge/curate/record.py:385, in Record.add_property(self, property)
    380     raise ValueError(error_msg)
    381 # if the property already exists, we will use vstack to add it to the existing array
    382 # after first checking that the dimensions are consistent
    383 # note we do not check shape[0], as that corresponds to the number of configurations
    384 assert (
--> 385     self.per_atom[property.name].value.shape[1]
    386     == property.value.shape[1]
    387 ), f"{self.name}: n_atoms of {property.name} does not: {property.value.shape[1]} != {self.per_atom[property.name].value.shape[1]}."
    388 assert (
    389     self.per_atom[property.name].value.shape[2]
    390     == property.value.shape[2]
    391 )
    392 # In order to append to the array, everything needs to have the same units
    393 # We will use the units of the first property that was added

AssertionError: mol1: n_atoms of positions does not: 3 != 2.
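The comments in the traceback indicate that existing and new arrays are stacked with vstack after the dimensions other than shape[0] are compared. A minimal sketch of that pattern, using only NumPy, is shown below. The helper append_configs is hypothetical and raises a ValueError rather than using assert, so it is not a copy of the library code.

```python
import numpy as np


def append_configs(existing, new):
    """Stack new configurations onto existing ones (hypothetical helper).

    shape[0] is the number of configurations and may differ;
    all remaining dimensions must match.
    """
    existing = np.asarray(existing)
    new = np.asarray(new)
    if existing.shape[1:] != new.shape[1:]:
        raise ValueError(
            f"incompatible shapes: {new.shape[1:]} != {existing.shape[1:]}"
        )
    return np.vstack([existing, new])


positions = np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]])  # 1 config, 2 atoms
stacked = append_configs(positions, positions)              # 2 configs, 2 atoms

three_atoms = [[[1, 1, 1], [2, 2, 2], [3, 3, 3]]]           # 1 config, 3 atoms
try:
    append_configs(positions, three_atoms)
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print(err)
```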

The units are also compared and converted if necessary before appending. For example, we defined energy in units of hartree above; if we define energy in a different unit and append, it will automatically be converted to hartrees.

[11]:

total_energies2 = Energies(
    name="total_energies",
    value=np.array([[1]]),
    units=unit.kilocalories_per_mole
)
record_mol1_append.add_property(total_energies2)

print(record_mol1_append)

2025-05-28 16:08:20.506 | WARNING  | modelforge.curate.record:_validate_n_configs:265 - Number of configurations for properties in record mol1 are not consistent.
2025-05-28 16:08:20.508 | WARNING  | modelforge.curate.record:_validate_n_configs:269 -  - positions : 2
2025-05-28 16:08:20.509 | WARNING  | modelforge.curate.record:_validate_n_configs:271 -  - total_energies : 3
2025-05-28 16:08:20.509 | WARNING  | modelforge.curate.record:_validate_n_configs:271 -  - dispersion_energies : 2

name: mol1
* n_atoms: 2
* n_configs: cannot be determined, see warnings log
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]],

       [[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
 -  name='total_energies' value=array([[1.       ],
       [1.       ],
       [0.0015936]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=3 n_atoms=None
 -  name='dispersion_energies' value=array([[0.1],
       [0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None
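The converted value in the output above can be checked by hand. The sketch below uses plain Python arithmetic and an approximate conversion factor of 627.5095 kcal/mol per hartree. The constant and the helper name are assumptions made for this example; the library relies on openff.units for the actual conversion.

```python
# Approximate conversion factor (assumption for this example).
KCAL_PER_MOL_PER_HARTREE = 627.5095


def kcal_per_mol_to_hartree(value):
    """Convert an energy from kcal/mol to hartree (hypothetical helper)."""
    return value / KCAL_PER_MOL_PER_HARTREE


converted = kcal_per_mol_to_hartree(1.0)
print(f"{converted:.7f}")  # prints 0.0015936
```

This matches the third entry of “total_energies” in the printed record, which is stored in hartree because that was the unit of the property when it was first added.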

Adding properties directly to a dataset

Rather than creating an instance of the Record class and adding this to the dataset, we can use the SourceDataset class directly. The functions in SourceDataset effectively just provide wrappers to the functions that exist within the Record class. As such, both approaches are equivalent but one may be more convenient depending on the structure of the original dataset that is being curated.

The following code performs the same task in both ways. First, we define the common elements (i.e., the dataset and the properties):

[12]:

# define the dataset
new_dataset = SourceDataset('test_dataset')

# define the properties
atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))
positions = Positions(
    value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]),
    units="nanometer"
)

total_energies = Energies(
    name="total_energies",
    value=np.array([[1]]),
    units=unit.hartree
)

2025-05-28 16:08:21.307 | WARNING  | modelforge.curate.sourcedataset:__init__:66 - Database file test_dataset.sqlite already exists in ./. Removing it.

Approach 1: Create a Record, add properties to the Record, add Record to the dataset

[13]:

record_mol1 = Record("mol1")
record_mol1.add_properties([atomic_numbers, positions, total_energies])

new_dataset.add_record(record_mol1)

Approach 2: Create a Record within the dataset, add properties to this record within the dataset

[14]:

new_dataset.create_record('mol2')
new_dataset.add_properties("mol2", [atomic_numbers, positions, total_energies])

The dataset can also be instantiated with append_property set to True; the wrapper function within the dataset provides the same functionality as when interacting directly with a record.

[15]:

appendable_dataset = SourceDataset(name="appendable", append_property=True)
