modelforge.curate : properties
This notebook will focus on a more thorough examination of defining properties.
[1]:
from modelforge.curate import Record, SourceDataset
from modelforge.utils.units import GlobalUnitSystem
from modelforge.curate.properties import AtomicNumbers, Positions, Energies, Forces, MetaData
from openff.units import unit
import numpy as np
Properties
Each property inherits from the PropertyBaseClass pydantic model and has the following fields:
name: str : unique identifier for the property
value: ndarray : array containing the values (note, the MetaData property allows this to be set to a str, int, float, or list in addition to a numpy array)
units: unit.Unit : OpenFF.units
classification: PropertyClassification enum : specifies if the property is “atomic_numbers”, “per_atom”, “per_system”, or “meta_data”
property_type: PropertyType enum : specifies the type of property (e.g., length, energy, force, etc.) used for validating the specified units
classification and property_type are inherent to the property and do not need to be modified when a property is instantiated.
While a default value is set for the name field of each property (e.g., “energies” for the Energies property), this value should typically be set at the time of instantiation to a unique and appropriate key. Setting the name field is essential for records that contain multiple entries of the same type, such as several energies (e.g., total_energy, dispersion_energy, electronic_energy, etc.).
The following demonstrates defining a record with properties “atomic_numbers”, “positions”, “total_energies”, “dispersion_energies”, and “smiles”:
[2]:
atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))
positions = Positions(
value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]),
units="nanometer"
)
total_energies = Energies(
name="total_energies",
value=np.array([[1]]),
units=unit.hartree
)
dispersion_energies = Energies(
name="dispersion_energies",
value=np.array([[0.1]]),
units=unit.hartree
)
smiles = MetaData(name='smiles', value='[CH]')
record_mol1 = Record(name='mol1')
record_mol1.add_properties([total_energies, dispersion_energies, atomic_numbers, positions, smiles])
print(record_mol1)
name: mol1
* n_atoms: 2
* n_configs: 1
* atomic_numbers:
- name='atomic_numbers' value=array([[1],
[6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
- name='positions' value=array([[[1., 1., 1.],
[2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
- name='total_energies' value=array([[1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
- name='dispersion_energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: (['smiles'])
- name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None
As noted in the “basic_usage.ipynb” notebook, the name field is used as a unique key. An error will be raised if we try to add a property with the same key twice. E.g., the following will raise an error as we have already set the “total_energies”.
[3]:
record_mol1.add_property(total_energies)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[3], line 1
----> 1 record_mol1.add_property(total_energies)
File ~/PycharmProjects/modelforge/modelforge-curate/modelforge/curate/record.py:433, in Record.add_property(self, property)
429 error_msg = f"Property with name {property.name} already exists in the record {self.name}."
430 error_msg += (
431 f"Set append_property=True to append to the existing property."
432 )
--> 433 raise ValueError(error_msg)
435 assert (
436 self.per_system[property.name].value.shape[1]
437 == property.value.shape[1]
438 )
439 temp_array = property.value
ValueError: Property with name total_energies already exists in the record mol1.Set append_property=True to append to the existing property.
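The behavior of the name field as a unique key can be pictured with a small standalone toy. This is only a sketch of the idea, not modelforge's implementation: `ToyRecord` is a hypothetical class that stores plain lists in a dictionary, and it mirrors only the duplicate-name check and the append_property switch mentioned in the error message.

```python
class ToyRecord:
    """Toy stand-in for a record that keys properties by name."""

    def __init__(self, append_property=False):
        self.append_property = append_property
        self.properties = {}

    def add_property(self, name, values):
        if name in self.properties:
            if not self.append_property:
                raise ValueError(f"Property with name {name} already exists.")
            # Appending extends the stored list of configurations.
            self.properties[name] = self.properties[name] + list(values)
        else:
            self.properties[name] = list(values)


strict = ToyRecord()
strict.add_property("total_energies", [1.0])
try:
    strict.add_property("total_energies", [1.0])
except ValueError as err:
    print(err)  # Property with name total_energies already exists.

appendable = ToyRecord(append_property=True)
appendable.add_property("total_energies", [1.0])
appendable.add_property("total_energies", [1.0])
print(appendable.properties["total_energies"])  # [1.0, 1.0]
```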
Appending properties
In some cases, we may not have data for all configurations available to use when instantiating a property. For example, the positions for different configurations may exist in different .xyz files. To handle these cases, the Record class can be instantiated with append_property set to True. In such cases, adding a property a second time will append the new data to the existing array.
For example, the following will initialize the same Record as above, but allow properties to be appended:
[4]:
record_mol1_append = Record(name='mol1', append_property=True)
record_mol1_append.add_properties([total_energies, dispersion_energies, atomic_numbers, positions, smiles])
Now, if we add “total_energies” a second time, this will not raise an error, rather it will append the energy to the existing array.
[5]:
record_mol1_append.add_property(total_energies)
If we print the record, we will see that the “total_energies” property now contains value = [[1], [1]] and reports n_configs = 2.
[6]:
print(record_mol1_append)
2025-05-28 16:08:14.963 | WARNING | modelforge.curate.record:_validate_n_configs:265 - Number of configurations for properties in record mol1 are not consistent.
2025-05-28 16:08:14.965 | WARNING | modelforge.curate.record:_validate_n_configs:269 - - positions : 1
2025-05-28 16:08:14.966 | WARNING | modelforge.curate.record:_validate_n_configs:271 - - total_energies : 2
2025-05-28 16:08:14.967 | WARNING | modelforge.curate.record:_validate_n_configs:271 - - dispersion_energies : 1
name: mol1
* n_atoms: 2
* n_configs: cannot be determined, see warnings log
* atomic_numbers:
- name='atomic_numbers' value=array([[1],
[6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
- name='positions' value=array([[[1., 1., 1.],
[2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
- name='total_energies' value=array([[1],
[1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
- name='dispersion_energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: (['smiles'])
- name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None
Note, this produces several warnings because the number of configurations in the record is no longer consistent (printing the record calls the validate function in the class). Calling validate directly reports the same warnings and returns False:
[7]:
record_mol1_append.validate()
2025-05-28 16:08:16.184 | WARNING | modelforge.curate.record:_validate_n_configs:265 - Number of configurations for properties in record mol1 are not consistent.
2025-05-28 16:08:16.185 | WARNING | modelforge.curate.record:_validate_n_configs:269 - - positions : 1
2025-05-28 16:08:16.187 | WARNING | modelforge.curate.record:_validate_n_configs:271 - - total_energies : 2
2025-05-28 16:08:16.188 | WARNING | modelforge.curate.record:_validate_n_configs:271 - - dispersion_energies : 1
[7]:
False
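The consistency check can be pictured as comparing the leading array dimension across all per-atom and per-system properties. The following is a minimal standalone sketch of that idea, not modelforge's implementation; the helper name `n_configs_consistent` and the plain dictionary of shapes are hypothetical stand-ins for the record's internal state.

```python
def n_configs_consistent(shapes):
    """Return True if every property has the same number of configurations.

    `shapes` maps a property name to its array shape; axis 0 is taken to be
    the number of configurations, as in the record printouts.
    """
    counts = {name: shape[0] for name, shape in shapes.items()}
    return len(set(counts.values())) <= 1


# Shapes matching the record after "total_energies" was appended once.
inconsistent = {
    "positions": (1, 2, 3),
    "total_energies": (2, 1),
    "dispersion_energies": (1, 1),
}
# Shapes once every property holds two configurations.
consistent = {
    "positions": (2, 2, 3),
    "total_energies": (2, 1),
    "dispersion_energies": (2, 1),
}

print(n_configs_consistent(inconsistent))  # False
print(n_configs_consistent(consistent))    # True
```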
To resolve this, we can simply add “positions” and “dispersion_energies” a second time as well:
[8]:
record_mol1_append.add_properties([dispersion_energies, positions])
[9]:
print(record_mol1_append)
name: mol1
* n_atoms: 2
* n_configs: 2
* atomic_numbers:
- name='atomic_numbers' value=array([[1],
[6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
- name='positions' value=array([[[1., 1., 1.],
[2., 2., 2.]],
[[1., 1., 1.],
[2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
- name='total_energies' value=array([[1],
[1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
- name='dispersion_energies' value=array([[0.1],
[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
* meta_data: (['smiles'])
- name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None
When appending to an existing property, the code first checks whether the shapes of the arrays are compatible. For example, if we try to add positions for a molecule with a different number of atoms, this will produce an error because the shapes of the arrays do not match.
[10]:
positions2 = Positions(value= [[[1,1,1], [2,2,2], [3,3,3]]], units=unit.nanometer)
record_mol1_append.add_property(positions2)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[10], line 3
1 positions2 = Positions(value= [[[1,1,1], [2,2,2], [3,3,3]]], units=unit.nanometer)
----> 3 record_mol1_append.add_property(positions2)
File ~/PycharmProjects/modelforge/modelforge-curate/modelforge/curate/record.py:385, in Record.add_property(self, property)
380 raise ValueError(error_msg)
381 # if the property already exists, we will use vstack to add it to the existing array
382 # after first checking that the dimensions are consistent
383 # note we do not check shape[0], as that corresponds to the number of configurations
384 assert (
--> 385 self.per_atom[property.name].value.shape[1]
386 == property.value.shape[1]
387 ), f"{self.name}: n_atoms of {property.name} does not: {property.value.shape[1]} != {self.per_atom[property.name].value.shape[1]}."
388 assert (
389 self.per_atom[property.name].value.shape[2]
390 == property.value.shape[2]
391 )
392 # In order to append to the array, everything needs to have the same units
393 # We will use the units of the first property that was added
AssertionError: mol1: n_atoms of positions does not: 3 != 2.
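The compatibility rule described in the traceback comments (compare every axis except axis 0, which counts configurations) can be sketched in isolation. This is a simplified illustration using plain shape tuples; `appended_shape` is a hypothetical helper, not part of modelforge, and it raises a ValueError where the record itself uses assertions.

```python
def appended_shape(existing, new):
    """Return the shape after stacking `new` below `existing` along axis 0.

    Raises ValueError if any axis other than axis 0 differs, since axis 0
    (the number of configurations) is the only one allowed to grow.
    """
    if len(existing) != len(new) or tuple(existing[1:]) != tuple(new[1:]):
        raise ValueError(f"incompatible shapes: {existing} vs {new}")
    return (existing[0] + new[0],) + tuple(existing[1:])


# Two stored configurations of a 2-atom system, plus one more configuration.
print(appended_shape((2, 2, 3), (1, 2, 3)))  # (3, 2, 3)

# A 3-atom system cannot be appended to a 2-atom record.
try:
    appended_shape((2, 2, 3), (1, 3, 3))
except ValueError as err:
    print(err)
```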
The units are also compared and converted if necessary before appending. For example, we defined energy in units of hartree above; if we define energy in a different unit and append, it will automatically be converted to hartrees.
[11]:
total_energies2 = Energies(
name="total_energies",
value=np.array([[1]]),
units=unit.kilocalories_per_mole
)
record_mol1_append.add_property(total_energies2)
print(record_mol1_append)
2025-05-28 16:08:20.506 | WARNING | modelforge.curate.record:_validate_n_configs:265 - Number of configurations for properties in record mol1 are not consistent.
2025-05-28 16:08:20.508 | WARNING | modelforge.curate.record:_validate_n_configs:269 - - positions : 2
2025-05-28 16:08:20.509 | WARNING | modelforge.curate.record:_validate_n_configs:271 - - total_energies : 3
2025-05-28 16:08:20.509 | WARNING | modelforge.curate.record:_validate_n_configs:271 - - dispersion_energies : 2
name: mol1
* n_atoms: 2
* n_configs: cannot be determined, see warnings log
* atomic_numbers:
- name='atomic_numbers' value=array([[1],
[6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
- name='positions' value=array([[[1., 1., 1.],
[2., 2., 2.]],
[[1., 1., 1.],
[2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
- name='total_energies' value=array([[1. ],
[1. ],
[0.0015936]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=3 n_atoms=None
- name='dispersion_energies' value=array([[0.1],
[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
* meta_data: (['smiles'])
- name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None
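The converted value in the printout can be checked by hand. The sketch below assumes the standard conversion factor of roughly 627.5095 kcal/mol per hartree; the constant and function names are illustrative only, and the record itself performs the conversion through its own unit handling.

```python
# Approximate conversion factor (assumed here): 1 hartree ≈ 627.5095 kcal/mol.
KCAL_PER_MOL_PER_HARTREE = 627.5095


def kcal_per_mol_to_hartree(value):
    """Convert an energy from kcal/mol to hartree."""
    return value / KCAL_PER_MOL_PER_HARTREE


converted = kcal_per_mol_to_hartree(1.0)
print(f"{converted:.7f}")  # 0.0015936
```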
Adding properties directly to a dataset
Rather than creating an instance of the Record class and adding this to the dataset, we can use the SourceDataset class directly. The functions in SourceDataset effectively just provide wrappers to the functions that exist within the Record class. As such, both approaches are equivalent but one may be more convenient depending on the structure of the original dataset that is being curated.
The following code performs the same task in two ways. First we will define the common elements (i.e., the dataset and the properties):
[12]:
# define the dataset
new_dataset = SourceDataset('test_dataset')
# define the properties
atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))
positions = Positions(
value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]),
units="nanometer"
)
total_energies = Energies(
name="total_energies",
value=np.array([[1]]),
units=unit.hartree
)
2025-05-28 16:08:21.307 | WARNING | modelforge.curate.sourcedataset:__init__:66 - Database file test_dataset.sqlite already exists in ./. Removing it.
Approach 1: Create a Record, add properties to the Record, add Record to the dataset
[13]:
record_mol1 = Record("mol1")
record_mol1.add_properties([atomic_numbers, positions, total_energies])
new_dataset.add_record(record_mol1)
Approach 2: Create a Record within the dataset, add properties to this record within the dataset
[14]:
new_dataset.create_record('mol2')
new_dataset.add_properties("mol2", [atomic_numbers, positions, total_energies])
The dataset can also be instantiated with append_property set to True; the wrapper function within the dataset provides the same functionality as when interacting directly with a record.
[15]:
appendable_dataset = SourceDataset(name="appendable", append_property=True)