Dataset Module

The dataset module in modelforge provides a suite of functions and classes designed to retrieve and transform quantum mechanics (QM) datasets into a format compatible with torch.utils.data.Dataset as well as the PyTorch Lightning LightningDataModule, facilitating the training of machine learning potentials. The module supports data storage, caching, retrieval, and the conversion of stored modelforge curated HDF5 files into PyTorch-compatible datasets for training purposes.

Modelforge currently provides a host of datasets containing a variety of molecular structures and properties. These datasets are curated into HDF5-formatted files designed to be compatible with modelforge and hosted on zenodo.org (see the zenodo modelforge community); the underlying HDF5Dataset class provides a framework to download, cache, and process these files into a format compatible with torch.utils.data.Dataset, as previously noted. Local datasets stored in modelforge-compatible HDF5 formats can also be used, allowing users to work with their own datasets without uploading them to a remote server or modifying the modelforge source. These can be specified by providing a configuration file, as described below.

Dataset Configuration TOML file

Dataset input configuration is typically managed using a TOML file. This configuration file is crucial during the training process as it provides values that need to be specified for the DataModule class, ensuring a flexible and customizable setup.

Below is a minimal example of a dataset configuration for the QM9 dataset.

QM9 Dataset Configuration

[dataset]
dataset_name = "QM9"
version_select = "nc_1000_v1.2"
num_workers = 4
pin_memory = true
properties_of_interest = ["atomic_numbers", "positions", "internal_energy_at_0K", "dipole_moment_per_system"]
element_filter = []
regression_ase = false


[dataset.properties_assignment]
atomic_numbers = "atomic_numbers"
positions = "positions"
E = "internal_energy_at_0K"

Warning

The version_select field in the example indicates the use of a small subset of the QM9 dataset. To utilize the full dataset, set this variable to latest.

Explanation of the possible fields in the dataset configuration file:

dataset_name: Specifies the name of the dataset. For this example, it is QM9.
version_select: Indicates the version of the dataset to use. In this example, it points to a small subset of the dataset for quick testing. To use the full QM9 dataset, set this variable to latest.
num_workers: Determines the number of workers used for data loading. Increasing the number of workers can speed up data loading but requires more memory. Must be 1 or greater.
pin_memory: A boolean flag indicating whether to pin memory for faster data transfer to the GPU. This is useful when training on a GPU and can improve performance by reducing data transfer times. Defaults to True.
properties_of_interest: Lists the properties of interest to load from the hdf5 file. This should include the properties that are relevant for training the model. The properties listed here must match those available in the dataset metadata; otherwise, a validation error will be raised. Loading properties that will not be used during training will use more memory.
properties_assignment: Maps the properties of interest to the corresponding fields in the dataset. This mapping is crucial for the correct loading of properties during training. Note that many datasets contain multiple properties that can be swapped for one another (e.g., energy calculated with or without dispersion corrections, different charge population schemes, different levels of theory, etc.). Any properties listed here must appear in the properties of interest list; the code will raise a validation error if this condition is not met. The possible fields to assign are defined by the PropertyNames class, which is listed below. Note that, by default, atomic_numbers, positions, and energy (E) must always be set.

class PropertyNames:
    atomic_numbers: str # per-atom atomic numbers (atomic numbers are integers)
    positions: str  # per-atom positions (cartesian coordinates)
    E: str  # per-system energy (total energy)
    F: Optional[str] = None  # per-atom forces
    total_charge: Optional[str] = None  # per-system total charge
    dipole_moment: Optional[str] = None  # per-system dipole moment
    spin_multiplicity: Optional[str] = None  # per-system spin multiplicity
    partial_charges: Optional[str] = None  # per-atom partial charges
    quadrupole_moment: Optional[str] = None  # per-system quadrupole moment

element_filter: A filter to select systems with or without certain elements, which are denoted by atomic numbers. A positive number means that a datapoint must contain that element to be included. A negative value indicates an element to exclude. For example, [[29]] selects all systems containing copper (29). [[29, -17]] selects all systems containing copper (29), but excludes from that list any that also contain chlorine (17). [[29, 1, -17]] selects all systems that contain copper (29) and hydrogen (1) and do not contain chlorine (17). Everything contained within the same brackets acts as an “and” (i.e., all criteria must be satisfied). Providing two separate sublists acts as an “or”. For example, [[29, 1], [78, -17]] states that a molecule can either have [copper (29) and hydrogen (1)] OR [platinum (78) and not chlorine (17)]. Leaving this field as an empty list or removing it disables element filtering.
regression_ase: A boolean flag indicating whether to calculate the atomic self-energies via regression or to use those provided by the dataset (if available). If set to False, the atomic self-energies provided in the dataset metadata will be used; if set to True, the self-energies will be calculated via regression. This is optional and defaults to False.
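The "and"/"or" semantics of element_filter described above can be sketched in a few lines of Python. This is a minimal illustration of the rules as stated in the text, not the modelforge implementation; the helper name passes_element_filter is hypothetical.

```python
from typing import Iterable, List


def passes_element_filter(
    atomic_numbers: Iterable[int], element_filter: List[List[int]]
) -> bool:
    """Return True if a system satisfies the filter (hypothetical helper)."""
    if not element_filter:
        # An empty filter disables filtering, so every system is kept.
        return True
    present = set(atomic_numbers)
    for sublist in element_filter:
        # Within one sublist every criterion must hold ("and"):
        # positive entries must be present, negative entries must be absent.
        required_ok = all(z in present for z in sublist if z > 0)
        excluded_ok = all(-z not in present for z in sublist if z < 0)
        if required_ok and excluded_ok:
            # Separate sublists are alternatives ("or").
            return True
    return False


# Systems are given as lists of atomic numbers.
copper_chloride = [29, 17]
copper_hydride = [29, 1]
platinum_hydride = [78, 1]

print(passes_element_filter(copper_chloride, [[29]]))
print(passes_element_filter(copper_chloride, [[29, -17]]))
print(passes_element_filter(copper_hydride, [[29, 1, -17]]))
print(passes_element_filter(platinum_hydride, [[29, 1], [78, -17]]))
```

The four calls mirror the four examples given in the element_filter description.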

Other fields that can be specified in the dataset configuration file include:

local_yaml_file: A path to a local dataset yaml file. This is Optional and defaults to None. If specified, it will be used to load the dataset metadata instead of the default metadata files provided by modelforge. This allows users to work with their own datasets without needing to upload them to a remote server or modifying the modelforge source.
dataset_cache_dir: Specifies the directory where the dataset files will be cached. Storing the dataset files locally avoids downloading them multiple times, and the cache can be shared between multiple training runs.
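The consistency rules stated for properties_of_interest and properties_assignment can be checked before training starts. The following is a rough sketch that works on a plain dictionary mirroring the [dataset] table; the function check_dataset_config and its error messages are illustrative assumptions, and modelforge performs its own validation.

```python
from typing import Dict, List

# Fields that the text says must always be assigned.
REQUIRED_ASSIGNMENTS = ("atomic_numbers", "positions", "E")


def check_dataset_config(
    dataset_cfg: Dict, available_properties: List[str]
) -> List[str]:
    """Collect problems found in a [dataset] table (hypothetical helper).

    An empty list means none of the three rules was violated.
    """
    problems = []
    of_interest = dataset_cfg.get("properties_of_interest", [])
    assignment = dataset_cfg.get("properties_assignment", {})

    # Rule 1: every property of interest must exist in the dataset metadata.
    for name in of_interest:
        if name not in available_properties:
            problems.append(f"'{name}' is not available in the dataset")
    # Rule 2: every assigned property must be listed in properties_of_interest.
    for field, name in assignment.items():
        if name not in of_interest:
            problems.append(
                f"'{field}' maps to '{name}', which is not in properties_of_interest"
            )
    # Rule 3: atomic_numbers, positions, and E must be assigned.
    for field in REQUIRED_ASSIGNMENTS:
        if field not in assignment:
            problems.append(f"required field '{field}' is not assigned")
    return problems


# Dictionary form of the QM9 example configuration.
qm9_cfg = {
    "dataset_name": "QM9",
    "properties_of_interest": [
        "atomic_numbers",
        "positions",
        "internal_energy_at_0K",
        "dipole_moment_per_system",
    ],
    "properties_assignment": {
        "atomic_numbers": "atomic_numbers",
        "positions": "positions",
        "E": "internal_energy_at_0K",
    },
}
# A subset of the properties listed in the QM9 metadata.
qm9_available = [
    "atomic_numbers",
    "positions",
    "partial_charges",
    "dipole_moment_per_system",
    "internal_energy_at_0K",
]

print(check_dataset_config(qm9_cfg, qm9_available))
```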

Processing of dataset entries

Several other operations are commonly performed on the dataset as part of training machine-learned potentials:

Removing Self-Energies: Self-energies are per-element offsets subtracted from the total energy of a system. Removing these offsets provides cleaner training data (e.g., MAE values of energy are closer to the scale of the energy itself).
Shifting the Energies: The energies can be shifted by a constant value, potentially improving the stability and speed of training. The shift can be set to the minimum, maximum, or mean of the training dataset energies. Minimum shifting subtracts the smallest value, making all values non-negative; maximum shifting makes all values non-positive; mean shifting centers the energies around zero.
Splitting the Dataset: The dataset is split into training, validation, and test sets. This is crucial for evaluating the performance of the machine learning model and ensuring that it generalizes well to unseen data. Various schemes can be used to specify this.
Shifting the center of mass: The center of mass of the system can be shifted to the origin to enable calculation of the dipole moment.
Normalization and Scaling: Normalize the energies and other properties to ensure they are on a comparable scale, which can improve the stability and performance of the machine learning model. Note that this is done when atomic energies are predicted, i.e. the atomic energy (E_i) is scaled using the atomic energy distribution obtained from the training dataset: E_i = E_i_stddev * E_i_pred + E_i_mean.

However, note that these operations are not defined within the dataset configuration; these are specified in the training (self-energy, splitting, shifting COM) and potential (normalization) configuration TOML files.
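The arithmetic behind self-energy removal, energy shifting, and the rescaling formula E_i = E_i_stddev * E_i_pred + E_i_mean is simple enough to sketch directly. The functions below are illustrative stand-ins written in plain Python, not modelforge functions, and the self-energy values are the fictional ones from the metadata example (in kJ/mol).

```python
from statistics import mean
from typing import Dict, List, Sequence


def remove_self_energies(
    total_energy: float, atomic_numbers: Sequence[int], self_energies: Dict[int, float]
) -> float:
    """Subtract the per-element offset of every atom from the total energy."""
    return total_energy - sum(self_energies[z] for z in atomic_numbers)


def shift_energies(energies: Sequence[float], mode: str = "mean") -> List[float]:
    """Shift all energies by the min, max, or mean of the training energies."""
    offset = {"min": min, "max": max, "mean": mean}[mode](energies)
    return [e - offset for e in energies]


def rescale_atomic_energy(e_i_pred: float, e_i_mean: float, e_i_stddev: float) -> float:
    """Map a normalized per-atom prediction back to the energy scale."""
    return e_i_stddev * e_i_pred + e_i_mean


# Fictional self-energies keyed by atomic number (H = 1, C = 6).
self_energies = {1: -1400.0, 6: -10000.0}
methane = [6, 1, 1, 1, 1]
residual = remove_self_energies(-17000.0, methane, self_energies)

training_energies = [-3.0, -1.0, -2.0]
shifted_min = shift_energies(training_energies, "min")
shifted_max = shift_energies(training_energies, "max")
shifted_mean = shift_energies(training_energies, "mean")

print(residual, shifted_min, shifted_max, shifted_mean)
print(rescale_atomic_energy(0.5, e_i_mean=-10.0, e_i_stddev=2.0))
```

In this toy case the residual energy is an order of magnitude smaller than the total energy, which is the effect the self-energy removal step aims for.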

Interacting with the Dataset Module

Here, we provide a brief overview of the DataModule class. Note that users will typically interact with this portion of the code indirectly via the TOML configuration files. The DataModule class handles preparing and setting up datasets for training and is designed to integrate seamlessly with PyTorch Lightning, providing a user-friendly interface for dataset preparation and loading.

The following example demonstrates how to use the DataModule class to prepare and set up a dataset for training, where the similarity to the TOML configuration file should be evident.

from modelforge.dataset import DataModule
from modelforge.dataset.utils import RandomRecordSplittingStrategy

dataset_name = "QM9"
splitting_strategy = RandomRecordSplittingStrategy() # split randomly on system level
batch_size = 64
version_select = "latest"
remove_self_energies = True # remove the atomic self energies
regression_ase = False      # use the atomic self energies provided by the dataset

data_module = DataModule(
    name=dataset_name,
    properties_of_interest=["atomic_numbers", "positions", "internal_energy_at_0K"],
    properties_assignment={
        "E": "internal_energy_at_0K",
        "atomic_numbers": "atomic_numbers",
        "positions": "positions",
    },
    splitting_strategy=splitting_strategy,
    batch_size=batch_size,
    version_select=version_select,
    remove_self_energies=remove_self_energies,
    regression_ase=regression_ase,
    local_cache_dir="~/modelforge_run",
    dataset_cache_dir="~/modelforge_hdf5_files",
)

# Prepare the data (downloads, processes, and caches if necessary)
data_module.prepare_data()

# Setup the data for training, validation, and testing
data_module.setup()
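The comment "split randomly on system level" in the example above refers to splitting by record rather than by individual configuration, so that all configurations of one system end up in the same subset. The sketch below illustrates that idea in plain Python. It is not the RandomRecordSplittingStrategy implementation; the function name split_by_record, the 80/10/10 fractions, and the seed are assumptions made for illustration.

```python
import random


def split_by_record(record_ids, fractions=(0.8, 0.1, 0.1), seed=42):
    """Assign configuration indices to splits so no record straddles two splits."""
    unique_records = sorted(set(record_ids))
    # Shuffle the records (not the configurations) with a fixed seed.
    random.Random(seed).shuffle(unique_records)

    n_train = int(fractions[0] * len(unique_records))
    n_val = int(fractions[1] * len(unique_records))
    members = {
        "train": set(unique_records[:n_train]),
        "val": set(unique_records[n_train:n_train + n_val]),
        "test": set(unique_records[n_train + n_val:]),
    }
    # Every configuration follows the split of the record it belongs to.
    return {
        name: [i for i, rec in enumerate(record_ids) if rec in recs]
        for name, recs in members.items()
    }


# Ten hypothetical records with three configurations each.
record_ids = [f"record_{r}" for r in range(10) for _ in range(3)]
splits = split_by_record(record_ids)
print({name: len(indices) for name, indices in splits.items()})
```

Splitting at the record level avoids placing nearly identical configurations of the same molecule in both the training and test sets, which would make test errors look better than they are.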

yaml Metadata File Structure

The HDF5Dataset class is designed to provide a generic class for loading modelforge-compatible HDF5 files. It relies on reading a YAML file that provides essential information about a given dataset, including the available versions, properties, and other relevant details, along with the download URL used to fetch the dataset. These YAML metadata files are stored in the ~modelforge/dataset/yaml_files directory for the datasets provided by modelforge.

Below is a fictional example of a metadata YAML file that demonstrates the key fields, including the dataset name, version, description, atomic self-energies, and available properties.

dataset: fictional_dataset_name
latest: full_dataset_v1.1 # an alias for the latest version of the full dataset
latest_test: nc_1000_v1.1 # an alias for the latest version of the 1000 configuration test dataset

description: "A description of the dataset."

atomic_self_energies:
  H: -1400.0 * kilojoule_per_mole
  C: -10000.0 * kilojoule_per_mole

full_dataset_v1.1:
  about: "This provides a curated hdf5 file for the fictional dataset designed to be compatible
    with modelforge. This dataset contains 1234 unique records for 123456 total
    configurations."
  hdf5_schema: 2 # This specifies which modelforge HDF5 schema the version uses.
  available_properties: # list of properties keys available in the dataset
  - atomic_numbers
  - positions
  - dft_energy
  remote_dataset:
    doi: 10.1234/fictional_dataset.v1.1 # The DOI for the zenodo record of the dataset
    url: https://zenodo.org/records/record_id/files/fictional_dataset_v1.1.hdf5.gz # The URL to download the gzipped HDF5 file
    gz_data_file:
      file_name: fictional_dataset_v1.1.hdf5.gz # The name of the gzipped file that will be saved locally
      length: 123456 # Length of the gzipped file in bytes, used for the progress bar
      md5: gzip_checksum_value # The MD5 checksum of the gzipped file, used to verify the integrity of the downloaded file
    hdf5_data_file:
      file_name: fictional_dataset_v1.1.hdf5 # The name of the HDF5 file that will be saved locally after unzipping
      md5: hdf5_checksum_value # The MD5 checksum of the HDF5 file, used to verify the integrity of the downloaded file

Note that HDF5 data files hosted on zenodo.org are stored as gzipped files to save space and bandwidth when downloading.
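The two md5 fields in the metadata suggest a two-stage integrity check: once on the downloaded gzipped file and once on the decompressed HDF5 file. A minimal sketch of that flow, using only the Python standard library, is given below. The function names are hypothetical and a small stand-in payload replaces a real HDF5 file; modelforge's own download and caching code may differ.

```python
import gzip
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unpack_verified(gz_path, gz_md5, hdf5_path, hdf5_md5):
    """Check the gzipped file, decompress it, then check the result."""
    if md5_of_file(gz_path) != gz_md5:
        raise ValueError(f"checksum mismatch for {gz_path}")
    with gzip.open(gz_path, "rb") as src, open(hdf5_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if md5_of_file(hdf5_path) != hdf5_md5:
        raise ValueError(f"checksum mismatch for {hdf5_path}")
    return Path(hdf5_path)


# Demonstration with a stand-in payload instead of a real HDF5 file.
payload = b"stand-in for HDF5 bytes"
workdir = Path(tempfile.mkdtemp())
gz_file = workdir / "example.hdf5.gz"
with gzip.open(gz_file, "wb") as handle:
    handle.write(payload)

unpacked = unpack_verified(
    gz_file,
    md5_of_file(gz_file),              # plays the role of gz_data_file.md5
    workdir / "example.hdf5",
    hashlib.md5(payload).hexdigest(),  # plays the role of hdf5_data_file.md5
)
print(unpacked.read_bytes() == payload)
```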

To specify metadata for a local dataset, the remote_dataset field can be omitted and replaced with the field local_dataset.

Available Datasets and Versions

Below is a description of the curated datasets currently available for modelforge and their corresponding metadata yaml files. These files can be found in the ~modelforge/dataset/yaml_files directory. The YAML files provide detailed information about each dataset, including the versions, properties, self energies and download URLs. As previously mentioned, for each dataset, multiple versions may be available. A 1000 configuration test dataset is provided for each dataset primarily useful for testing; several datasets also provide various subsets (e.g., limited to a subset of elements).
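Because each metadata file stores the latest and latest_test aliases alongside the concrete version entries, resolving the version_select value from the configuration amounts to a dictionary lookup. The sketch below assumes the YAML file has already been parsed into a dictionary; resolve_version is a hypothetical helper, not a modelforge function.

```python
def resolve_version(metadata: dict, version_select: str) -> str:
    """Turn an alias such as 'latest' into a concrete version key."""
    if version_select in ("latest", "latest_test"):
        # Aliases are top-level keys whose value names a concrete version.
        version_select = metadata[version_select]
    entry = metadata.get(version_select)
    if not isinstance(entry, dict):
        raise KeyError(f"unknown dataset version: {version_select}")
    return version_select


# Dictionary form of the fictional metadata example, trimmed to a few keys.
fictional_metadata = {
    "dataset": "fictional_dataset_name",
    "latest": "full_dataset_v1.1",
    "latest_test": "nc_1000_v1.1",
    "full_dataset_v1.1": {"hdf5_schema": 2},
    "nc_1000_v1.1": {"hdf5_schema": 2},
}

print(resolve_version(fictional_metadata, "latest"))
print(resolve_version(fictional_metadata, "nc_1000_v1.1"))
```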

The dataset names used to specify the dataset in modelforge are provided in parentheses:

ANI1x (ani1x): The ANI1x dataset includes ~5 million density functional theory calculations for small organic molecules containing H, C, N, and O. A subset of ~500k are computed with accurate coupled cluster methods.

ANI-1x dataset:
Smith, J. S.; Nebgen, B.; Lubbers, N.; Isayev, O.; Roitberg, A. E. Less Is More: Sampling Chemical Space with Active Learning. J. Chem. Phys. 2018, 148 (24), 241733. https://doi.org/10.1063/1.5023802

ANI-1ccx dataset:
Smith, J. S.; Nebgen, B. T.; Zubatyuk, R.; Lubbers, N.; Devereux, C.; Barros, K.; Tretiak, S.; Isayev, O.; Roitberg, A. E. Approaching Coupled Cluster Accuracy with a General-Purpose Neural Network Potential through Transfer Learning. Nat. Commun. 2019, 10 (1), 2903. https://doi.org/10.1038/s41467-019-10827-4

ωB97x/def2-TZVPP data:
Zubatyuk, R.; Smith, J. S.; Leszczynski, J.; Isayev, O. Accurate and Transferable Multitask Prediction of Chemical Properties with an Atoms-in-Molecules Neural Network. Sci. Adv. 2019, 5 (8), eaav6490. https://doi.org/10.1126/sciadv.aav6490

ANI1x Dataset yaml Metadata

dataset: ani1x
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1
description: "ANI1x dataset includes ~5 million density functional theory calculations
        for small organic molecules containing H, C, N, and O.
        A subset of ~500k are computed with accurate coupled cluster methods.

        References:

        ANI-1x dataset:
        Smith, J. S.; Nebgen, B.; Lubbers, N.; Isayev, O.; Roitberg, A. E.
        Less Is More: Sampling Chemical Space with Active Learning.
        J. Chem. Phys. 2018, 148 (24), 241733.
        https://doi.org/10.1063/1.5023802
        https://arxiv.org/abs/1801.09319

        ANI-1ccx dataset:
        Smith, J. S.; Nebgen, B. T.; Zubatyuk, R.; Lubbers, N.; Devereux, C.; Barros, K.; Tretiak, S.; Isayev, O.; Roitberg, A. E.
        Approaching Coupled Cluster Accuracy with a General-Purpose Neural Network Potential through Transfer Learning.
        Nat. Commun. 2019, 10 (1), 2903.
        https://doi.org/10.1038/s41467-019-10827-4

        wB97x/def2-TZVPP data:
        Zubatyuk, R.; Smith, J. S.; Leszczynski, J.; Isayev, O.
        Accurate and Transferable Multitask Prediction of Chemical Properties with an Atoms-in-Molecules Neural Network.
        Sci. Adv. 2019, 5 (8), eaav6490.
        https://doi.org/10.1126/sciadv.aav6490"

atomic_self_energies:
  H: -0.5978583943827134 * hartree
  C: -38.08933878049795 * hartree
  N: -54.711968298621066 * hartree
  O: -75.19106774742086 * hartree

full_dataset_v1.1:
  about: "This provides a curated hdf5 file for the ANI-1x dataset designed to be compatible
    with modelforge. This dataset contains 3114 unique records for 4956005 total configurations.
    Note, individual configurations are partitioned into entries based on the array
    of atomic species appearing in sequence in the source data file."
  available_properties:
  - atomic_numbers
  - positions
  - wb97x_dz_energy
  - wb97x_tz_energy
  - ccsd(t)_cbs_energy
  - hf_dz_energy
  - hf_tz_energy
  - hf_qz_energy
  - npno_ccsd(t)_dz_corr_energy
  - npno_ccsd(t)_tz_corr_energy
  - tpno_ccsd(t)_dz_corr_energy
  - mp2_dz_corr_energy
  - mp2_tz_corr_energy
  - mp2_qz_corr_energy
  - wb97x_dz_forces
  - wb97x_tz_forces
  - wb97x_dz_dipole
  - wb97x_tz_dipole
  - wb97x_dz_quadrupole
  - wb97x_dz_cm5_charges
  - wb97x_dz_hirshfeld_charges
  - wb97x_tz_mbis_charges
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15447970
    url: https://zenodo.org/records/15447970/files/ani1x_dataset_v1.1.hdf5.gz
    gz_data_file:
      file_name: ani1x_dataset_v1.1.hdf5.gz
      length: 3514221240
      md5: 0a93b1da5b36298cba7d6b14f7f65ded
    hdf5_data_file:
      file_name: ani1x_dataset_v1.1.hdf5
      md5: b973e519602d24eb4a288e135875ea7e

nc_1000_v1.1:
  about: "This provides a curated hdf5 file for a subset of the ANI-1x dataset designed
    to be compatible with modelforge. This dataset contains 135 unique records for
    1000 total configurations, with a maximum of 10 configurations per record. Note,
    individual configurations are partitioned into entries based on the array of atomic
    species appearing in sequence in the source data file."
  available_properties:
    - atomic_numbers
    - positions
    - wb97x_dz_energy
    - wb97x_tz_energy
    - ccsd(t)_cbs_energy
    - hf_dz_energy
    - hf_tz_energy
    - hf_qz_energy
    - npno_ccsd(t)_dz_corr_energy
    - npno_ccsd(t)_tz_corr_energy
    - tpno_ccsd(t)_dz_corr_energy
    - mp2_dz_corr_energy
    - mp2_tz_corr_energy
    - mp2_qz_corr_energy
    - wb97x_dz_forces
    - wb97x_tz_forces
    - wb97x_dz_dipole
    - wb97x_tz_dipole
    - wb97x_dz_quadrupole
    - wb97x_dz_cm5_charges
    - wb97x_dz_hirshfeld_charges
    - wb97x_tz_mbis_charges
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15447763
    url: https://zenodo.org/records/15447763/files/ani1x_dataset_v1.1_ntc_1000.hdf5.gz
    gz_data_file:
      file_name: ani1x_dataset_v1.1_ntc_1000.hdf5.gz
      length: 1426717
      md5: 4808bdbd49ae3cf7c2049bff439aaa8b
    hdf5_data_file:
      file_name: ani1x_dataset_v1.1_ntc_1000.hdf5
      md5: ac1bc889f45c09b6971f3b56428b61ca

ANI2X (ani2x): The ANI-2x dataset includes properties for small organic molecules that contain H, C, N, O, S, F, and Cl. This dataset contains 9651712 conformers for nearly 200,000 molecules. This will fetch data generated with the ωB97X/6-31G(d) level of theory used in the original ANI-2x paper, calculated using Gaussian 09.

Devereux, C., Zubatyuk, R., Smith, J. et al. Extending the applicability of the ANI deep learning molecular potential to sulfur and halogens. Journal of Chemical Theory and Computation 16.7 (2020): 4192-4202. https://doi.org/10.1021/acs.jctc.0c00121

Fe II (fe_ii): The Fe(II) dataset includes 28834 total configurations of 384 unique Fe(II) organometallic complexes. Specifically, this includes 15568 high-spin (HS) geometries and 13266 low-spin (LS) geometries. These complexes originate from the Cambridge Structural Database (CSD) as curated by Nandy, et al. (Journal of Physical Chemistry Letters (2023), 14 (25), 10.1021/acs.jpclett.3c01214), and were filtered into “computation-ready” complexes (those where both oxidation states and charges are already specified without hydrogen atoms missing in the structures), following the procedure outlined by Arunachalam, et al. (Journal of Chemical Physics (2022), 157 (18), 10.1063/5.0125700).

Hongni Jin and Kenneth M. Merz Jr, Modeling Fe(II) Complexes Using Neural Networks. Journal of Chemical Theory and Computation 2024 20 (6), 2551-2558 https://dx.doi.org/10.1021/acs.jctc.4c00063

Fe II Dataset yaml Metadata

dataset: fe_ii
latest: full_version_v1.1
latest_test: nc_1000_v1.1

description: "This dataset contains 384 unique systems with a total of 28,834 configurations
    (note, the original publication states 383 unique systems).

    The full Fe(II) dataset includes 28834 total configurations of Fe(II) organometallic complexes.
    Specifically, this includes 15568 HS geometries and 13266 LS geometries.
    These complexes originate from the Cambridge Structural Database (CSD) as curated by Nandy, et al.
    (Journal of Physical Chemistry Letters (2023), 14 (25), 10.1021/acs.jpclett.3c01214),
    and were filtered into “computation-ready” complexes, (those where both oxidation states and charges are
    already specified without hydrogen atoms missing in the structures), following the procedure outlined by
    Arunachalam, et al. (Journal of Chemical Physics (2022), 157 (18), 10.1063/5.0125700)


    Citation to the original dataset:

        Modeling Fe(II) Complexes Using Neural Networks
        Hongni Jin and Kenneth M. Merz Jr.
        Journal of Chemical Theory and Computation 2024 20 (6), 2551-2558
        DOI: 10.1021/acs.jctc.4c00063 
    "
atomic_self_energies:
  H: -257.8658772400123 * kilojoule_per_mole
  C: -897.1371901363243 * kilojoule_per_mole
  N: -683.3438581909822 * kilojoule_per_mole
  O: -707.3905177027947 * kilojoule_per_mole
  P: -445.4451443983543 * kilojoule_per_mole
  S: -367.7922055565044 * kilojoule_per_mole
  Cl: -227.0568137730898 * kilojoule_per_mole
  Fe: 224.48679425562852 * kilojoule_per_mole

nc_1000_v1.1:
  hdf5_schema: 2
  available_properties:
    - atomic_numbers
    - positions
    - total_charge
    - forces
    - energies
    - spin_multiplicities
  about: "This provides a modelforge curated hdf5 file for the Fe (II) dataset.
          This dataset contains 102 unique systems with a total of 1000 configurations 
          (max of 10 configurations per system). "
  remote_dataset:
    doi: 10.5281/zenodo.15264766
    url: https://zenodo.org/records/15264766/files/fe_II_ntc_1000_v1.1.hdf5.gz
    gz_data_file:
      length: 1425316
      md5: 5337732f01cc99fac8c500c1df7a4b39
      file_name: Fe_II_dataset_nc1000_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 824a03eb589b4bf46d07d12fbfab507d
      file_name: Fe_II_dataset_nc1000_v1.1.hdf5

full_version_v1.1:
  hdf5_schema: 2
  available_properties:
    - atomic_numbers
    - positions
    - total_charge
    - forces
    - energies
    - spin_multiplicities
  remote_dataset:
    doi: 10.5281/zenodo.15264721
    url: https://zenodo.org/records/15264721/files/fe_II_v1.1.hdf5.gz
    gz_data_file:
      length: 39631216
      md5: 55bc8488a1e115712b0c48a740ad73f1
      file_name: Fe_II_dataset_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 7569fe3c7f8acdef5dc3f6340af51d35
      file_name: Fe_II_dataset_v1.1.hdf5

PhAlkEthOH (PhAlkEthOH): Phenyls, Alkanes, Ethers, and Alcohols (OH). The PhAlkEthOH dataset contains a collection of optimization trajectories of linear and cyclic molecules containing phenyl rings, small alkanes, ethers, and alcohols, composed only of the elements carbon, oxygen, and hydrogen. For each unique system, configurations correspond to snapshots from the optimization trajectory. All QM datapoints were generated using the B3LYP-D3BJ/DZVP level of theory, the default theory used for force field development by the Open Force Field Initiative.

Bannan CC, Mobley D. ChemPer: An Open Source Tool for Automatically Generating SMIRKS Patterns. ChemRxiv. 2019; https://dx.doi.org/10.26434/chemrxiv.8304578.v1

Wang Y, Fass J, Kaminow B, Herr JE, Rufa D, Zhang I, Pulido I, Henry M, Macdonald HE, Takaba K, Chodera JD. End-to-end differentiable construction of molecular mechanics force fields. Chemical Science. 2022;13(41):12016-33. https://dx.doi.org/10.1039/d2sc02739a

PhAlkEthOH Dataset yaml Metadata

dataset: PhAlkEthOH
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1

description: "PhAlkEthOH: Phenyls, Alkanes, Ethers, and Alcohols (OH) 
            The PhAlkEthOH dataset contains a collection of optimized trajectories of linear and cyclic molecules 
            containing phenyl rings, small alkanes, ethers, and alcohols containing only elements carbon, oxygen and hydrogen.  
            For each unique system, configurations correspond to snapshots from the optimization trajectory.  
            
            All QM datapoints were generated using B3LYP-D3BJ/DZVP level of theory, the default theory used for force field 
            development by the Open Force Field Initiative.
            
            The dataset was retrieved from The MolSSI qcarchive.  
            
            Related manuscripts:
          
            Bannan CC, Mobley D. 
            ChemPer: An Open Source Tool for Automatically Generating SMIRKS Patterns. 
            ChemRxiv. 2019; doi:10.26434/chemrxiv.8304578.v1
            
            Wang Y, Fass J, Kaminow B, Herr JE, Rufa D, Zhang I, Pulido I, Henry M, Macdonald HE, Takaba K, Chodera JD. 
            End-to-end differentiable construction of molecular mechanics force fields. 
            Chemical Science. 2022;13(41):12016-33. doi:10.1039/d2sc02739a
          
            Repository used for generating and submitting the dataset via MolSSI qcfractal:
            
            Gokey, T., 
            OpenFF Sandbox CHO PhAlkEthOH v1.0, 2020, 
            https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2020-09-18-OpenFF-Sandbox-CHO-PhAlkEthOH
            "
atomic_self_energies:
  H: -1596.6973305434612*kilojoule_per_mole
  C: -100059.79872980758*kilojoule_per_mole
  O: -197491.36594960644*kilojoule_per_mole

full_dataset_v1.1:
  about: 'This provides a curated hdf5 file for the PhAlkEthOH dataset designed to
    be compatible with modelforge. This dataset contains 10301 unique records for
    1188691 total configurations.  This excludes any configurations where the magnitude of any forces 
    on the atoms are greater than 1 hartree/bohr.'

  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dispersion_correction_energy
  - dft_total_energy
  - dispersion_correction_gradient
  - dispersion_correction_force
  - dft_total_gradient
  - dft_total_force
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15398204
    url: https://zenodo.org/records/15398204/files/PhAlkEthOH_openff_dataset_v1.1.hdf5.gz
    gz_data_file:
      file_name: PhAlkEthOH_openff_dataset_v1.1.hdf5.gz
      length: 5445897672
      md5: 5bd91d2533581478d35b1c32472c22a7
    hdf5_data_file:
      file_name: PhAlkEthOH_openff_dataset_v1.1.hdf5
      md5: 643a6ff5387088d7cb0c70b5ba39a027

nc_1000_v1.1:
  about: 'This provides a curated hdf5 file for a subset of the PhAlkEthOH dataset
    designed to be compatible with modelforge. This dataset contains 101 unique records
    for 1000 total configurations, with a maximum of 10 configurations per record. 
    This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dispersion_correction_energy
  - dft_total_energy
  - dispersion_correction_gradient
  - dispersion_correction_force
  - dft_total_gradient
  - dft_total_force
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15417002
    url: https://zenodo.org/records/15417002/files/PhAlkEthOH_openff_dataset_v1.1_ntc_1000.hdf5.gz
    gz_data_file:
      file_name: PhAlkEthOH_openff_dataset_v1.1_ntc_1000.hdf5.gz
      length: 4053133
      md5: d80e9ea3318dfeb3a40f2d614ca62dec
    hdf5_data_file:
      file_name: PhAlkEthOH_openff_dataset_v1.1_ntc_1000.hdf5
      md5: 1e9160bfba6e8bf2c4b9677a7992a000

nc_1000_minimal_v1.1:
  about: 'This provides a curated hdf5 file for a subset of the PhAlkEthOH dataset
    designed to be compatible with modelforge. This dataset contains 1000 unique records
    for 1000 total configurations, with only the final configuration of the optimization.
    This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.'

  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dispersion_correction_energy
  - dft_total_energy
  - dispersion_correction_gradient
  - dispersion_correction_force
  - dft_total_gradient
  - dft_total_force
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15419953
    url: https://zenodo.org/records/15419953/files/PhAlkEthOH_openff_dataset_v1.1_ntc_1000_minimal.hdf5.gz
    gz_data_file:
      file_name: PhAlkEthOH_openff_dataset_v1.1_ntc_1000_minimal.hdf5.gz
      length: 5277672
      md5: 03ac17b57f163baaf996a707ac281eb1
    hdf5_data_file:
      file_name: PhAlkEthOH_openff_dataset_v1.1_ntc_1000_minimal.hdf5
      md5: 612036eca20f9bb715cafd4a897be8d0

full_dataset_minimal_v1.1:
  about: 'This provides a curated hdf5 file for the PhAlkEthOH dataset designed to
    be compatible with modelforge. This dataset contains 10301 unique records for
    10301 total configurations, with only the final configuration of the optimization.
    This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.'

  available_properties:
    - atomic_numbers
    - positions
    - total_charge
    - dispersion_correction_energy
    - dft_total_energy
    - dispersion_correction_gradient
    - dispersion_correction_force
    - dft_total_gradient
    - dft_total_force
    - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15418579
    url: https://zenodo.org/records/15418579/files/PhAlkEthOH_openff_dataset_v1.1_minimal.hdf5.gz
    gz_data_file:
      file_name: PhAlkEthOH_openff_dataset_v1.1_minimal.hdf5.gz
      length: 38338045
      md5: 2b7784da858c566b93d032a1faa00ad3
    hdf5_data_file:
      file_name: PhAlkEthOH_openff_dataset_v1.1_minimal.hdf5
      md5: f35154b3114824cc3d882e9a53436c80

QM9 (qm9): A dataset of 134k small organic molecules, each containing up to 9 heavy atoms (C, O, N, F) and up to 29 atoms in total. It includes properties such as energies, partial charges, and dipole moments.

Ramakrishnan, R., Dral, P., Rupp, M. et al. ‘Quantum chemistry structures and properties of 134 kilo molecules.’ Sci Data 1, 140022 (2014). https://doi.org/10.1038/sdata.2014.22

QM9 Dataset yaml Metadata

dataset: qm9
latest: full_dataset_v1.2
latest_test: nc_1000_v1.2

description: "The QM9 dataset includes 133,885 organic molecules with up to nine total heavy atoms (C,O,N,or F; excluding H).
              All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry.
              
              Citation: 
              
              Ramakrishnan, R., Dral, P., Rupp, M. et al.
              'Quantum chemistry structures and properties of 134 kilo molecules.'
              Sci Data 1, 140022 (2014).
              https://doi.org/10.1038/sdata.2014.22
              
              DOI for dataset: 10.6084/m9.figshare.c.978904.v5"

atomic_self_energies:
  H: -1313.4668615546 * kilojoule_per_mole
  C: -99366.70745535441 * kilojoule_per_mole
  N: -143309.9379722722 * kilojoule_per_mole
  O: -197082.0671774158 * kilojoule_per_mole
  F: -261811.54555874597 * kilojoule_per_mole


full_dataset_v1.2:
  about: "This provides a curated hdf5 file for the qm9 dataset designed to be compatible
    with modelforge. This dataset contains 133885 unique records for 133885 total
    configurations. Note, the dataset contains only a single configuration per record.
    This includes some minor corrections to the v1.1 dataset. Note, the dipole_moment_per_system
    and dipole_moment_scalar_per_system properties are calculated from the partial_charges"
  hdf5_schema: 2
  available_properties:
  - atomic_numbers
  - positions
  - partial_charges
  - polarizability
  - dipole_moment_per_system
  - dipole_moment_scalar_per_system
  - energy_of_homo
  - lumo-homo_gap
  - zero_point_vibrational_energy
  - internal_energy_at_298.15K
  - internal_energy_at_0K
  - enthalpy_at_298.15K
  - free_energy_at_298.15K
  - heat_capacity_at_298.15K
  - rotational_constants
  - harmonic_vibrational_frequencies
  - electronic_spatial_extent
  remote_dataset:
    doi: 10.5281/zenodo.17536462
    url: https://zenodo.org/records/17536462/files/qm9_dataset_v1.2.hdf5.gz
    gz_data_file:
      file_name: qm9_dataset_v1.2.hdf5.gz
      length: 301536746
      md5: b53d5b83f1f24d7c6aa80612b9bd16dd
    hdf5_data_file:
      file_name: qm9_dataset_v1.2.hdf5
      md5: 60ab35fed8d9a99be059cefb39f1f4b4

full_dataset_v1.1:
  about: "This provides a curated hdf5 file for the qm9 dataset designed to be compatible
    with modelforge. This dataset contains 133885 unique records for 133885 total
    configurations. Note, the dataset contains only a single configuration per record."
  hdf5_schema: 2
  available_properties:
  - atomic_numbers
  - positions
  - partial_charges
  - polarizability
  - dipole_moment_per_system
  - dipole_moment_scalar_per_system
  - energy_of_homo
  - lumo-homo_gap
  - zero_point_vibrational_energy
  - internal_energy_at_298.15K
  - internal_energy_at_0K
  - enthalpy_at_298.15K
  - free_energy_at_298.15K
  - heat_capacity_at_298.15K
  - rotational_constants
  - harmonic_vibrational_frequencies
  - electronic_spatial_extent
  remote_dataset:
    doi: 10.5281/zenodo.15390655
    url: https://zenodo.org/records/15390655/files/qm9_dataset_v1.1.hdf5.gz
    gz_data_file:
      file_name: qm9_dataset_v1.1.hdf5.gz
      length: 301537815
      md5: 62d17d98d8143ac34f88bf1300b686c6
    hdf5_data_file:
      file_name: qm9_dataset_v1.1.hdf5
      md5: 04e4c86d59374912849c64e899894719

nc_1000_v1.2:
  about: "This provides a curated hdf5 file for a subset of the qm9 dataset designed
    to be compatible with modelforge. This dataset contains 1000 unique records for
    1000 total configurations. Note, the dataset contains only a single configuration
    per record."
  hdf5_schema: 2
  available_properties:
  - atomic_numbers
  - positions
  - partial_charges
  - polarizability
  - dipole_moment_per_system
  - dipole_moment_scalar_per_system
  - energy_of_homo
  - lumo-homo_gap
  - zero_point_vibrational_energy
  - internal_energy_at_298.15K
  - internal_energy_at_0K
  - enthalpy_at_298.15K
  - free_energy_at_298.15K
  - heat_capacity_at_298.15K
  - rotational_constants
  - harmonic_vibrational_frequencies
  - electronic_spatial_extent
  remote_dataset:
    doi: 10.5281/zenodo.17536526
    url: https://zenodo.org/records/17536526/files/qm9_dataset_v1.2_ntc_1000.hdf5.gz
    gz_data_file:
      file_name: qm9_dataset_v1.2_ntc_1000.hdf5.gz
      length: 1923749
      md5: a6cf9528b4f2db977b96f7a441ba557c
    hdf5_data_file:
      file_name: qm9_dataset_v1.2_ntc_1000.hdf5
      md5: befb3ef66d74f436ef399bf68eda9b90

nc_1000_v1.1:
  about: "This provides a curated hdf5 file for a subset of the qm9 dataset designed
    to be compatible with modelforge. This dataset contains 1000 unique records for
    1000 total configurations. Note, the dataset contains only a single configuration
    per record."
  hdf5_schema: 2
  available_properties:
  - atomic_numbers
  - positions
  - partial_charges
  - polarizability
  - dipole_moment_per_system
  - dipole_moment_scalar_per_system
  - energy_of_homo
  - lumo-homo_gap
  - zero_point_vibrational_energy
  - internal_energy_at_298.15K
  - internal_energy_at_0K
  - enthalpy_at_298.15K
  - free_energy_at_298.15K
  - heat_capacity_at_298.15K
  - rotational_constants
  - harmonic_vibrational_frequencies
  - electronic_spatial_extent
  remote_dataset:
    doi: 10.5281/zenodo.15390593
    url: https://zenodo.org/records/15390593/files/qm9_dataset_v1.1_ntc_1000.hdf5.gz
    gz_data_file:
      file_name: qm9_dataset_v1.1_ntc_1000.hdf5.gz
      length: 1923749
      md5: 54a2471bba075fcc2cdfe0b78bc567fa
    hdf5_data_file:
      file_name: qm9_dataset_v1.1_ntc_1000.hdf5
      md5: befb3ef66d74f436ef399bf68eda9b90
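
The length and md5 fields listed under gz_data_file can be used to check a downloaded archive before it is unpacked. The following is a minimal sketch using only the Python standard library. The helper names md5_of_file and matches_metadata are hypothetical and are not part of the modelforge API, and the sketch assumes that length is the file size in bytes.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_metadata(path, expected_md5, expected_length=None):
    """Check a local file against the md5 (and optional length) from the metadata.

    Assumption: `length` is the size of the file in bytes.
    """
    # The size check is cheap, so do it before hashing the whole file.
    if expected_length is not None and os.path.getsize(path) != expected_length:
        return False
    return md5_of_file(path) == expected_md5
```

For the nc_1000_v1.2 entry above, a call would look like matches_metadata("qm9_dataset_v1.2_ntc_1000.hdf5.gz", "a6cf9528b4f2db977b96f7a441ba557c", 1923749), where the local path is wherever the archive was saved.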

SPICE 1 (spice1): The SPICE dataset contains 1.1 million conformations for 19238 unique small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements (H, Li, C, N, O, F, Na, Mg, P, S, Cl, K, Ca, Br, I), charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory using Psi4 1.4.1, along with other useful quantities such as multipole moments and bond orders.

Eastman, P., Behara, P.K., Dotson, D.L. et al. SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials. Sci Data 10, 11 (2023). https://doi.org/10.1038/s41597-022-01882-6

SPICE 1 Dataset yaml Metadata

dataset: spice1
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1

description: "The SPICE dataset contains 1.1 million conformations for a diverse set of small molecules,
    dimers, dipeptides, and solvated amino acids. It includes 15 elements (H, Li, C, N, O, F, Na, Mg, P, S, Cl, K, Ca, Br, I), 
    charged and uncharged molecules, and a wide range of covalent and non-covalent interactions.
    It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory,
    using Psi4 1.4.1 along with other useful quantities such as multipole moments and bond orders.

    Reference:
    Eastman, P., Behara, P.K., Dotson, D.L. et al. SPICE,
    A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials.
    Sci Data 10, 11 (2023). https://doi.org/10.1038/s41597-022-01882-6

    Dataset DOI:
    https://doi.org/10.5281/zenodo.8222043"

atomic_self_energies:
  H: -1576.5513678678228*kilojoule_per_mole
  Li: -19221.76009670645*kilojoule_per_mole
  C: -100114.38959681295*kilojoule_per_mole
  N: -143829.94579288512*kilojoule_per_mole
  O: -197627.70305727186*kilojoule_per_mole
  F: -262291.3177197502*kilojoule_per_mole
  Na: -425714.1444283384*kilojoule_per_mole
  Mg: -523447.29044746497*kilojoule_per_mole
  P: -896460.9044578229*kilojoule_per_mole
  S: -1045607.5830439369*kilojoule_per_mole
  Cl: -1208414.168327362*kilojoule_per_mole
  K: -1574847.955709633*kilojoule_per_mole
  Ca: -1777543.2887296947*kilojoule_per_mole
  Br: -6758454.442850963*kilojoule_per_mole
  I: -781842.6578771132*kilojoule_per_mole

full_dataset_v1.1:
  about: 'This provides a curated hdf5 file for the SPICE1 dataset designed to be
    compatible with modelforge. This dataset contains 19238 unique records for 1110165
    total configurations. '
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_total_energy
  - dft_total_force
  - formation_energy
  - mbis_charges
  - mbis_dipoles
  - mbis_quadrupoles
  - mbis_octupoles
  - scf_dipole
  - scf_quadrupole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15455029
    gz_data_file:
      file_name: spice_1_dataset_v1.1.hdf5.gz
      length: 10843511030
      md5: da9f4902d64fe957e1f3dd0a6a2463d0
    hdf5_data_file:
      file_name: spice_1_dataset_v1.1.hdf5
      md5: ff7583aeafd7991bc7c04505a5c0ee02
    url: https://zenodo.org/records/15455029/files/spice_1_dataset_v1.1.hdf5.gz

nc_1000_v1.1:
  about: 'This provides a curated hdf5 file for the SPICE1 dataset designed to be
    compatible with modelforge. This dataset contains 100 unique records for 1000
    total configurations, with a maximum of 10 configurations per record. '
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_total_energy
  - dft_total_force
  - formation_energy
  - mbis_charges
  - mbis_dipoles
  - mbis_quadrupoles
  - mbis_octupoles
  - scf_dipole
  - scf_quadrupole
  hdf5_schema: 2
  remote_dataset:
    url: https://zenodo.org/records/15461060/files/spice_1_dataset_v1.1_ntc_1000.hdf5.gz
    doi: 10.5281/zenodo.15461060
    gz_data_file:
      file_name: spice_1_dataset_v1.1_ntc_1000.hdf5.gz
      length: 14716353
      md5: 4d2dc9fd7b498f5f6bba6e8f4a5ffcfc
    hdf5_data_file:
      file_name: spice_1_dataset_v1.1_ntc_1000.hdf5
      md5: 193f675e419bdf92883e1c4606b240c5

nc_1000_HCNOFClS_v1.1:
  about: 'This provides a curated hdf5 file for a subset of the SPICE1 dataset designed
    to be compatible with modelforge. This dataset contains 100 unique records for
    1000 total configurations, with a maximum of 10 configurations per record. The
    dataset is limited to the elements that are compatible with ANI2x: [H, C,
    N, O, F, Cl, S]'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_total_energy
  - dft_total_force
  - formation_energy
  - mbis_charges
  - mbis_dipoles
  - mbis_quadrupoles
  - mbis_octupoles
  - scf_dipole
  - scf_quadrupole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15461463
    gz_data_file:
      file_name: spice_1_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz
      length: 14328739
      md5: 5a7e76f17694b1eb7f6368230589b586
    hdf5_data_file:
      file_name: spice_1_dataset_v1.1_ntc_1000_HCNOFClS.hdf5
      md5: 3864fe361fdaa178a9582b8e6afff5c4
    url: https://zenodo.org/records/15461463/files/spice_1_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz

full_dataset_HCNOFClS_v1.1:
  about: 'This provides a curated hdf5 file for the SPICE1 dataset designed to be
    compatible with modelforge. This dataset contains 16565 unique records for 976408
    total configurations. The dataset is limited to the elements that are compatible
    with ANI2x: [H, C, N, O, F, Cl, S].'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_total_energy
  - dft_total_force
  - formation_energy
  - mbis_charges
  - mbis_dipoles
  - mbis_quadrupoles
  - mbis_octupoles
  - scf_dipole
  - scf_quadrupole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15461488
    gz_data_file:
      file_name: spice_1_dataset_v1.1_HCNOFClS.hdf5.gz
      length: 9712216033
      md5: 961a71318f1bf6ef7e5d042d7aa4fc1c
    hdf5_data_file:
      file_name: spice_1_dataset_v1.1_HCNOFClS.hdf5
      md5: 28e1c57d6552be297e2accb3c35172c0
    url: https://zenodo.org/records/15461488/files/spice_1_dataset_v1.1_HCNOFClS.hdf5.gz
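
Each atomic_self_energies entry stores a value and a unit name in a single string, written with or without spaces around the `*`. As a rough illustration, the sketch below splits such strings and sums the self-energies for one molecule. The helpers parse_self_energy and total_self_energy are hypothetical and are not modelforge functions, and the unit name is kept as a plain string rather than converted into a unit object.

```python
def parse_self_energy(entry):
    """Split a 'value*unit' string into (float value, unit name)."""
    value, unit = entry.split("*", 1)
    # float() tolerates surrounding whitespace, so 'value * unit' also works.
    return float(value), unit.strip()


def total_self_energy(atomic_self_energies, element_counts):
    """Sum per-element self-energies for a molecule given its element counts.

    Assumption: all entries share one unit, which is returned unchanged.
    """
    total = 0.0
    units = set()
    for element, count in element_counts.items():
        value, unit = parse_self_energy(atomic_self_energies[element])
        total += count * value
        units.add(unit)
    if len(units) > 1:
        raise ValueError(f"mixed units: {sorted(units)}")
    return total, (units.pop() if units else None)


# Values copied from the spice1 metadata above.
spice1_self_energies = {
    "H": "-1576.5513678678228*kilojoule_per_mole",
    "O": "-197627.70305727186*kilojoule_per_mole",
}

# Water: two hydrogens and one oxygen.
energy, unit = total_self_energy(spice1_self_energies, {"H": 2, "O": 1})
```

An element that is missing from the mapping raises a KeyError, which is a simple way to notice that a molecule falls outside the elements covered by the dataset.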

SPICE 1 OpenFF (spice1_openff): The full SPICE 1 OpenFF dataset is a subset of the SPICE 1 dataset, and includes 18782 unique records for 1106949 total configurations for 14 different elements (H, Li, C, N, O, F, Na, Mg, P, S, Cl, K, Ca, Br). All QM datapoints were generated using the B3LYP-D3BJ/DZVP level of theory, as this is the default theory used for force field development by the Open Force Field Initiative; the original SPICE 1 dataset was generated using the ωB97M-D3(BJ)/def2-TZVPPD level of theory.

SPICE 1 OpenFF Dataset yaml Metadata

dataset: spice1_openff
latest: full_dataset_v2.1
latest_test: nc_1000_v2.1

description: "Small-molecule/Protein Interaction Chemical Energies (SPICE), calculated at the default OpenFF level of theory.
  The SPICE dataset contains conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids
  For 14 different elements (H, Li, C, N, O, F, Na, Mg, P, S, Cl, K, Ca, Br) in both charged and uncharged molecules. 
  Note, the original SPICE 1  dataset also includes Iodine (I), but systems with this element are not included in this
  dataset, as a small subset of the SPICE1 dataset was not included in the OpenFF dataset, namely: 
  
      -SPICE Ion Pairs Single Points Dataset v1.1
      -SPICE DES370K Single Points Dataset Supplement v1.0
  
  and a subset of calculations were not able to be converged fully. 
  
  The full SPICE 1 OpenFF dataset includes 18782 unique records for 1106949 total configurations while the original 
  SPICE 1 dataset includes 19238 unique records for 1110165 configurations 
  (both excluding those with forces > 1 hartree/bohr).
  
  All QM datapoints retrieved were generated using B3LYP-D3BJ/DZVP level of theory as this is the default theory used 
  for force field development by the Open Force Field Initiative; the original SPICE 1 dataset
  was generated using ωB97M-D3(BJ)/def2-TZVPPD level of theory.

  Reference to the original SPICE 1 dataset publication:
    Eastman, P., Behara, P.K., Dotson, D.L. et al. SPICE,
    A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials.
    Sci Data 10, 11 (2023). https://doi.org/10.1038/s41597-022-01882-6

  DOI to original SPICE 1 dataset (not at the OpenFF level of theory):
  https://doi.org/10.5281/zenodo.8222043"

atomic_self_energies:
  H: -1581.5384137007973*kilojoule_per_mole
  Li: -19322.614940687432*kilojoule_per_mole
  C: -100058.83756708907*kilojoule_per_mole
  N: -143747.52575867812*kilojoule_per_mole
  O: -197522.95360021706*kilojoule_per_mole
  F: -262187.61306363455*kilojoule_per_mole
  Na: -425595.8497308719*kilojoule_per_mole
  Mg: -523296.52031790506*kilojoule_per_mole
  P: -896206.0276563794*kilojoule_per_mole
  S: -1045356.0997863387*kilojoule_per_mole
  Cl: -1208153.2961282134*kilojoule_per_mole
  K: -1574540.1515238197*kilojoule_per_mole
  Ca: -1777205.6941588672*kilojoule_per_mole
  Br: -6757224.691339369*kilojoule_per_mole


full_dataset_v2.1:
  about: 'This provides a curated hdf5 file for the SPICE1 openff dataset designed
    to be compatible with modelforge. This dataset contains 18782 unique records for
    1106949 total configurations. 
    This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_energy
  - dispersion_correction_energy
  - dft_total_energy
  - dft_force
  - dispersion_correction_force
  - dft_total_force
  - mbis_charges
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15475919
    gz_data_file:
      file_name: spice_1_openff_dataset_v2.1.hdf5.gz
      length: 3373088790
      md5: 25bc8d0bdf77a6667a26964a09e082c7
    hdf5_data_file:
      file_name: spice_1_openff_dataset_v2.1.hdf5
      md5: 65ded4727fb49a1fba6f7224e5cf43ec
    url: https://zenodo.org/records/15475919/files/spice_1_openff_dataset_v2.1.hdf5.gz

nc_1000_v2.1:
  about: 'This provides a curated hdf5 file for the SPICE1 openff dataset designed
    to be compatible with modelforge. This dataset contains 100 unique records for
    1000 total configurations, with a maximum of 10 configurations per record.
    This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_energy
  - dispersion_correction_energy
  - dft_total_energy
  - dft_force
  - dispersion_correction_force
  - dft_total_force
  - mbis_charges
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15448194
    gz_data_file:
      file_name: spice_1_openff_dataset_v2.1_ntc_1000.hdf5.gz
      length: 5770925
      md5: 8b2729a28aa947576e485566926498bb
    hdf5_data_file:
      file_name: spice_1_openff_dataset_v2.1_ntc_1000.hdf5
      md5: 6187fcc7ff5d95e6608beecb09de9e77
    url: https://zenodo.org/records/15448194/files/spice_1_openff_dataset_v2.1_ntc_1000.hdf5.gz

nc_1000_HCNOFClS_v2.1:
  about: 'This provides a curated hdf5 file for a subset of the SPICE1 openff dataset
    designed to be compatible with modelforge. This dataset contains 100 unique records
    for 1000 total configurations, with a maximum of 10 configurations per record.
    The dataset is limited to the elements that are compatible with ANI2x NNP: [H,
    C, N, O, F, Cl, S]. This excludes any configurations where the magnitude of any forces 
    on the atoms are greater than 1 hartree/bohr.'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_energy
  - dispersion_correction_energy
  - dft_total_energy
  - dft_force
  - dispersion_correction_force
  - dft_total_force
  - mbis_charges
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15476628
    gz_data_file:
      file_name: spice_1_openff_dataset_v2.1_ntc_1000_HCNOFClS.hdf5.gz
      length: 5770934
      md5: 7d9f626252e30dd6902da62b58505799
    hdf5_data_file:
      file_name: spice_1_openff_dataset_v2.1_ntc_1000_HCNOFClS.hdf5
      md5: 6187fcc7ff5d95e6608beecb09de9e77
    url: https://zenodo.org/records/15476628/files/spice_1_openff_dataset_v2.1_ntc_1000_HCNOFClS.hdf5.gz

full_dataset_HCNOFClS_v2.1:
  about: 'This provides a curated hdf5 file for the SPICE1 openff dataset designed
    to be compatible with modelforge. This dataset contains 16560 unique records for
    996941 total configurations. The dataset is limited to the elements that are compatible with ANI2x NNP: 
    [H, C, N, O, F, Cl, S]. This excludes any configurations where the magnitude of any forces 
    on the atoms are greater than 1 hartree/bohr.'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_energy
  - dispersion_correction_energy
  - dft_total_energy
  - dft_force
  - dispersion_correction_force
  - dft_total_force
  - mbis_charges
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15476646
    gz_data_file:
      file_name: spice_1_openff_dataset_v2.1_HCNOFClS.hdf5.gz
      length: 3056052148
      md5: 172fa98f0abdaaf1a9b64812dc70cd81
    hdf5_data_file:
      file_name: spice_1_openff_dataset_v2.1_HCNOFClS.hdf5
      md5: 5d8a9e7b005f14627eea779630446430
    url: https://zenodo.org/records/15476646/files/spice_1_openff_dataset_v2.1_HCNOFClS.hdf5.gz
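
Every metadata file names two of its versions through the top-level latest and latest_test keys. The sketch below shows one way such an alias could be resolved to the matching version entry once the YAML has been loaded into a dictionary. The function resolve_version is hypothetical, and whether modelforge accepts these aliases directly in version_select is not stated here, so treat the lookup rule as an assumption.

```python
ALIASES = ("latest", "latest_test")


def resolve_version(metadata, version_select):
    """Return (version_name, entry) for a version name or an alias.

    Assumption: an alias is resolved by reading the top-level key of the same name.
    """
    name = metadata[version_select] if version_select in ALIASES else version_select
    entry = metadata.get(name)
    # Version entries are mappings with a remote_dataset block; anything else
    # (for example the `dataset` or `description` keys) is not a version.
    if not isinstance(entry, dict) or "remote_dataset" not in entry:
        raise KeyError(f"unknown dataset version: {name!r}")
    return name, entry


# Abridged copy of the spice1_openff metadata above.
metadata = {
    "dataset": "spice1_openff",
    "latest": "full_dataset_v2.1",
    "latest_test": "nc_1000_v2.1",
    "nc_1000_v2.1": {
        "hdf5_schema": 2,
        "remote_dataset": {
            "doi": "10.5281/zenodo.15448194",
            "url": "https://zenodo.org/records/15448194/files/"
            "spice_1_openff_dataset_v2.1_ntc_1000.hdf5.gz",
        },
    },
}

name, entry = resolve_version(metadata, "latest_test")
```

Because the dictionary here is abridged, asking for "latest" raises a KeyError; with the complete metadata it would return the full_dataset_v2.1 entry.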

SPICE 2 (spice2): The SPICE2 dataset contains conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 17 elements (H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br, I), charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory, using Psi4 along with other useful quantities such as multipole moments and bond orders.

Eastman, P., Pritchard, B. P., Chodera, J. D., & Markland, T. E. Nutmeg and SPICE: models and data for biomolecular machine learning. Journal of chemical theory and computation, 20(19), 8583-8593 (2024). https://doi.org/10.1021/acs.jctc.4c00794

SPICE 2 Dataset yaml Metadata

dataset: spice2
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1

description: "The SPICE2 dataset contains conformations for a diverse set of small molecules,
    dimers, dipeptides, and solvated amino acids. It includes 17 elements
    (H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br, I), charged and
    uncharged molecules, and a wide range of covalent and non-covalent interactions.
    It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory,
    using Psi4 along with other useful quantities such as multipole moments and bond orders.
    
    SPICE 2.0.1 zenodo release:
    https://zenodo.org/records/10835749
    
    SPICE 2 github repository:
    https://github.com/openmm/spice-dataset

    Reference to SPICE 2 publication:
    Eastman, P., Pritchard, B. P., Chodera, J. D., & Markland, T. E. 
    Nutmeg and SPICE: models and data for biomolecular machine learning. 
    Journal of chemical theory and computation, 20(19), 8583-8593 (2024).
    https://doi.org/10.1021/acs.jctc.4c00794
    
    Reference to original SPICE publication:
    Eastman, P., Behara, P.K., Dotson, D.L. et al. 
    SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials.
    Sci Data 10, 11 (2023). https://doi.org/10.1038/s41597-022-01882-6
    "

atomic_self_energies:
  H: -1579.292447611522*kilojoule_per_mole
  Li: -19206.917958817387*kilojoule_per_mole
  B: -65569.16994509766*kilojoule_per_mole
  C: -100112.36246928677*kilojoule_per_mole
  N: -143837.09401820396*kilojoule_per_mole
  O: -197640.6741826767*kilojoule_per_mole
  F: -262292.74039535556*kilojoule_per_mole
  Na: -425700.2376810506*kilojoule_per_mole
  Mg: -523428.7340381498*kilojoule_per_mole
  Si: -760410.2547546176*kilojoule_per_mole
  P: -896470.875967255*kilojoule_per_mole
  S: -1045588.2785379887*kilojoule_per_mole
  Cl: -1208421.9382829564*kilojoule_per_mole
  K: -1574833.7856186344*kilojoule_per_mole
  Ca: -1777525.2197214952*kilojoule_per_mole
  Br: -6758450.475943951*kilojoule_per_mole
  I: -781827.0797396759*kilojoule_per_mole

full_dataset_v1.1:
  about: This provides a curated hdf5 file for the SPICE2 dataset designed to be compatible
    with modelforge. This dataset contains 113985 unique records for 2008126 total
    configurations.
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_total_energy
  - dft_total_force
  - formation_energy
  - mbis_charges
  - mbis_dipoles
  - mbis_quadrupoles
  - mbis_octupoles
  - scf_dipole
  - scf_quadrupole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15419983
    url: https://zenodo.org/records/15419983/files/spice_2_dataset_v1.1.hdf5.gz
    gz_data_file:
      file_name: spice_2_dataset_v1.1.hdf5.gz
      length: 25532396063
      md5: 5ccaba21944da5e3f86f19389767b48f
    hdf5_data_file:
      file_name: spice_2_dataset_v1.1.hdf5
      md5: e2628a66e1f8f428151bcdbbfb1c41a7

nc_1000_v1.1:
  about: This provides a curated hdf5 file for a subset of the SPICE2 dataset designed
    to be compatible with modelforge. This dataset contains 403 unique records for
    1000 total configurations, with a maximum of 10 configurations per record.
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_total_energy
  - dft_total_force
  - formation_energy
  - mbis_charges
  - mbis_dipoles
  - mbis_quadrupoles
  - mbis_octupoles
  - scf_dipole
  - scf_quadrupole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15421048
    url: https://zenodo.org/records/15421048/files/spice_2_dataset_v1.1_ntc_1000.hdf5.gz
    gz_data_file:
      file_name: spice_2_dataset_v1.1_ntc_1000.hdf5.gz
      length: 26204494
      md5: 6b62d3410bba634bd0709ef750c78ca4
    hdf5_data_file:
      file_name: spice_2_dataset_v1.1_ntc_1000.hdf5
      md5: bcba4cdbfb9225306a6bfed01c28b364

full_dataset_HCNOFClS_v1.1:
  about: 'This provides a curated hdf5 file for the SPICE2 dataset designed to be
    compatible with modelforge. This dataset contains 97279 unique records for 1620239
    total configurations. The dataset is limited to the elements that are compatible
    with the ANI2x NNP architecture: [H, C, N, O, F, Cl, S].'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_total_energy
  - dft_total_force
  - formation_energy
  - mbis_charges
  - mbis_dipoles
  - mbis_quadrupoles
  - mbis_octupoles
  - scf_dipole
  - scf_quadrupole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15428145
    url: https://zenodo.org/records/15428145/files/spice_2_dataset_v1.1_HCNOFClS.hdf5.gz
    gz_data_file:
      file_name: spice_2_dataset_v1.1_HCNOFClS.hdf5.gz
      length: 21093537996
      md5: 37c512f923e8b7ba62b4a59740d190c7
    hdf5_data_file:
      file_name: spice_2_dataset_v1.1_HCNOFClS.hdf5
      md5: 178e4b6f1d812895ae23e3823f032ea9

nc_1000_HCNOFClS_v1.1:
  about: 'This provides a curated hdf5 file for a subset of the SPICE2 dataset designed
    to be compatible with modelforge. This dataset contains 374 unique records for
    1000 total configurations, with a maximum of 10 configurations per record. The
    dataset is limited to the elements that are compatible with ANI2x NNP architecture: 
    [H, C, N, O, F, Cl, S].'
  available_properties:
    - atomic_numbers
    - positions
    - total_charge
    - dft_total_energy
    - dft_total_force
    - formation_energy
    - mbis_charges
    - mbis_dipoles
    - mbis_quadrupoles
    - mbis_octupoles
    - scf_dipole
    - scf_quadrupole
  hdf5_schema: 2
  remote_dataset:
    url: https://zenodo.org/records/15429129/files/spice_2_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz
    doi: 10.5281/zenodo.15429129
    gz_data_file:
      file_name: spice_2_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz
      length: 26256317
      md5: 3e0d774ffe05c2f6653d4249bdf6de98
    hdf5_data_file:
      file_name: spice_2_dataset_v1.1_ntc_1000_HCNOFClS.hdf5
      md5: 1e9ef56d71de23223ed73d2e4db510f5

full_dataset_HCNOF_v1.1:
  about: This provides a curated hdf5 file for the SPICE2 dataset designed to be compatible
    with modelforge. This dataset contains 57037 unique records for 928073 total configurations.
    The dataset is limited to the elements ['H', 'C', 'N', 'O', 'F'].
  available_properties:
    - atomic_numbers
    - positions
    - total_charge
    - dft_total_energy
    - dft_total_force
    - formation_energy
    - mbis_charges
    - mbis_dipoles
    - mbis_quadrupoles
    - mbis_octupoles
    - scf_dipole
    - scf_quadrupole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15579090
    url: https://zenodo.org/records/15579090/files/spice_2_dataset_v1.1_HCNOF.hdf5.gz
    gz_data_file:
      file_name: spice_2_dataset_v1.1_HCNOF.hdf5.gz
      length: 11712091901
      md5: 56d109fdb026f8a1224dfff76abbd79a
    hdf5_data_file:
      file_name: spice_2_dataset_v1.1_HCNOF.hdf5
      md5: afa15240f07082517ec6155cc35093e7
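
The available_properties list differs between datasets, so the properties_of_interest requested in a dataset configuration need to exist in the chosen version. A minimal sketch of such a check is given below. The function missing_properties is a hypothetical helper and does not describe how modelforge itself validates a configuration.

```python
def missing_properties(properties_of_interest, available_properties):
    """Return the requested properties a dataset version does not provide, in request order."""
    available = set(available_properties)
    return [name for name in properties_of_interest if name not in available]


# available_properties copied from the spice2 entries above.
spice2_available = [
    "atomic_numbers",
    "positions",
    "total_charge",
    "dft_total_energy",
    "dft_total_force",
    "formation_energy",
    "mbis_charges",
    "mbis_dipoles",
    "mbis_quadrupoles",
    "mbis_octupoles",
    "scf_dipole",
    "scf_quadrupole",
]

# dft_energy is listed only for the OpenFF variants, so it is reported as missing here.
requested = ["atomic_numbers", "positions", "dft_energy", "dft_total_force"]
missing = missing_properties(requested, spice2_available)
```

An empty result means every requested property is present in that version.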

SPICE 2 OpenFF (spice2_openff): The SPICE 2 OpenFF dataset is a subset of the SPICE 2 dataset, and includes 112628 unique records for 1971769 total configurations for 16 elements (H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br) in both charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. All QM datapoints were generated using B3LYP-D3BJ/DZVP level of theory as this is the default theory used for force field development by the Open Force Field Initiative.

SPICE 2 OpenFF Dataset yaml Metadata

dataset: spice2_openff
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1

description: "Small-molecule/Protein Interaction Chemical Energies (SPICE), calculated at the default OpenFF level of theory.
  SPICE 2 OpenFF includes 16 elements (H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br) in both 
  charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. 
  Note, the original SPICE 2 dataset also includes Iodine (I), but systems with this element are not included in this
  dataset, as a small subset of the SPICE2 dataset was not included in the OpenFF dataset, namely: 
  
      -SPICE Ion Pairs Single Points Dataset v1.1
      -SPICE DES370K Single Points Dataset Supplement v1.0
  
  and a subset of calculations were not able to be converged fully. 
  
  The full SPICE 2 OpenFF dataset includes 112628 unique records for 1971769 total configurations while the 
  original SPICE 2 dataset includes 113985 unique records for 2008126 total configurations
  (both excluding those with forces > 1 hartree/bohr).
  
  All datapoints in the SPICE 2 OpenFF dataset  were generated using B3LYP-D3BJ/DZVP level of theory, as this is 
  the default theory used for force field development by the Open Force Field Initiative; the original SPICE 2 dataset
  was generated using ωB97M-D3(BJ)/def2-TZVPPD level of theory.
  

  Reference to original SPICE 2 publication:
    Eastman, P., Pritchard, B. P., Chodera, J. D., & Markland, T. E 
    Nutmeg and SPICE: models and data for biomolecular machine learning. 
    Journal of chemical theory and computation, 20(19), 8583-8593 (2024). 
    https://doi.org/10.1021/acs.jctc.4c00794
    
  Reference to the original SPICE 1 publication:
    Eastman, P., Behara, P.K., Dotson, D.L. et al. SPICE,
    A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials.
    Sci Data 10, 11 (2023). 
    https://doi.org/10.1038/s41597-022-01882-6

  DOI to original SPICE 1 and 2 datasets (not at the OpenFF level of theory):
  10.5281/zenodo.7258939"

atomic_self_energies: # these need to be replaced
  H: -1583.7235381559833*kilojoule_per_mole
  Li: -19347.515361076046*kilojoule_per_mole
  B: -65529.65768442646*kilojoule_per_mole
  C: -100057.69285084814*kilojoule_per_mole
  N: -143754.50034055635*kilojoule_per_mole
  O: -197534.06499229133*kilojoule_per_mole
  F: -262187.95257544337*kilojoule_per_mole
  Na: -425581.9250584926*kilojoule_per_mole
  Mg: -523280.8560382089*kilojoule_per_mole
  Si: -760216.8701576393*kilojoule_per_mole
  P: -896215.2333703283*kilojoule_per_mole
  S: -1045334.9682307948*kilojoule_per_mole
  Cl: -1208159.74701762*kilojoule_per_mole
  K: -1574526.5435217225*kilojoule_per_mole
  Ca: -1777190.5165390158*kilojoule_per_mole
  Br: -6757220.35723592*kilojoule_per_mole


nc_1000_v1.1:
  about: 'This provides a curated hdf5 file for the SPICE2 openff dataset designed
    to be compatible with modelforge. This dataset contains 100 unique records for
    1000 total configurations, with a maximum of 10 configurations per record.         
    This excludes any configurations where the magnitude of any forces on the
    atoms are greater than 1 hartree/bohr.'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_energy
  - dispersion_correction_energy
  - dft_total_energy
  - dft_force
  - dispersion_correction_force
  - dft_total_force
  - mbis_charges
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15476920
    gz_data_file:
      file_name: spice_2_openff_dataset_v1.1_ntc_1000.hdf5.gz
      length: 5717857
      md5: 957c1d89fb698d8e195eaf2b7bca2362
    hdf5_data_file:
      file_name: spice_2_openff_dataset_v1.1_ntc_1000.hdf5
      md5: 51f0f237be809c764db2585d9a541be6
    url: https://zenodo.org/records/15476920/files/spice_2_openff_dataset_v1.1_ntc_1000.hdf5.gz

nc_1000_HCNOFClS_v1.1:
  about: 'This provides a curated hdf5 file for a subset of the SPICE2 openff dataset
    designed to be compatible with modelforge. This dataset contains 100 unique records
    for 1000 total configurations, with a maximum of 10 configurations per record.
    This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.
    The dataset is limited to the elements that are compatible with ANI2x NNP: [H, C, N, O, F, Cl, S]'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_energy
  - dispersion_correction_energy
  - dft_total_energy
  - dft_force
  - dispersion_correction_force
  - dft_total_force
  - mbis_charges
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15477012
    gz_data_file:
      file_name: spice_2_openff_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz
      length: 5717866
      md5: ae8cc23e52692ff90b90378d6a7ea226
    hdf5_data_file:
      file_name: spice_2_openff_dataset_v1.1_ntc_1000_HCNOFClS.hdf5
      md5: 51f0f237be809c764db2585d9a541be6
    url: https://zenodo.org/records/15477012/files/spice_2_openff_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz

full_dataset_HCNOFClS_v1.1:
  about: 'This provides a curated hdf5 file for the SPICE2 openff dataset designed
    to be compatible with modelforge. This dataset contains 97274 unique records for
    1620018 total configurations. This excludes any configurations where the 
    magnitude of any forces on the atoms are greater than 1 hartree/bohr.
    The dataset is limited to the elements that are compatible with ANI2x NNP: [H, C, N, O, F, Cl, S]'
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_energy
  - dispersion_correction_energy
  - dft_total_energy
  - dft_force
  - dispersion_correction_force
  - dft_total_force
  - mbis_charges
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15477069
    gz_data_file:
      file_name: spice_2_openff_dataset_v1.1_HCNOFClS.hdf5.gz
      length: 5912909189
      md5: dcae9a6853e964ec4d0d3f0201babbce
    hdf5_data_file:
      file_name: spice_2_openff_dataset_v1.1_HCNOFClS.hdf5
      md5: 75839044d1d5e846bd92c8035323ef69
    url: https://zenodo.org/records/15477069/files/spice_2_openff_dataset_v1.1_HCNOFClS.hdf5.gz

full_dataset_v1.1:
  about: 'This provides a curated hdf5 file for the SPICE2 openff dataset designed
    to be compatible with modelforge. This dataset contains 112628 unique records
    for 1971769 total configurations. This excludes any configurations where
    the magnitude of the force on any atom is greater than 1 hartree/bohr.
    '
  available_properties:
  - atomic_numbers
  - positions
  - total_charge
  - dft_energy
  - dispersion_correction_energy
  - dft_total_energy
  - dft_force
  - dispersion_correction_force
  - dft_total_force
  - mbis_charges
  - scf_dipole
  hdf5_schema: 2
  remote_dataset:
    doi: 10.5281/zenodo.15477437
    gz_data_file:
      file_name: spice_2_openff_dataset_v1.1.hdf5.gz
      length: 7133416023
      md5: 5896c787315a6473df14db219b7ce5ef
    hdf5_data_file:
      file_name: spice_2_openff_dataset_v1.1.hdf5
      md5: 7cb903f659447fb838083fd523d14e0a
    url: https://zenodo.org/records/15477437/files/spice_2_openff_dataset_v1.1.hdf5.gz
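
Each version entry above records an `md5` checksum for both the gzipped and the extracted file, plus a `length` for the gzipped file. The following is a minimal sketch of how a downloaded file could be checked against those fields using only the Python standard library. It assumes that `length` is the file size in bytes; the helper names `file_md5` and `matches_metadata` are hypothetical and are not part of modelforge.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_metadata(path, expected_md5, expected_length=None):
    """Check a local file against the md5 (and, optionally, length) fields.

    Assumption: `length` is the size of the file in bytes.
    """
    if expected_length is not None and os.path.getsize(path) != expected_length:
        return False
    return file_md5(path) == expected_md5


# Demonstration on a small temporary file rather than a real download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"modelforge")
    demo_path = tmp.name

demo_md5 = hashlib.md5(b"modelforge").hexdigest()
demo_ok = matches_metadata(demo_path, demo_md5, expected_length=10)
demo_bad = matches_metadata(demo_path, demo_md5, expected_length=11)
os.remove(demo_path)
```

For a real download, the call would take the values from the entry of interest, for example `matches_metadata("spice_2_openff_dataset_v1.1.hdf5.gz", "5896c787315a6473df14db219b7ce5ef", 7133416023)`.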

tmQM (tmqm): The tmQM dataset contains the geometries and properties of 108,541 mononuclear complexes extracted from the Cambridge Structural Database, including Werner, bioinorganic, and organometallic complexes based on a large variety of organic ligands and 30 transition metals (the 3d, 4d, and 5d from groups 3 to 12). All complexes are closed-shell, with a formal charge in the range {+1, 0, −1}e.

David Balcells and Bastian Bjerkem Skjelstad, tmQM Dataset—Quantum Geometries and Properties of 86k Transition Metal Complexes. Journal of Chemical Information and Modeling 2020 60 (12), 6135-6146 https://dx.doi.org/10.1021/acs.jcim.0c01041

tmQM Dataset yaml Metadata

dataset: tmqm
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1

description: "The tmQM dataset contains the geometries and properties of 108,541 (86,665 in the original published set) 
    mononuclear complexes extracted from the Cambridge Structural Database, including Werner, bioinorganic, 
    and organometallic complexes based on a large variety of organic ligands and 30 transition metals 
    (the 3d, 4d, and 5d from groups 3 to 12). 
    
    All complexes are closed-shell, with a formal charge in the range {+1, 0, −1}e

    Original Citation:

    David Balcells and Bastian Bjerkem Skjelstad,
    tmQM Dataset—Quantum Geometries and Properties of 86k Transition Metal Complexes
    Journal of Chemical Information and Modeling 2020 60 (12), 6135-6146
    DOI: 10.1021/acs.jcim.0c01041"

atomic_self_energies:
  H: -1588.690123425219 * kilojoule_per_mole
  B: -65302.33351128112 * kilojoule_per_mole
  C: -100005.01654855655 * kilojoule_per_mole
  N: -143654.56892638578 * kilojoule_per_mole
  O: -197361.76171021158 * kilojoule_per_mole
  F: -261926.20424903592 * kilojoule_per_mole
  Si: -760035.7764038445 * kilojoule_per_mole
  P: -896075.6280215026 * kilojoule_per_mole
  S: -1045229.0663264447 * kilojoule_per_mole
  Cl: -1208038.6914349555 * kilojoule_per_mole
  Sc: -1997181.018901612 * kilojoule_per_mole
  Ti: -2230278.4864245243 * kilojoule_per_mole
  V: -2478389.354471244 * kilojoule_per_mole
  Cr: -2741967.2994972193 * kilojoule_per_mole
  Mn: -3021546.098466564 * kilojoule_per_mole
  Fe: -3317395.5973328506 * kilojoule_per_mole
  Co: -3629935.0938135427 * kilojoule_per_mole
  Ni: -3959571.3270608196 * kilojoule_per_mole
  Cu: -4306402.576897981 * kilojoule_per_mole
  Zn: -4671113.922983311 * kilojoule_per_mole
  As: -5869526.931994888 * kilojoule_per_mole
  Se: -6304454.897949699 * kilojoule_per_mole
  Br: -6757541.6132786125 * kilojoule_per_mole
  Y: -100773.37555590154 * kilojoule_per_mole
  Zr: -123709.71011983423 * kilojoule_per_mole
  Nb: -149762.5718722473 * kilojoule_per_mole
  Mo: -179149.19860244964 * kilojoule_per_mole
  Tc: -212135.93903845942 * kilojoule_per_mole
  Ru: -248990.05884762504 * kilojoule_per_mole
  Rh: -290061.85478664236 * kilojoule_per_mole
  Pd: -335541.5978772224 * kilojoule_per_mole
  Ag: -385322.4473000328 * kilojoule_per_mole
  Cd: -440036.922094555 * kilojoule_per_mole
  I: -781538.4859057926 * kilojoule_per_mole
  La: -82991.16536291114 * kilojoule_per_mole
  Hf: -126278.27589562583 * kilojoule_per_mole
  Ta: -149779.8577882084 * kilojoule_per_mole
  W: -176248.83619057274 * kilojoule_per_mole
  Re: -205660.78711122595 * kilojoule_per_mole
  Os: -237930.364964837 * kilojoule_per_mole
  Ir: -273916.6051998535 * kilojoule_per_mole
  Pt: -313193.3625088413 * kilojoule_per_mole
  Au: -356039.1883931112 * kilojoule_per_mole
  Hg: -402551.5785347049 * kilojoule_per_mole

full_dataset_v1.1:
  hdf5_schema: 2
  available_properties:
    - atomic_numbers
    - positions
    - partial_charges
    - total_Charge
    - spin_multiplicities
    - electronic_energy
    - dispersion_energy
    - total_energy
    - dipole_moment_magnitude
    - dipole_moment_computed
    - dipole_moment_computed_scaled
    - energy_of_lumo
    - energy_of_homo
    - homo_lumo_gap
  about:
    "This dataset contains 108541 unique systems with 108541 total configurations (1 configuration per system).
     The original tmQM repository (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds 
     to the data committed on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13)."
  remote_dataset:
    doi: 10.5281/zenodo.15331686
    url: https://zenodo.org/records/15331686/files/tmqm_dataset_v1.1.hdf5.gz
    gz_data_file:
      length: 328729880
      md5: 16332938d98f71023a963b36e7ae2191
      file_name: tmqm_dataset_v1.1.hdf5.gz
    hdf5_data_file:
      md5: b9454280213b488081955d46ebb378eb
      file_name: tmqm_dataset_v1.1.hdf5

nc_1000_v1.1:
  hdf5_schema: 2
  available_properties:
    - atomic_numbers
    - positions
    - partial_charges
    - total_Charge
    - spin_multiplicities
    - electronic_energy
    - dispersion_energy
    - total_energy
    - dipole_moment_magnitude
    - dipole_moment_computed
    - dipole_moment_computed_scaled
    - energy_of_lumo
    - energy_of_homo
    - homo_lumo_gap
  about:
    "This dataset contains 108541 unique systems with 108541 total configurations (1 configuration per system).
     The original tmQM repository (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds 
     to the data committed on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13)."
  remote_dataset:
    doi: 10.5281/zenodo.15331808
    url: https://zenodo.org/records/15331808/files/tmqm_dataset_v1.1_ntc_1000.hdf5.gz
    gz_data_file:
      length: 3605547
      md5: d9bfdccaa16b440d5c95c90be6480dab
      file_name: tmqm_dataset_nc_1000_v1.1.hdf5.gz
    hdf5_data_file:
      md5: d7fe366ecf8fee0264fcfac17e0b0b87
      file_name: tmqm_dataset_nc_1000_v1.1.hdf5

PdZnFeCu_CHPSONFClBr_nc_1000_v1.1:
  hdf5_schema: 2
  available_properties:
    - atomic_numbers
    - positions
    - partial_charges
    - total_Charge
    - spin_multiplicities
    - electronic_energy
    - dispersion_energy
    - total_energy
    - dipole_moment_magnitude
    - dipole_moment_computed
    - dipole_moment_computed_scaled
    - energy_of_lumo
    - energy_of_homo
    - homo_lumo_gap
  about:
    "This dataset contains 1000 unique systems with 1000 total configurations (1 configuration per system).
     This is a 1000-configuration test dataset.
     
     This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu, 
     and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
     
     The original tmQM repository (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds 
     to the data committed on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13)."
  remote_dataset:
    doi: 10.5281/zenodo.15345571
    url: https://zenodo.org/records/15345571/files/tmqm_dataset_PdZnFeCu_CHPSONFClBr_ntc_1000_v1.1.hdf5.gz
    gz_data_file:
      length: 3137141
      md5: 672025a093b4adf3233186be1c8393f4
      file_name: tmqm_dataset_PdZnFeCu_CHPSONFClBr_nc_1000_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 531977896b6074912098a0ec36baa665
      file_name: tmqm_dataset_PdZnFeCu_CHPSONFClBr_nc_1000_v1.1.hdf5


PdZnFeCu_CHPSONFClBr_v1.1:
  hdf5_schema: 2
  available_properties:
    - atomic_numbers
    - positions
    - partial_charges
    - total_Charge
    - spin_multiplicities
    - electronic_energy
    - dispersion_energy
    - total_energy
    - dipole_moment_magnitude
    - dipole_moment_computed
    - dipole_moment_computed_scaled
    - energy_of_lumo
    - energy_of_homo
    - homo_lumo_gap
  about:
    "This dataset contains 23183 unique systems with 23183 total configurations (1 configuration per system).
     
     This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu, 
     and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
     
     The original tmQM repository (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds 
     to the data committed on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13)."
  remote_dataset:
    doi: 10.5281/zenodo.15345127
    url: https://zenodo.org/records/15345127/files/tmqm_dataset_PdZnFeCu_CHPSONFClBr_v1.1.hdf5.gz
    gz_data_file:
      length: 69495572
      md5: ebd0ab2f6ab569980e2a2ce5c273146f
      file_name: tmqm_dataset_PdZnFeCu_CHPSONFClBr_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 4a0127a5c12c8c8a88c87eba058821ed
      file_name: tmqm_dataset_PdZnFeCu_CHPSONFClBr_v1.1.hdf5

PdZnFeCuNiPtIrRhCrAg_CHPSONFClBr_v1.1:
  hdf5_schema: 2
  available_properties:
    - atomic_numbers
    - positions
    - partial_charges
    - total_Charge
    - spin_multiplicities
    - electronic_energy
    - dispersion_energy
    - total_energy
    - dipole_moment_magnitude
    - dipole_moment_computed
    - dipole_moment_computed_scaled
    - energy_of_lumo
    - energy_of_homo
    - homo_lumo_gap
  about:
    "This dataset contains 51258 unique systems with 51258 total configurations (1 configuration per system).
     
     This dataset is limited to systems that contain transition metals Pd, Zn, Fe, Cu, Ni, Pt, Ir, Rh, Cr, or Ag  
     and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
     
     The original tmQM repository (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds 
     to the data committed on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13)."
  remote_dataset:
    doi: 10.5281/zenodo.15345149
    url: https://zenodo.org/records/15345149/files/tmqm_dataset_PdZnFeCuNiPtIrRhCrAg_CHPSONFClBr_v1.1.hdf5.gz
    gz_data_file:
      length: 153375735
      md5: 4add21b5612b9132d7ce4b7d232dfe3e
      file_name: tmqm_dataset_PdZnFeCuNiPtIrRhCrAg_CHPSONFClBr_v1.1.hdf5.gz
    hdf5_data_file:
      md5: b0702a256be4f5cf581441ac35c7d7af
      file_name: tmqm_dataset_PdZnFeCuNiPtIrRhCrAg_CHPSONFClBr_v1.1.hdf5
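
The `atomic_self_energies` entries in the metadata above are stored as strings of the form `value * unit`. Per-element reference energies of this kind are commonly summed over a system's composition so that the result can be subtracted from a total energy; whether and how modelforge applies them is not shown in this section. The sketch below uses plain string splitting, which is an illustrative assumption and not necessarily how modelforge parses these values. The helper names and the example composition are hypothetical.

```python
from collections import Counter


def parse_energy(entry):
    """Split a 'value * unit' string into (float value, unit name)."""
    value, unit = entry.split("*")
    return float(value), unit.strip()


# Three entries copied from the tmQM atomic_self_energies block.
atomic_self_energies = {
    "H": "-1588.690123425219 * kilojoule_per_mole",
    "C": "-100005.01654855655 * kilojoule_per_mole",
    "O": "-197361.76171021158 * kilojoule_per_mole",
}


def summed_self_energy(symbols, table):
    """Sum per-element self energies over a list of element symbols."""
    total = 0.0
    unit_seen = None
    for symbol, count in Counter(symbols).items():
        value, unit = parse_energy(table[symbol])
        # Refuse to add numbers that carry different unit labels.
        if unit_seen is not None and unit != unit_seen:
            raise ValueError("mixed units in self-energy table")
        unit_seen = unit
        total += count * value
    return total, unit_seen


# Hypothetical composition used only for illustration: CH2O.
total, unit = summed_self_energy(["C", "H", "H", "O"], atomic_self_energies)
```

An element that is missing from the table raises a `KeyError`, which is usually preferable to silently treating its self energy as zero.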

tmQM-xtb (tmqm_xtb): The tmQM-xtb dataset includes configurations generated using GFN2-xTB-based MD simulations starting from the energy-minimized geometries in the tmQM dataset. Energies, forces, charges, and dipole moments were calculated using the GFN2-xTB method. Several variants of the dataset are available, generated using different temperatures for MD sampling.
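
The version descriptions in the metadata for this dataset state that a sampled configuration was excluded if any bond distance changed by more than 0.15 angstroms relative to the initial, energy-minimized state, with bonds inferred using RDKit. The sketch below illustrates only the distance criterion. It assumes the bond list is already known and supplied by hand, so it does not reproduce the RDKit bond inference, and it is not the script used to build the dataset. All names and coordinates are hypothetical.

```python
import math


def bond_lengths(positions, bonds):
    """Return the distance for each (i, j) atom-index pair in bonds."""
    return [math.dist(positions[i], positions[j]) for i, j in bonds]


def keep_configuration(initial, candidate, bonds, tolerance=0.15):
    """True if no bond length differs from the initial one by more than tolerance.

    Positions and tolerance are assumed to be in angstroms.
    """
    reference = bond_lengths(initial, bonds)
    current = bond_lengths(candidate, bonds)
    return all(abs(c - r) <= tolerance for c, r in zip(current, reference))


# Hypothetical three-atom chain, coordinates in angstroms.
bonds = [(0, 1), (1, 2)]
initial = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
slightly_moved = [(0.0, 0.0, 0.0), (1.05, 0.0, 0.0), (2.0, 0.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.3, 0.0, 0.0)]

kept = keep_configuration(initial, slightly_moved, bonds)
dropped = not keep_configuration(initial, stretched, bonds)
```

Because every snapshot is compared against the initial geometry rather than the previous snapshot, gradual drift cannot accumulate past the tolerance unnoticed.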

tmQM-xtb Dataset yaml Metadata

dataset: tmqm_xtb
latest: PdZnFeCuNiPtIrRhCrAg_T100K_v1.1
latest_test: nc_1000_v1.1

description: "The tmQM-xtb dataset contains configurations generated using GFN2-xTB-based MD simulations starting 
    from the energy-minimized geometries in the tmQM dataset.  

    The original tmQM dataset contains the geometries and properties of mononuclear complexes extracted from the
    Cambridge Structural Database, including Werner, bioinorganic, and organometallic complexes based on a large
    variety of organic ligands and 30 transition metals (the 3d, 4d, and 5d from groups 3 to 12).
    All complexes are closed-shell, with a formal charge in the range {+1, 0, −1}e.

    Original Citation:

    David Balcells and Bastian Bjerkem Skjelstad,
    tmQM Dataset—Quantum Geometries and Properties of 86k Transition Metal Complexes
    Journal of Chemical Information and Modeling 2020 60 (12), 6135-6146
    DOI: 10.1021/acs.jcim.0c01041 "

atomic_self_energies:
  H: -1346.9991827591664 * kilojoule_per_mole
  C: -5617.968751828634 * kilojoule_per_mole
  N: -7672.109298341974 * kilojoule_per_mole
  O: -10704.649544039614 * kilojoule_per_mole
  F: -12450.413867238472 * kilojoule_per_mole
  Ir: -6598.040049917221 * kilojoule_per_mole
  Pt: -8576.086025878865 * kilojoule_per_mole
  P: -12100.053458428218 * kilojoule_per_mole
  S: -4944.219007863149 * kilojoule_per_mole
  Cl: -7938.35372876674 * kilojoule_per_mole
  Cr: -12369.173271985948 * kilojoule_per_mole
  Fe: -9663.693466916478 * kilojoule_per_mole
  Ni: -1252.3530347274261 * kilojoule_per_mole
  Cu: -10894.410447334463 * kilojoule_per_mole
  Zn: -10182.310751929233 * kilojoule_per_mole
  Br: -11739.997032286365 * kilojoule_per_mole
  Rh: -9590.608153082434 * kilojoule_per_mole
  Pd: -9713.417530536652 * kilojoule_per_mole
  Ag: -11641.150291664564 * kilojoule_per_mole

PdZnFeCu_T100K_single_config_v1.0:
  hdf5_schema: 2
  available_properties:
    - positions
    - atomic_numbers
    - total_charge
    - forces
    - dipole_moment_per_system
    - energies
    - partial_charges
  about:  "This dataset contains 23134 unique systems with 23134 total configurations (1 configuration per system).
           Each configuration corresponds to the geometry distributed as part of the original tmQM dataset, with
           no MD sampling applied.
             
           This dataset is limited to systems that contain transition metals Pd, Zn, Fe,  or Cu, 
           and also only contain elements C, H, P, S, O, N, F, Cl, or Br. 
          
          This dataset was generated starting from the tmQM dataset; the original tmQM repository 
          (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
          on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
          
          The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism, 
          using the calculator as part of the Atomic Simulation Environment (ASE), calculated at accuracy level 1. 

          Scripts used to perform the sampling can be found at  https://github.com/chrisiacovella/xtb_config_gen"
  remote_dataset:
    doi: 10.5281/zenodo.15021819
    url: https://zenodo.org/records/15021819/files/tmqm_xtb_dataset_PdZnFeCu_T100_first_v1.0.hdf5.gz
    gz_data_file:
      length: 96544047
      md5: cb86823c62d2127c209cded323c03eef
      file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_single_config_v1.hdf5.gz
    hdf5_data_file:
      md5: 96811817c3d65fdbe1c3691125ff0664
      file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_single_config_v1.hdf5

nc_1000_v1.1:
  hdf5_schema: 2
  available_properties:
    - positions
    - atomic_numbers
    - total_charge
    - forces
    - dipole_moment_per_system
    - energies
    - partial_charges
  about: "This dataset contains 103 unique systems with 1000 total configurations (max of 10 configurations per system), 
          where MD sampling was performed at T=100K.  
          
          This dataset is limited to systems that contain transition metals Pd, Zn, Fe,  or Cu, and also only contain 
          elements C, H, P, S, O, N, F, Cl, or Br. 
          
          Potentially problematic configurations (i.e., unstable or those with structural changes) were removed. 
          Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was 
          excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial, 
          energy minimized state.
          
          This dataset was generated starting from the tmQM dataset; the original tmQM repository 
          (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
          on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
          
          Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to 
          provide additional configurations of the systems.
          
          - The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
          - MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
          - Simulations were performed at 100K with a 1 fs timestep and 0.01 1/fs friction damping factor. 
          - In all trajectories, the first configuration corresponds to the energy minimized configuration reported 
            in the original tmQM dataset.
          - 100 steps were taken between snapshots (100 fs), with 10 total snapshots per system.
          - During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at 
            gfn2-xtb accuracy level 1.

          Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
  remote_dataset:
    doi: 10.5281/zenodo.15059379
    gz_data_file:
      length: 3425268
      md5: 43e80a303a9e02c47cc679ee8502cd11
      file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_ntc_1000_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 6c8676c119a4f0028b3cf9c7de5d577c
      file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_ntc_1000_v1.1.hdf5
    url: https://zenodo.org/records/15059379/files/tmqm_xtb_dataset_PdZnFeCu_T100_ntc_1000_v1.1.hdf5.gz

PdZnFeCu_T100K_v1.1:
  hdf5_schema: 2
  available_properties:
    - positions
    - atomic_numbers
    - total_charge
    - forces
    - dipole_moment_per_system
    - energies
    - partial_charges
  about: "This dataset contains 23134 unique systems with 225068 total configurations, where MD sampling was performed 
          at T=100K. 
                    
          This dataset is limited to systems that contain transition metals Pd, Zn, Fe,  or Cu, and also only contain 
          elements C, H, P, S, O, N, F, Cl, or Br. 
          
          Potentially problematic configurations (i.e., unstable or those with structural changes) were removed. 
          Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was 
          excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial, 
          energy minimized state.
          
          This dataset was generated starting from the tmQM dataset; the original tmQM repository 
          (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
          on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
          
          Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to 
          provide additional configurations of the systems.
          
          - The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
          - MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
          - Simulations were performed at 100K with a 1 fs timestep and 0.01 1/fs friction damping factor. 
          - In all trajectories, the first configuration corresponds to the energy minimized configuration reported 
            in the original tmQM dataset.
          - 100 steps were taken between snapshots (100 fs), with 10 total snapshots per system.
          - During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at 
            gfn2-xtb accuracy level 1.

          Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
  remote_dataset:
    doi: 10.5281/zenodo.15059433
    gz_data_file:
      length: 828124531
      md5: c7c8d48d7077dfbd10635a17ffa38848
      file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_v1.1.hdf5.gz
    hdf5_data_file:
      md5: e121c9182a2c6621d9f92f8d4b4a8188
      file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_v1.1.hdf5
    url: https://zenodo.org/records/15059433/files/tmqm_xtb_dataset_PdZnFeCu_T100_v1.1.hdf5.gz

PdZnFeCuNiPtIrRhCrAg_T100K_v1.1:
  hdf5_schema: 2
  available_properties:
    - positions
    - atomic_numbers
    - total_charge
    - forces
    - dipole_moment_per_system
    - energies
    - partial_charges
  about: "This dataset contains 51160 unique systems with 499087 total configurations, with MD sampling at T=100K.
  
          This dataset is limited to systems that contain transition metals Pd, Zn, Fe, Cu, Ni, Pt, Ir, Rh, Cr, or Ag 
          and also only contain elements C, H, P, S, O, N, F, Cl, or Br. 

          Potentially problematic configurations (i.e., unstable or those with structural changes) were removed. 
          Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was 
          excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial, 
          energy minimized state.

          This dataset was generated starting from the tmQM dataset; the original tmQM repository 
          (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
          on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).

          Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to 
          provide additional configurations of the systems.

          - The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
          - MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
          - Simulations were performed at 100K with a 1 fs timestep and 0.01 1/fs friction damping factor. 
          - In all trajectories, the first configuration corresponds to the energy minimized configuration reported 
            in the original tmQM dataset.
          - 100 steps were taken between snapshots (100 fs), with 10 total snapshots per system.
          - During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at 
            gfn2-xtb accuracy level 1.

          Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
  remote_dataset:
    doi: 10.5281/zenodo.15059465
    gz_data_file:
      length: 1829694005
      md5: 9efd03d7c18901b5618489db6209d0a0
      file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T100K_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 16fa0b45afb7ff3ca9568cca54d89de0
      file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T100K_v1.1.hdf5
    url: https://zenodo.org/records/15059465/files/tmqm_xtb_dataset_PdZnFeCuNiPtIrCrAg_T100_v1.1.hdf5.gz

PdZnFeCuNiPtIrRhCrAg_T200K_v1.1:
  hdf5_schema: 2
  available_properties:
    - positions
    - atomic_numbers
    - total_charge
    - forces
    - dipole_moment_per_system
    - energies
    - partial_charges
  about: "This dataset contains 51249 unique systems with 1317625 total configurations, sampled at T=200K.

          This dataset is limited to systems that contain transition metals Pd, Zn, Fe, Cu, Ni, Pt, Ir, Rh, Cr, or Ag 
          and also only contain elements C, H, P, S, O, N, F, Cl, or Br. 

          Potentially problematic configurations (i.e., unstable or those with structural changes) were removed. 
          Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was 
          excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial, 
          energy minimized state.

          This dataset was generated starting from the tmQM dataset; the original tmQM repository 
          (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
          on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).

          Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to 
          provide additional configurations of the systems.

          - The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
          - MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
          - Simulations were performed at 200K with a 1 fs timestep and 0.01 1/fs friction damping factor. 
          - In all trajectories, the first configuration corresponds to the energy minimized configuration reported 
            in the original tmQM dataset.
          - 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
          - During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at 
            gfn2-xtb accuracy level 1.

          Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
  remote_dataset:
    doi: 10.5281/zenodo.15226046
    gz_data_file:
      length: 4749276362
      md5: a1d03a025ecfd48d7dc286b3d71cb900
      file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T200K_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 19203071e1ff743d3402a36750f74b86
      file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T200K_v1.1.hdf5
    url: https://zenodo.org/records/15226046/files/tmqm_xtb_dataset_PdZnFeCuNiPtIrCrAg_T200_v1.1.hdf5.gz

PdZnFeCuNiPtIrRhCrAg_T300K_v1.1:
  hdf5_schema: 2
  available_properties:
    - positions
    - atomic_numbers
    - total_charge
    - forces
    - dipole_moment_per_system
    - energies
    - partial_charges
  about: "This dataset contains 51252 unique systems with 1118541 total configurations, sampled at T=300K.

          This dataset is limited to systems that contain transition metals Pd, Zn, Fe, Cu, Ni, Pt, Ir, Rh, Cr, or Ag 
          and also only contain elements C, H, P, S, O, N, F, Cl, or Br. 

          Potentially problematic configurations (i.e., unstable or those with structural changes) were removed. 
          Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was 
          excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial, 
          energy minimized state.

          This dataset was generated starting from the tmQM dataset; the original tmQM repository 
          (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
          on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).

          Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to 
          provide additional configurations of the systems.

          - The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
          - MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
          - Simulations were performed at 300K with a 1 fs timestep and 0.01 1/fs friction damping factor. 
          - In all trajectories, the first configuration corresponds to the energy minimized configuration reported 
            in the original tmQM dataset.
          - 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
          - During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at 
            gfn2-xtb accuracy level 1.

          Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
  remote_dataset:
    doi: 10.5281/zenodo.15226639
    gz_data_file:
      length: 4062452149
      md5: 5005e4b8c329031b14ceeef67cb67644
      file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T300K_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 999454490fe077a88c409970504f7f41
      file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T300K_v1.1.hdf5
    url: https://zenodo.org/records/15226639/files/tmqm_xtb_dataset_PdZnFeCuNiPtIrCrAg_T300_v1.1.hdf5.gz

PdZnFeCu_T200K_v1.1:
  hdf5_schema: 2
  available_properties:
    - positions
    - atomic_numbers
    - total_charge
    - forces
    - dipole_moment_per_system
    - energies
    - partial_charges
  about: "This dataset contains 23175 unique systems with 584,935 total configurations, sampled at T=200K.

          This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu, 
          and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
          
          Potentially problematic configurations (i.e., unstable or those with structural changes) were removed. 
          Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was 
          excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial, 
          energy minimized state.

          This dataset was generated starting from the tmQM dataset; the original tmQM repository 
          (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
          on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).

          Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to 
          provide additional configurations of the systems.

          - The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
          - MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
          - Simulations were performed at 200K with a 1 fs timestep and 0.01 1/fs friction damping factor. 
          - In all trajectories, the first configuration corresponds to the energy minimized configuration reported 
            in the original tmQM dataset.
          - 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
          - During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at 
            gfn2-xtb accuracy level 1.

          Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
  remote_dataset:
    doi: 10.5281/zenodo.15227023
    gz_data_file:
      length: 2118955545
      md5: 834ec7ed3670dfaaacc78beccc4b8a8d
      file_name: tmqm_xtb_dataset_PdZnFeCu_T200K_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 4cb6d3e170e5cb9c63e2cac58b84a33f
      file_name: tmqm_xtb_dataset_PdZnFeCu_T200K_v1.1.hdf5
    url: https://zenodo.org/records/15227023/files/tmqm_xtb_dataset_PdZnFeCu_T200_v1.1.hdf5.gz

PdZnFeCu_T200K_ncr10_v1.1:
  hdf5_schema: 2
  available_properties:
    - positions
    - atomic_numbers
    - total_charge
    - forces
    - dipole_moment_per_system
    - energies
    - partial_charges
  about: "This dataset contains 23175 unique systems with 230,030 total configurations (maximum of 10 per system), 
          sampled at T=200K. While 30 configurations were generated per system during sampling, this dataset limits 
          this to be a maximum of 10 configurations per system, to allow for more direct comparison with T=100K data.
          

          This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu, 
          and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
          
          Potentially problematic configurations (i.e., unstable or those with structural changes) were removed. 
          Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was 
          excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial, 
          energy minimized state.

          This dataset was generated starting from the tmQM dataset; the original tmQM repository 
          (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
          on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).

          Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation was performed 
          to provide additional configurations of the systems.

          - The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
          - MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
          - Simulations were performed at 200K with a 1 fs timestep and 0.01 1/fs friction damping factor. 
          - In all trajectories, the first configuration corresponds to the energy minimized configuration reported 
            in the original tmQM dataset.
          - 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
          - During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at 
            gfn2-xtb accuracy level 1.

          Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
  remote_dataset:
    doi: 10.5281/zenodo.15227086
    gz_data_file:
      length: 846498137
      md5: 9bf52e1a6ce2fa0a72c93600fb7c7431
      file_name: tmqm_xtb_dataset_PdZnFeCu_T200K_ncr_10_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 624979457c74cb472bef4bbbba77920b
      file_name: tmqm_xtb_dataset_PdZnFeCu_T200K_ncr_10_v1.1.hdf5
    url: https://zenodo.org/records/15227086/files/tmqm_xtb_dataset_PdZnFeCu_T200_first10_v1.1.hdf5.gz

PdZnFeCu_T300K_v1.1:
  hdf5_schema: 2
  available_properties:
    - positions
    - atomic_numbers
    - total_charge
    - forces
    - dipole_moment_per_system
    - energies
    - partial_charges
  about: "This dataset contains 23177 unique systems with 490,861 total configurations, sampled at T=300K.
          
          This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu, 
          and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
          
          Potentially problematic configurations (i.e., unstable or those with structural changes) were removed. 
          Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was 
          excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial, 
          energy minimized state.

          This dataset was generated starting from the tmQM dataset; the original tmQM repository 
          (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
          on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).

          Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation was performed 
          to provide additional configurations of the systems.

          - The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
          - MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
          - Simulations were performed at 300K with a 1 fs timestep and 0.01 1/fs friction damping factor. 
          - In all trajectories, the first configuration corresponds to the energy minimized configuration reported 
            in the original tmQM dataset.
          - 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
          - During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at 
            gfn2-xtb accuracy level 1.

          Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
  remote_dataset:
    doi: 10.5281/zenodo.15227144
    gz_data_file:
      length: 1793203012
      md5: c133327bcf73182efecccaea51a34fdf
      file_name: tmqm_xtb_dataset_PdZnFeCu_T300K_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 0bbee004a633654963b57811b690b128
      file_name: tmqm_xtb_dataset_PdZnFeCu_T300K_v1.1.hdf5
    url: https://zenodo.org/records/15227144/files/tmqm_xtb_dataset_PdZnFeCu_T300_v1.1.hdf5.gz

PdZnFeCu_T300K_ncr10_v1.1:
  hdf5_schema: 2
  available_properties:
    - positions
    - atomic_numbers
    - total_charge
    - forces
    - dipole_moment_per_system
    - energies
    - partial_charges
  about: "This dataset contains 23177 unique systems with 225,571 total configurations (maximum of 10 per system), 
          sampled at T=300K. While 30 configurations were generated per system during sampling, this dataset keeps 
          a maximum of 10 configurations per system, to allow for more direct comparison with the T=100K data, 
          for which only 10 configurations were generated.
          
          This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu, 
          and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
          
          Potentially problematic configurations (i.e., unstable or those with structural changes) were removed. 
          Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was 
          excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial, 
          energy minimized state.

          This dataset was generated starting from the tmQM dataset; the original tmQM repository 
          (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
          on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).

          Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation was performed 
          to provide additional configurations of the systems.

          - The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
          - MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
          - Simulations were performed at 300K with a 1 fs timestep and 0.01 1/fs friction damping factor. 
          - In all trajectories, the first configuration corresponds to the energy minimized configuration reported 
            in the original tmQM dataset.
          - 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
          - During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at 
            gfn2-xtb accuracy level 1.

          Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
  remote_dataset:
    doi: 10.5281/zenodo.15227237
    gz_data_file:
      length: 831615242
      md5: f7c6dca18f52d99253cbdd74fe540032
      file_name: tmqm_xtb_dataset_PdZnFeCu_T300K_ncr_10_v1.1.hdf5.gz
    hdf5_data_file:
      md5: 60219cdee1c975f3eef2a193d16a3dcc
      file_name: tmqm_xtb_dataset_PdZnFeCu_T300K_ncr_10_v1.1.hdf5
    url: https://zenodo.org/records/15227237/files/tmqm_xtb_dataset_PdZnFeCu_T300_first10_v1.1.hdf5.gz