Dataset Module
The dataset module in modelforge provides a suite of functions and classes designed to retrieve and transform quantum mechanics (QM) datasets into a format compatible with torch.utils.data.Dataset as well as the PyTorch Lightning LightningDataModule, facilitating the training of machine learning potentials. The module supports data storage, caching, retrieval, and the conversion of stored modelforge curated HDF5 files into PyTorch-compatible datasets for training purposes.
Modelforge currently provides a host of datasets containing a variety of molecular structures and properties. These datasets are curated into HDF5 formatted files designed to be compatible with modelforge and hosted on zenodo.org (see the zenodo modelforge community); the underlying HDF5Dataset class provides a framework to download, cache, and process these files into a format compatible with torch.utils.data.Dataset, as previously noted. Local datasets stored in a modelforge compatible HDF5 format can also be used, allowing users to work with their own datasets without needing to upload them to a remote server or modify the modelforge source. These can be specified by providing a configuration file, as will be described below.
Dataset Configuration TOML file
Dataset input configuration is typically managed using a TOML file. This configuration file is crucial during the training process as it provides values that need to be specified for the DataModule class, ensuring a flexible and customizable setup.
Below is a minimal example of a dataset configuration for the QM9 dataset.
[dataset]
dataset_name = "QM9"
version_select = "nc_1000_v1.2"
num_workers = 4
pin_memory = true
properties_of_interest = ["atomic_numbers", "positions", "internal_energy_at_0K", "dipole_moment_per_system"]
element_filter = []
regression_ase = false
[dataset.properties_assignment]
atomic_numbers = "atomic_numbers"
positions = "positions"
E = "internal_energy_at_0K"
Warning
The version_select field in the example indicates the use of a small subset of the QM9 dataset. To utilize the full dataset, set this variable to latest.
Explanation of the possible fields in the dataset configuration file:
dataset_name: Specifies the name of the dataset. For this example, it is QM9.
version_select: Indicates the version of the dataset to use. In this example, it points to a small subset of the dataset for quick testing. To use the full QM9 dataset, set this variable to latest.
num_workers: Determines the number of worker threads for data loading. Increasing the number of workers can speed up data loading but requires more memory. Must be 1 or greater.
pin_memory: A boolean flag indicating whether to pin memory for faster data transfer to the GPU. This is useful when training on a GPU and can improve performance by reducing data transfer times. Defaults to True.
properties_of_interest: Lists the properties of interest to load from the HDF5 file. This should include the properties that are relevant for training the model. The properties listed here must match those available in the dataset metadata; otherwise, a validation error will be raised. Loading properties that will not be used during training will use more memory.
properties_assignment: Maps the properties of interest to the corresponding fields in the dataset. This mapping is crucial for the correct loading of properties during training; note that many datasets contain multiple properties that can potentially be swapped (e.g., energy calculated with or without dispersion corrections, different charge population schemes, different levels of theory, etc.). Any properties listed here must appear in the properties of interest list; the code will raise a validation error if this condition is not met. The possible fields to assign are defined by the PropertyNames class, which is listed below. Note that, by default, atomic_numbers, positions, and energy (E) are always required to be set.
class PropertyNames:
    atomic_numbers: str  # per-atom atomic numbers (atomic numbers are integers)
    positions: str  # per-atom positions (cartesian coordinates)
    E: str  # per-system energy (total energy)
    F: Optional[str] = None  # per-atom forces
    total_charge: Optional[str] = None  # per-system total charge
    dipole_moment: Optional[str] = None  # per-system dipole moment
    spin_multiplicity: Optional[str] = None  # per-system spin multiplicity
    partial_charges: Optional[str] = None  # per-atom partial charges
    quadrupole_moment: Optional[str] = None  # per-system quadrupole moment
element_filter: A filter to select systems with or without certain elements, which are denoted by atomic numbers. A positive number means a system must contain that element to be included; a negative number means a system containing that element is excluded. For example, [[29]] selects all systems containing copper (29). [[29, -17]] selects all systems containing copper (29), but excludes any that also contain chlorine (17). [[29, 1, -17]] selects all systems that contain copper (29) and hydrogen (1) and do not contain chlorine (17). Everything contained within the same brackets acts as an “and” (i.e., all criteria must be satisfied). Providing separate sublists acts as an “or”. For example, [[29, 1], [78, -17]] states that a system can either have [copper (29) and hydrogen (1)] OR [platinum (78) and not chlorine (17)]. Leaving this field as an empty list, or removing it, disables element filtering.
regression_ase: A boolean flag indicating whether to use the atomic self-energies provided by the dataset (if available) or to calculate them via regression. If set to False, the atomic self-energies are used as provided in the dataset metadata; if set to True, the self-energies are calculated via regression. This is Optional and defaults to False.
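The and/or semantics of element_filter described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not modelforge's implementation; the function name passes_element_filter and the example systems are hypothetical.

```python
from typing import Iterable, List


def passes_element_filter(
    atomic_numbers: Iterable[int], element_filter: List[List[int]]
) -> bool:
    """Return True if a system satisfies the filter rules described in the text."""
    present = set(atomic_numbers)
    # An empty filter disables filtering, so every system is kept.
    if not element_filter:
        return True
    for sublist in element_filter:
        required = {z for z in sublist if z > 0}   # elements that must be present
        excluded = {-z for z in sublist if z < 0}  # elements that must be absent
        # Within one sublist all criteria must hold ("and").
        if required <= present and not (excluded & present):
            return True  # one matching sublist is enough ("or")
    return False


element_filter = [[29, 1], [78, -17]]
cu_h_system = [29, 1, 6]   # copper, hydrogen, carbon
pt_cl_system = [78, 17]    # platinum, chlorine
pt_c_system = [78, 6]      # platinum, carbon

kept = [
    passes_element_filter(system, element_filter)
    for system in (cu_h_system, pt_cl_system, pt_c_system)
]
print(kept)
```

The first system matches the first sublist, the second is rejected because chlorine is present, and the third matches the second sublist.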
Other fields that can be specified in the dataset configuration file include:
local_yaml_file: A path to a local dataset yaml file. This is Optional and defaults to None. If specified, it will be used to load the dataset metadata instead of the default metadata files provided by modelforge. This allows users to work with their own datasets without needing to upload them to a remote server or modifying the modelforge source.
dataset_cache_dir: Specifies the directory where the dataset files will be cached. This is useful for storing the dataset files locally to avoid downloading them multiple times; can be shared between multiple training runs.
Processing of dataset entries
Several other operations are commonly performed on the dataset as part of training machine-learned potentials:
Removing Self-Energies: Self-energies are per-element offsets subtracted from the total energy of a system. The energy offsets provide cleaner training data (e.g., MAE values of energy are closer to the scale of the energy itself).
Shifting the Energies: The energies can be shifted by a constant value, potentially improving the stability and speed of training. This shift can be set to the minimum, maximum, or mean of the training dataset energies. Minimum shifting subtracts the smallest value, making all values non-negative; maximum shifting makes all values non-positive; mean shifting centers the energies around zero.
Splitting the Dataset: The dataset is split into training, validation, and test sets. This is crucial for evaluating the performance of the machine learning model and ensuring that it generalizes well to unseen data. Various schemes can be used to specify this.
Shifting the center of mass: The center of mass of the system can be shifted to the origin to enable calculation of the dipole moment.
Normalization and Scaling: Normalize the energies and other properties to ensure they are on a comparable scale, which can improve the stability and performance of the machine learning model. Note that this is done when atomic energies are predicted, i.e. the atomic energy (E_i) is scaled using the atomic energy distribution obtained from the training dataset: E_i = E_i_stddev * E_i_pred + E_i_mean.
However, note that these operations are not defined within the dataset configuration; these are specified in the training (self-energy, splitting, shifting COM) and potential (normalization) configuration TOML files.
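The energy operations listed above amount to simple arithmetic, sketched below in plain Python. The self-energy values, the three systems, and the function names are made up for illustration and do not come from any modelforge dataset; the last function is the rescaling formula E_i = E_i_stddev * E_i_pred + E_i_mean quoted above.

```python
from statistics import mean

# Hypothetical per-element self-energies and systems, in arbitrary units.
self_energies = {1: -0.5, 6: -38.0, 8: -75.0}
systems = [
    {"atomic_numbers": [8, 1, 1], "E": -76.4},
    {"atomic_numbers": [6, 1, 1, 1, 1], "E": -40.5},
    {"atomic_numbers": [6, 8], "E": -113.3},
]


def remove_self_energy(atomic_numbers, total_energy, ase):
    """Subtract the sum of per-element offsets from the total energy."""
    return total_energy - sum(ase[z] for z in atomic_numbers)


def shift(energies, mode):
    """Shift energies by the min, max, or mean of the given (training) energies."""
    reference = {"min": min, "max": max, "mean": mean}[mode](energies)
    return [e - reference for e in energies]


def denormalize(e_pred, e_mean, e_stddev):
    """Rescale a normalized atomic energy prediction back to physical scale."""
    return e_stddev * e_pred + e_mean


offset_energies = [
    remove_self_energy(s["atomic_numbers"], s["E"], self_energies) for s in systems
]
print(offset_energies)                 # values are now a fraction of one unit
print(shift(offset_energies, "min"))   # smallest value becomes zero
print(shift(offset_energies, "mean"))  # values centered around zero
```

After removing self-energies, the remaining values are orders of magnitude smaller than the raw totals, which is why error metrics become easier to interpret.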
Interacting with the Dataset Module
Here, we provide a brief overview of the DataModule class. Note that users will typically interact with this portion of the code indirectly via the TOML configuration files. The DataModule class handles preparing and setting up datasets for training and is designed to integrate seamlessly with PyTorch Lightning, providing a user-friendly interface for dataset preparation and loading.
The following example demonstrates how to use the DataModule class to prepare and set up a dataset for training, where the similarity to the TOML configuration file should be evident.
from modelforge.dataset import DataModule
from modelforge.dataset.utils import RandomRecordSplittingStrategy
dataset_name = "QM9"
splitting_strategy = RandomRecordSplittingStrategy() # split randomly on system level
batch_size = 64
version_select = "latest"
remove_self_energies = True # remove the atomic self energies
regression_ase = False # use the atomic self energies provided by the dataset
data_module = DataModule(
    name=dataset_name,
    properties_of_interest=["atomic_numbers", "positions", "internal_energy_at_0K"],
    properties_assignment={
        "E": "internal_energy_at_0K",  # must name an entry in properties_of_interest
        "atomic_numbers": "atomic_numbers",
        "positions": "positions",
    },
    splitting_strategy=splitting_strategy,
    batch_size=batch_size,
    version_select=version_select,
    remove_self_energies=remove_self_energies,
    regression_ase=regression_ase,
    local_cache_dir="~/modelforge_run",
    dataset_cache_dir="~/modelforge_hdf5_files",
)
# Prepare the data (downloads, processes, and caches if necessary)
data_module.prepare_data()
# Setup the data for training, validation, and testing
data_module.setup()
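To make the idea of a random split at the record (system) level concrete, here is a standalone sketch that does not use modelforge. It is not the implementation of RandomRecordSplittingStrategy; the function name, the 80/10/10 fractions, and the seed are assumptions chosen for illustration. The point is that whole records are assigned to a subset, so all configurations of one system end up together.

```python
import random
from typing import Dict, List, Sequence, Tuple


def random_record_split(
    record_ids: Sequence[str],
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed defaults
    seed: int = 42,
) -> Dict[str, List[str]]:
    """Shuffle record identifiers and cut them into train/val/test subsets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(record_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(fractions[0] * len(shuffled))
    n_val = int(fractions[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train : n_train + n_val],
        "test": shuffled[n_train + n_val :],  # remainder goes to the test set
    }


records = [f"mol_{i}" for i in range(20)]
splits = random_record_split(records)
print({name: len(ids) for name, ids in splits.items()})
```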
YAML Metadata File Structure
The HDF5Dataset class is designed to provide a generic class for loading modelforge compatible HDF5 files. It relies upon reading a YAML file that provides essential information about a given dataset, including the available versions, properties, and other relevant details, along with the download URL used to fetch the dataset. These YAML metadata files are stored in the ~modelforge/dataset/yaml_files directory for the datasets provided by modelforge.
Below is a fictional example of a metadata YAML file to demonstrate the key fields, which include the dataset name, version, description, atomic self-energies, and available properties.
dataset: fictional_dataset_name
latest: full_dataset_v1.1 # an alias for the latest version of the full dataset
latest_test: nc_1000_v1.1 # an alias for the latest version of the 1000 configuration test dataset
description: "A description of the dataset."
atomic_self_energies:
  H: -1400.0 * kilojoule_per_mole
  C: -10000.0 * kilojoule_per_mole
full_dataset_v1.1:
  about: "This provides a curated hdf5 file for the fictional dataset designed to be compatible
    with modelforge. This dataset contains 1234 unique records for 123456 total
    configurations."
  hdf5_schema: 2 # This specifies which modelforge HDF5 schema the version uses.
  available_properties: # list of properties keys available in the dataset
    - atomic_numbers
    - positions
    - dft_energy
  remote_dataset:
    doi: 10.1234/fictional_dataset.v1.1 # The DOI for the zenodo record of the dataset
    url: https://zenodo.org/records/record_id/files/fictional_dataset_v1.1.hdf5.gz # The URL to download the gzipped HDF5 file
    gz_data_file:
      file_name: fictional_dataset_v1.1.hdf5.gz # The name of the gzipped file that will be saved locally
      length: 123456 # Length of the gzipped file in bytes, used for the progress bar
      md5: gzip_checksum_value # The MD5 checksum of the gzipped file, used to verify the integrity of the downloaded file
    hdf5_data_file:
      file_name: fictional_dataset_v1.1.hdf5 # The name of the HDF5 file that will be saved locally after unzipping
      md5: hdf5_checksum_value # The MD5 checksum of the HDF5 file, used to verify the integrity of the downloaded file
Note that HDF5 data files hosted on zenodo.org are stored as gzipped files to save space and bandwidth when downloading.
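The two md5 fields in the metadata suggest a check before and after decompression. The sketch below shows one way such a check could be done with the Python standard library; it is not modelforge's download code, the function names are hypothetical, and a small stand-in payload is used instead of a real dataset file.

```python
import gzip
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unzip_and_verify(gz_path: Path, out_path: Path, gz_md5: str, hdf5_md5: str) -> None:
    """Verify the gzipped file, decompress it, then verify the result."""
    if md5_of_file(gz_path) != gz_md5:
        raise ValueError(f"checksum mismatch for {gz_path.name}")
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if md5_of_file(out_path) != hdf5_md5:
        raise ValueError(f"checksum mismatch for {out_path.name}")


# Demonstration with a stand-in payload rather than a real dataset download.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"stand-in for HDF5 bytes"
    gz_path = Path(tmp) / "example.hdf5.gz"
    out_path = Path(tmp) / "example.hdf5"
    with gzip.open(gz_path, "wb") as handle:
        handle.write(payload)
    unzip_and_verify(
        gz_path, out_path, md5_of_file(gz_path), hashlib.md5(payload).hexdigest()
    )
    restored = out_path.read_bytes()
print(restored == payload)
```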
To specify metadata for a local dataset, the remote_dataset field can be omitted and replaced with the field local_dataset as shown below:
Available Datasets and Versions
Below is a description of the curated datasets currently available for modelforge and their corresponding metadata YAML files. These files can be found in the ~modelforge/dataset/yaml_files directory. The YAML files provide detailed information about each dataset, including the versions, properties, self-energies, and download URLs. As previously mentioned, multiple versions may be available for each dataset. A 1000-configuration test version, primarily useful for testing, is provided for each dataset; several datasets also provide various subsets (e.g., limited to a subset of elements).
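Because the metadata files carry version aliases such as latest, choosing a version amounts to a dictionary lookup. The sketch below mirrors the fictional metadata structure as a plain Python dict; resolve_version is a hypothetical helper, and treating latest_test as a selectable alias is an assumption based on the metadata keys.

```python
# A plain-dict stand-in for a parsed metadata YAML file (fictional values).
metadata = {
    "dataset": "fictional_dataset_name",
    "latest": "full_dataset_v1.1",
    "latest_test": "nc_1000_v1.1",
    "description": "A description of the dataset.",
    "full_dataset_v1.1": {
        "hdf5_schema": 2,
        "available_properties": ["atomic_numbers", "positions", "dft_energy"],
    },
    "nc_1000_v1.1": {
        "hdf5_schema": 2,
        "available_properties": ["atomic_numbers", "positions", "dft_energy"],
    },
}

ALIASES = ("latest", "latest_test")


def resolve_version(metadata: dict, version_select: str) -> str:
    """Map an alias to a concrete version name and check that it exists."""
    if version_select in ALIASES:
        version_select = metadata[version_select]
    entry = metadata.get(version_select)
    # A version entry is a mapping that lists its available properties.
    if not isinstance(entry, dict) or "available_properties" not in entry:
        raise KeyError(f"unknown version: {version_select}")
    return version_select


version = resolve_version(metadata, "latest")
print(version, metadata[version]["available_properties"])
```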
The dataset names used to specify the dataset in modelforge are provided in parentheses:
ANI1x (ani1x): This dataset includes ~5 million density functional theory calculations for small organic molecules containing H, C, N, and O. A subset of ~500k are computed with accurate coupled cluster methods.
- ANI-1x dataset:
Smith, J. S.; Nebgen, B.; Lubbers, N.; Isayev, O.; Roitberg, A. E. Less Is More: Sampling Chemical Space with Active Learning. J. Chem. Phys. 2018, 148 (24), 241733. https://doi.org/10.1063/1.5023802
- ANI-1ccx dataset:
Smith, J. S.; Nebgen, B. T.; Zubatyuk, R.; Lubbers, N.; Devereux, C.; Barros, K.; Tretiak, S.; Isayev, O.; Roitberg, A. E. Approaching Coupled Cluster Accuracy with a General-Purpose Neural Network Potential through Transfer Learning. Nat. Commun. 2019, 10 (1), 2903. https://doi.org/10.1038/s41467-019-10827-4
- ωB97x/def2-TZVPP data:
Zubatyuk, R.; Smith, J. S.; Leszczynski, J.; Isayev, O. Accurate and Transferable Multitask Prediction of Chemical Properties with an Atoms-in-Molecules Neural Network. Sci. Adv. 2019, 5 (8), eaav6490. https://doi.org/10.1126/sciadv.aav6490
dataset: ani1x
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1
description: "ANI1x dataset includes ~5 million density function theory calculations
for small organic molecules containing H, C, N, and O.
A subset of ~500k are computed with accurate coupled cluster methods.
References:
ANI-1x dataset:
Smith, J. S.; Nebgen, B.; Lubbers, N.; Isayev, O.; Roitberg, A. E.
Less Is More: Sampling Chemical Space with Active Learning.
J. Chem. Phys. 2018, 148 (24), 241733.
https://doi.org/10.1063/1.5023802
https://arxiv.org/abs/1801.09319
ANI-1ccx dataset:
Smith, J. S.; Nebgen, B. T.; Zubatyuk, R.; Lubbers, N.; Devereux, C.; Barros, K.; Tretiak, S.; Isayev, O.; Roitberg, A. E.
Approaching Coupled Cluster Accuracy with a General-Purpose Neural Network Potential through Transfer Learning.
Nat. Commun. 2019, 10 (1), 2903.
https://doi.org/10.1038/s41467-019-10827-4
wB97x/def2-TZVPP data:
Zubatyuk, R.; Smith, J. S.; Leszczynski, J.; Isayev, O.
Accurate and Transferable Multitask Prediction of Chemical Properties with an Atoms-in-Molecules Neural Network.
Sci. Adv. 2019, 5 (8), eaav6490.
https://doi.org/10.1126/sciadv.aav6490"
atomic_self_energies:
H: -0.5978583943827134 * hartree
C: -38.08933878049795 * hartree
N: -54.711968298621066 * hartree
O: -75.19106774742086 * hartree
full_dataset_v1.1:
about: "This provides a curated hdf5 file for the ANI-1x dataset designed to be compatible
with modelforge. This dataset contains 3114 unique records for 4956005 total configurations.
Note, individual configurations are partitioned into entries based on the array
of atomic species appearing in sequence in the source data file."
available_properties:
- atomic_numbers
- positions
- wb97x_dz_energy
- wb97x_tz_energy
- ccsd(t)_cbs_energy
- hf_dz_energy
- hf_tz_energy
- hf_qz_energy
- npno_ccsd(t)_dz_corr_energy
- npno_ccsd(t)_tz_corr_energy
- tpno_ccsd(t)_dz_corr_energy
- mp2_dz_corr_energy
- mp2_tz_corr_energy
- mp2_qz_corr_energy
- wb97x_dz_forces
- wb97x_tz_forces
- wb97x_dz_dipole
- wb97x_tz_dipole
- wb97x_dz_quadrupole
- wb97x_dz_cm5_charges
- wb97x_dz_hirshfeld_charges
- wb97x_tz_mbis_charges
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15447970
url: https://zenodo.org/records/15447970/files/ani1x_dataset_v1.1.hdf5.gz
gz_data_file:
file_name: ani1x_dataset_v1.1.hdf5.gz
length: 3514221240
md5: 0a93b1da5b36298cba7d6b14f7f65ded
hdf5_data_file:
file_name: ani1x_dataset_v1.1.hdf5
md5: b973e519602d24eb4a288e135875ea7e
nc_1000_v1.1:
about: "This provides a curated hdf5 file for a subset of the ANI-1x dataset designed
to be compatible with modelforge. This dataset contains 135 unique records for
1000 total configurations, with a maximum of 10 configurations per record. Note,
individual configurations are partitioned into entries based on the array of atomic
species appearing in sequence in the source data file."
available_properties:
- atomic_numbers
- positions
- wb97x_dz_energy
- wb97x_tz_energy
- ccsd(t)_cbs_energy
- hf_dz_energy
- hf_tz_energy
- hf_qz_energy
- npno_ccsd(t)_dz_corr_energy
- npno_ccsd(t)_tz_corr_energy
- tpno_ccsd(t)_dz_corr_energy
- mp2_dz_corr_energy
- mp2_tz_corr_energy
- mp2_qz_corr_energy
- wb97x_dz_forces
- wb97x_tz_forces
- wb97x_dz_dipole
- wb97x_tz_dipole
- wb97x_dz_quadrupole
- wb97x_dz_cm5_charges
- wb97x_dz_hirshfeld_charges
- wb97x_tz_mbis_charges
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15447763
url: https://zenodo.org/records/15447763/files/ani1x_dataset_v1.1_ntc_1000.hdf5.gz
gz_data_file:
file_name: ani1x_dataset_v1.1_ntc_1000.hdf5.gz
length: 1426717
md5: 4808bdbd49ae3cf7c2049bff439aaa8b
hdf5_data_file:
file_name: ani1x_dataset_v1.1_ntc_1000.hdf5
md5: ac1bc889f45c09b6971f3b56428b61ca
ANI2X (ani2x): The ANI-2x dataset includes properties for small organic molecules that contain H, C, N, O, S, F, and Cl. This dataset contains 9651712 conformers for nearly 200,000 molecules. This will fetch data generated with the ωB97X/6-31G(d) level of theory used in the original ANI-2x paper, calculated using Gaussian 09.
Devereux, C, Zubatyuk, R., Smith, J. et al. Extending the applicability of the ANI deep learning molecular potential to sulfur and halogens. Journal of Chemical Theory and Computation 16.7 (2020): 4192-4202. https://doi.org/10.1021/acs.jctc.0c00121
Fe II (fe_ii): The Fe(II) dataset includes 28834 total configurations of 384 unique Fe(II) organometallic complexes. Specifically, this includes 15568 HS geometries and 13266 LS geometries. These complexes originate from the Cambridge Structural Database (CSD) as curated by Nandy, et al. (Journal of Physical Chemistry Letters (2023), 14 (25), 10.1021/acs.jpclett.3c01214), and were filtered into “computation-ready” complexes, (those where both oxidation states and charges are already specified without hydrogen atoms missing in the structures), following the procedure outlined by Arunachalam, et al. (Journal of Chemical Physics (2022), 157 (18), 10.1063/5.0125700).
Hongni Jin and Kenneth M. Merz Jr, Modeling Fe(II) Complexes Using Neural Networks. Journal of Chemical Theory and Computation 2024 20 (6), 2551-2558 https://dx.doi.org/10.1021/acs.jctc.4c00063
dataset: fe_ii
latest: full_version_v1.1
latest_test: nc_1000_v1.1
description: "This dataset contains 384 unique systems with a total of 28,834 configurations
(note, the original publication states 383 unique systems).
The full Fe(II) dataset includes 28834 total configurations of Fe(II) organometallic complexes.
Specifically, this includes 15568 HS geometries and 13266 LS geometries.
These complexes originate from the Cambridge Structural Database (CSD) as curated by Nandy, et al.
(Journal of Physical Chemistry Letters (2023), 14 (25), 10.1021/acs.jpclett.3c01214),
and were filtered into “computation-ready” complexes, (those where both oxidation states and charges are
already specified without hydrogen atoms missing in the structures), following the procedure outlined by
Arunachalam, et al. (Journal of Chemical Physics (2022), 157 (18), 10.1063/5.0125700)
Citation to the original dataset:
Modeling Fe(II) Complexes Using Neural Networks
Hongni Jin and Kenneth M. Merz Jr.
Journal of Chemical Theory and Computation 2024 20 (6), 2551-2558
DOI: 10.1021/acs.jctc.4c00063
"
atomic_self_energies:
H: -257.8658772400123 * kilojoule_per_mole
C: -897.1371901363243 * kilojoule_per_mole
N: -683.3438581909822 * kilojoule_per_mole
O: -707.3905177027947 * kilojoule_per_mole
P: -445.4451443983543 * kilojoule_per_mole
S: -367.7922055565044 * kilojoule_per_mole
Cl: -227.0568137730898 * kilojoule_per_mole
Fe: 224.48679425562852 * kilojoule_per_mole
nc_1000_v1.1:
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- total_charge
- forces
- energies
- spin_multiplicities
about: "This provides a modelforge curated hdf5 file for the Fe (II) dataset.
This dataset contains 102 unique systems with a total of 1000 configurations
(max of 10 configurations per system). "
remote_dataset:
doi: 10.5281/zenodo.15264766
url: https://zenodo.org/records/15264766/files/fe_II_ntc_1000_v1.1.hdf5.gz
gz_data_file:
length: 1425316
md5: 5337732f01cc99fac8c500c1df7a4b39
file_name: Fe_II_dataset_nc1000_v1.1.hdf5.gz
hdf5_data_file:
md5: 824a03eb589b4bf46d07d12fbfab507d
file_name: Fe_II_dataset_nc1000_v1.1.hdf5
full_version_v1.1:
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- total_charge
- forces
- energies
- spin_multiplicities
remote_dataset:
doi: 10.5281/zenodo.15264721
url: https://zenodo.org/records/15264721/files/fe_II_v1.1.hdf5.gz
gz_data_file:
length: 39631216
md5: 55bc8488a1e115712b0c48a740ad73f1
file_name: Fe_II_dataset_v1.1.hdf5.gz
hdf5_data_file:
md5: 7569fe3c7f8acdef5dc3f6340af51d35
file_name: Fe_II_dataset_v1.1.hdf5
PhAlkEthOH (PhAlkEthOH): Phenyls, Alkanes, Ethers, and Alcohols (OH). The PhAlkEthOH dataset contains a collection of optimized trajectories of linear and cyclic molecules containing phenyl rings, small alkanes, ethers, and alcohols, containing only the elements carbon, oxygen, and hydrogen. For each unique system, configurations correspond to snapshots from the optimization trajectory. All QM datapoints were generated using the B3LYP-D3BJ/DZVP level of theory, the default theory used for force field development by the Open Force Field Initiative.
Bannan CC, Mobley D. ChemPer: An Open Source Tool for Automatically Generating SMIRKS Patterns. ChemRxiv. 2019; https://dx.doi.org/10.26434/chemrxiv.8304578.v1
Wang Y, Fass J, Kaminow B, Herr JE, Rufa D, Zhang I, Pulido I, Henry M, Macdonald HE, Takaba K, Chodera JD. End-to-end differentiable construction of molecular mechanics force fields. Chemical Science. 2022;13(41):12016-33. https://dx.doi.org/10.1039/d2sc02739a
dataset: PhAlkEthOH
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1
description: "PhAlkEthOH: Phenyls, Alkanes, Ethers, and Alcohols (OH)
The PhAlkEthOH dataset contains a collection of optimized trajectories of linear and cyclic molecules
containing phyl rings, small alkanes, ethers, and alcohols containing only elements carbon, oxygen and hydrogen.
For each unique system, configurations correspond to snapshots from the optimization trajectory.
All QM datapoints were generated using B3LYP-D3BJ/DZVP level of theory, the default theory used for force field
development by the Open Force Field Initiative.
The dataset was retrieved from The MolSSI qcarchive.
Related manuscripts:
Bannan CC, Mobley D.
ChemPer: An Open Source Tool for Automatically Generating SMIRKS Patterns.
ChemRxiv. 2019; doi:10.26434/chemrxiv.8304578.v1
Wang Y, Fass J, Kaminow B, Herr JE, Rufa D, Zhang I, Pulido I, Henry M, Macdonald HE, Takaba K, Chodera JD.
End-to-end differentiable construction of molecular mechanics force fields.
Chemical Science. 2022;13(41):12016-33. doi:10.1039/d2sc02739a
Repository used for generating and submitting the dataset via MolSSI qcfractal:
Gokey, T,.,
OpenFF Sandbox CHO PhAlkEthOH v1.0, 2020,
https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2020-09-18-OpenFF-Sandbox-CHO-PhAlkEthOH
"
atomic_self_energies:
H: -1596.6973305434612*kilojoule_per_mole
C: -100059.79872980758*kilojoule_per_mole
O: -197491.36594960644*kilojoule_per_mole
full_dataset_v1.1:
about: 'This provides a curated hdf5 file for the PhAlkEthOH dataset designed to
be compatible with modelforge. This dataset contains 10301 unique records for
1188691 total configurations. This excludes any configurations where the magnitude of any forces
on the atoms are greater than 1 hartree/bohr.'
available_properties:
- atomic_numbers
- positions
- total_charge
- dispersion_correction_energy
- dft_total_energy
- dispersion_correction_gradient
- dispersion_correction_force
- dft_total_gradient
- dft_total_force
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15398204
url: https://zenodo.org/records/15398204/files/PhAlkEthOH_openff_dataset_v1.1.hdf5.gz
gz_data_file:
file_name: PhAlkEthOH_openff_dataset_v1.1.hdf5.gz
length: 5445897672
md5: 5bd91d2533581478d35b1c32472c22a7
hdf5_data_file:
file_name: PhAlkEthOH_openff_dataset_v1.1.hdf5
md5: 643a6ff5387088d7cb0c70b5ba39a027
nc_1000_v1.1:
about: 'This provides a curated hdf5 file for a subset of the PhAlkEthOH dataset
designed to be compatible with modelforge. This dataset contains 101 unique records
for 1000 total configurations, with a maximum of 10 configurations per record.
This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.'
available_properties:
- atomic_numbers
- positions
- total_charge
- dispersion_correction_energy
- dft_total_energy
- dispersion_correction_gradient
- dispersion_correction_force
- dft_total_gradient
- dft_total_force
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15417002
url: https://zenodo.org/records/15417002/files/PhAlkEthOH_openff_dataset_v1.1_ntc_1000.hdf5.gz
gz_data_file:
file_name: PhAlkEthOH_openff_dataset_v1.1_ntc_1000.hdf5.gz
length: 4053133
md5: d80e9ea3318dfeb3a40f2d614ca62dec
hdf5_data_file:
file_name: PhAlkEthOH_openff_dataset_v1.1_ntc_1000.hdf5
md5: 1e9160bfba6e8bf2c4b9677a7992a000
nc_1000_minimal_v1.1:
about: 'This provides a curated hdf5 file for a subset of the PhAlkEthOH dataset
designed to be compatible with modelforge. This dataset contains 1000 unique records
for 1000 total configurations, with only the final configuration of the optimization.
This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.'
available_properties:
- atomic_numbers
- positions
- total_charge
- dispersion_correction_energy
- dft_total_energy
- dispersion_correction_gradient
- dispersion_correction_force
- dft_total_gradient
- dft_total_force
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15419953
url: https://zenodo.org/records/15419953/files/PhAlkEthOH_openff_dataset_v1.1_ntc_1000_minimal.hdf5.gz
gz_data_file:
file_name: PhAlkEthOH_openff_dataset_v1.1_ntc_1000_minimal.hdf5.gz
length: 5277672
md5: 03ac17b57f163baaf996a707ac281eb1
hdf5_data_file:
file_name: PhAlkEthOH_openff_dataset_v1.1_ntc_1000_minimal.hdf5
md5: 612036eca20f9bb715cafd4a897be8d0
full_dataset_minimal_v1.1:
about: 'This provides a curated hdf5 file for the PhAlkEthOH dataset designed to
be compatible with modelforge. This dataset contains 10301 unique records for
10301 total configurations, with only the final configuration of the optimization.
This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.'
available_properties:
- atomic_numbers
- positions
- total_charge
- dispersion_correction_energy
- dft_total_energy
- dispersion_correction_gradient
- dispersion_correction_force
- dft_total_gradient
- dft_total_force
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15418579
url: https://zenodo.org/records/15418579/files/PhAlkEthOH_openff_dataset_v1.1_minimal.hdf5.gz
gz_data_file:
file_name: PhAlkEthOH_openff_dataset_v1.1_minimal.hdf5.gz
length: 38338045
md5: 2b7784da858c566b93d032a1faa00ad3
hdf5_data_file:
file_name: PhAlkEthOH_openff_dataset_v1.1_minimal.hdf5
md5: f35154b3114824cc3d882e9a53436c80
QM9 (qm9): A dataset of 134k small organic molecules, each containing up to 9 heavy atoms (C, O, N, F) and up to 29 atoms in total. It includes properties such as energies, partial charges, and dipole moments.
Ramakrishnan, R., Dral, P., Rupp, M. et al. ‘Quantum chemistry structures and properties of 134 kilo molecules.’ Sci Data 1, 140022 (2014). https://doi.org/10.1038/sdata.2014.22
dataset: qm9
latest: full_dataset_v1.2
latest_test: nc_1000_v1.2
description: "The QM9 dataset includes 133,885 organic molecules with up to nine total heavy atoms (C,O,N,or F; excluding H).
All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry.
Citation:
Ramakrishnan, R., Dral, P., Rupp, M. et al.
'Quantum chemistry structures and properties of 134 kilo molecules.'
Sci Data 1, 140022 (2014).
https://doi.org/10.1038/sdata.2014.22
DOI for dataset: 10.6084/m9.figshare.c.978904.v5"
atomic_self_energies:
H: -1313.4668615546 * kilojoule_per_mole
C: -99366.70745535441 * kilojoule_per_mole
N: -143309.9379722722 * kilojoule_per_mole
O: -197082.0671774158 * kilojoule_per_mole
F: -261811.54555874597 * kilojoule_per_mole
full_dataset_v1.2:
about: "This provides a curated hdf5 file for the qm9 dataset designed to be compatible
with modelforge. This dataset contains 133885 unique records for 133885 total
configurations. Note, the dataset contains only a single configuration per record.
This includes some minor corrections to the v1.1 dataset. Note, the dipole_moment_per_system
and dipole_moment_scalar_per_system properties are calculated from the partial_charges"
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- partial_charges
- polarizability
- dipole_moment_per_system
- dipole_moment_scalar_per_system
- energy_of_homo
- lumo-homo_gap
- zero_point_vibrational_energy
- internal_energy_at_298.15K
- internal_energy_at_0K
- enthalpy_at_298.15K
- free_energy_at_298.15K
- heat_capacity_at_298.15K
- rotational_constants
- harmonic_vibrational_frequencies
- electronic_spatial_extent
remote_dataset:
doi: 10.5281/zenodo.17536462
url: https://zenodo.org/records/17536462/files/qm9_dataset_v1.2.hdf5.gz
gz_data_file:
file_name: qm9_dataset_v1.2.hdf5.gz
length: 301536746
md5: b53d5b83f1f24d7c6aa80612b9bd16dd
hdf5_data_file:
file_name: qm9_dataset_v1.2.hdf5
md5: 60ab35fed8d9a99be059cefb39f1f4b4
full_dataset_v1.1:
about: "This provides a curated hdf5 file for the qm9 dataset designed to be compatible
with modelforge. This dataset contains 133885 unique records for 133885 total
configurations. Note, the dataset contains only a single configuration per record."
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- partial_charges
- polarizability
- dipole_moment_per_system
- dipole_moment_scalar_per_system
- energy_of_homo
- lumo-homo_gap
- zero_point_vibrational_energy
- internal_energy_at_298.15K
- internal_energy_at_0K
- enthalpy_at_298.15K
- free_energy_at_298.15K
- heat_capacity_at_298.15K
- rotational_constants
- harmonic_vibrational_frequencies
- electronic_spatial_extent
remote_dataset:
doi: 10.5281/zenodo.15390655
url: https://zenodo.org/records/15390655/files/qm9_dataset_v1.1.hdf5.gz
gz_data_file:
file_name: qm9_dataset_v1.1.hdf5.gz
length: 301537815
md5: 62d17d98d8143ac34f88bf1300b686c6
hdf5_data_file:
file_name: qm9_dataset_v1.1.hdf5
md5: 04e4c86d59374912849c64e899894719
nc_1000_v1.2:
about: "This provides a curated hdf5 file for a subset of the qm9 dataset designed
to be compatible with modelforge. This dataset contains 1000 unique records for
1000 total configurations. Note, the dataset contains only a single configuration
per record."
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- partial_charges
- polarizability
- dipole_moment_per_system
- dipole_moment_scalar_per_system
- energy_of_homo
- lumo-homo_gap
- zero_point_vibrational_energy
- internal_energy_at_298.15K
- internal_energy_at_0K
- enthalpy_at_298.15K
- free_energy_at_298.15K
- heat_capacity_at_298.15K
- rotational_constants
- harmonic_vibrational_frequencies
- electronic_spatial_extent
remote_dataset:
doi: 10.5281/zenodo.17536526
url: https://zenodo.org/records/17536526/files/qm9_dataset_v1.2_ntc_1000.hdf5.gz
gz_data_file:
file_name: qm9_dataset_v1.2_ntc_1000.hdf5.gz
length: 1923749
md5: a6cf9528b4f2db977b96f7a441ba557c
hdf5_data_file:
file_name: qm9_dataset_v1.2_ntc_1000.hdf5
md5: befb3ef66d74f436ef399bf68eda9b90
nc_1000_v1.1:
about: "This provides a curated hdf5 file for a subset of the qm9 dataset designed
to be compatible with modelforge. This dataset contains 1000 unique records for
1000 total configurations. Note, the dataset contains only a single configuration
per record."
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- partial_charges
- polarizability
- dipole_moment_per_system
- dipole_moment_scalar_per_system
- energy_of_homo
- lumo-homo_gap
- zero_point_vibrational_energy
- internal_energy_at_298.15K
- internal_energy_at_0K
- enthalpy_at_298.15K
- free_energy_at_298.15K
- heat_capacity_at_298.15K
- rotational_constants
- harmonic_vibrational_frequencies
- electronic_spatial_extent
remote_dataset:
doi: 10.5281/zenodo.15390593
url: https://zenodo.org/records/15390593/files/qm9_dataset_v1.1_ntc_1000.hdf5.gz
gz_data_file:
file_name: qm9_dataset_v1.1_ntc_1000.hdf5.gz
length: 1923749
md5: 54a2471bba075fcc2cdfe0b78bc567fa
hdf5_data_file:
file_name: qm9_dataset_v1.1_ntc_1000.hdf5
md5: befb3ef66d74f436ef399bf68eda9b90
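Each version entry above records an `md5` checksum for both the compressed and the uncompressed file, plus a `length` for the compressed file. The following is a minimal sketch of how a cached file could be checked against those values. It assumes that `length` is the file size in bytes, and the helper names `md5_of_file` and `matches_metadata` are hypothetical; they are not part of the modelforge API, whose own caching logic may differ.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_metadata(path, expected_md5, expected_length=None):
    """Check a local file against the md5 (and optionally length) of a YAML entry.

    Assumption: `length` in the metadata is the file size in bytes.
    """
    path = Path(path)
    if not path.is_file():
        return False
    # The size check is cheap, so do it before hashing a potentially large file.
    if expected_length is not None and path.stat().st_size != expected_length:
        return False
    return md5_of_file(path) == expected_md5
```

For the `nc_1000_v1.1` entry, a call such as `matches_metadata("qm9_dataset_v1.1_ntc_1000.hdf5.gz", "54a2471bba075fcc2cdfe0b78bc567fa", 1923749)` would return `True` only if the local file matches both recorded values.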
SPICE 1 (spice1): The SPICE dataset contains 1.1 million conformations for 19238 unique small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements (H, Li, C, N, O, F, Na, Mg, P, S, Cl, K, Ca, Br, I), charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory using Psi4 1.4.1, along with other useful quantities such as multipole moments and bond orders.
Eastman, P., Behara, P.K., Dotson, D.L. et al. SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials. Sci Data 10, 11 (2023). https://doi.org/10.1038/s41597-022-01882-6
dataset: spice1
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1
description: "The SPICE dataset contains 1.1 million conformations for a diverse set of small molecules,
dimers, dipeptides, and solvated amino acids. It includes 15 elements (H, Li, C, N, O, F, Na, Mg, P, S, Cl, K, Ca, Br, I),
charged and uncharged molecules, and a wide range of covalent and non-covalent interactions.
It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory,
using Psi4 1.4.1 along with other useful quantities such as multipole moments and bond orders.
Reference:
Eastman, P., Behara, P.K., Dotson, D.L. et al. SPICE,
A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials.
Sci Data 10, 11 (2023). https://doi.org/10.1038/s41597-022-01882-6
Dataset DOI:
https://doi.org/10.5281/zenodo.8222043"
atomic_self_energies:
H: -1576.5513678678228*kilojoule_per_mole
Li: -19221.76009670645*kilojoule_per_mole
C: -100114.38959681295*kilojoule_per_mole
N: -143829.94579288512*kilojoule_per_mole
O: -197627.70305727186*kilojoule_per_mole
F: -262291.3177197502*kilojoule_per_mole
Na: -425714.1444283384*kilojoule_per_mole
Mg: -523447.29044746497*kilojoule_per_mole
P: -896460.9044578229*kilojoule_per_mole
S: -1045607.5830439369*kilojoule_per_mole
Cl: -1208414.168327362*kilojoule_per_mole
K: -1574847.955709633*kilojoule_per_mole
Ca: -1777543.2887296947*kilojoule_per_mole
Br: -6758454.442850963*kilojoule_per_mole
I: -781842.6578771132*kilojoule_per_mole
full_dataset_v1.1:
about: 'This provides a curated hdf5 file for the SPICE1 dataset designed to be
compatible with modelforge. This dataset contains 19238 unique records for 1110165
total configurations. '
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_total_energy
- dft_total_force
- formation_energy
- mbis_charges
- mbis_dipoles
- mbis_quadrupoles
- mbis_octupoles
- scf_dipole
- scf_quadrupole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15455029
gz_data_file:
file_name: spice_1_dataset_v1.1.hdf5.gz
length: 10843511030
md5: da9f4902d64fe957e1f3dd0a6a2463d0
hdf5_data_file:
file_name: spice_1_dataset_v1.1.hdf5
md5: ff7583aeafd7991bc7c04505a5c0ee02
url: https://zenodo.org/records/15455029/files/spice_1_dataset_v1.1.hdf5.gz
nc_1000_v1.1:
about: 'This provides a curated hdf5 file for the SPICE1 dataset designed to be
compatible with modelforge. This dataset contains 100 unique records for 1000
total configurations, with a maximum of 10 configurations per record. '
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_total_energy
- dft_total_force
- formation_energy
- mbis_charges
- mbis_dipoles
- mbis_quadrupoles
- mbis_octupoles
- scf_dipole
- scf_quadrupole
hdf5_schema: 2
remote_dataset:
url: https://zenodo.org/records/15461060/files/spice_1_dataset_v1.1_ntc_1000.hdf5.gz
doi: 10.5281/zenodo.15461060
gz_data_file:
file_name: spice_1_dataset_v1.1_ntc_1000.hdf5.gz
length: 14716353
md5: 4d2dc9fd7b498f5f6bba6e8f4a5ffcfc
hdf5_data_file:
file_name: spice_1_dataset_v1.1_ntc_1000.hdf5
md5: 193f675e419bdf92883e1c4606b240c5
nc_1000_HCNOFClS_v1.1:
about: 'This provides a curated hdf5 file for a subset of the SPICE1 dataset designed
to be compatible with modelforge. This dataset contains 100 unique records for
1000 total configurations, with a maximum of 10 configurations per record. The
dataset is limited to the elements that are compatible with ANI2x: [H, C,
N, O, F, Cl, S]'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_total_energy
- dft_total_force
- formation_energy
- mbis_charges
- mbis_dipoles
- mbis_quadrupoles
- mbis_octupoles
- scf_dipole
- scf_quadrupole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15461463
gz_data_file:
file_name: spice_1_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz
length: 14328739
md5: 5a7e76f17694b1eb7f6368230589b586
hdf5_data_file:
file_name: spice_1_dataset_v1.1_ntc_1000_HCNOFClS.hdf5
md5: 3864fe361fdaa178a9582b8e6afff5c4
url: https://zenodo.org/records/15461463/files/spice_1_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz
full_dataset_HCNOFClS_v1.1:
about: 'This provides a curated hdf5 file for the SPICE1 dataset designed to be
compatible with modelforge. This dataset contains 16565 unique records for 976408
total configurations. The dataset is limited to the elements that are compatible
with ANI2x: [H, C, N, O, F, Cl, S].'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_total_energy
- dft_total_force
- formation_energy
- mbis_charges
- mbis_dipoles
- mbis_quadrupoles
- mbis_octupoles
- scf_dipole
- scf_quadrupole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15461488
gz_data_file:
file_name: spice_1_dataset_v1.1_HCNOFClS.hdf5.gz
length: 9712216033
md5: 961a71318f1bf6ef7e5d042d7aa4fc1c
hdf5_data_file:
file_name: spice_1_dataset_v1.1_HCNOFClS.hdf5
md5: 28e1c57d6552be297e2accb3c35172c0
url: https://zenodo.org/records/15461488/files/spice_1_dataset_v1.1_HCNOFClS.hdf5.gz
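The `atomic_self_energies` entries above are stored as strings of the form `value*unit`. As a rough illustration of how such values could be used, the sketch below parses the strings and sums the per-element self-energies for one system given its atomic numbers. The function names and the small element lookup table are hypothetical and chosen for this example only; modelforge's own unit handling may differ.

```python
from collections import Counter

# Atomic number to symbol, restricted to the elements used in this example.
ATOMIC_NUMBER_TO_SYMBOL = {1: "H", 6: "C", 8: "O"}

# Values copied from the spice1 atomic_self_energies entry above.
SELF_ENERGIES = {
    "H": "-1576.5513678678228*kilojoule_per_mole",
    "C": "-100114.38959681295*kilojoule_per_mole",
    "O": "-197627.70305727186*kilojoule_per_mole",
}


def parse_energy(entry):
    """Split a 'value*unit' string into (float value, unit string)."""
    value, unit = entry.split("*")
    return float(value.strip()), unit.strip()


def summed_self_energy(atomic_numbers, self_energies=SELF_ENERGIES):
    """Sum the per-element self-energies over all atoms of one system."""
    total = 0.0
    units = set()
    for z, count in Counter(atomic_numbers).items():
        value, unit = parse_energy(self_energies[ATOMIC_NUMBER_TO_SYMBOL[z]])
        total += count * value
        units.add(unit)
    if len(units) != 1:
        raise ValueError("expected exactly one unit across all entries")
    return total, units.pop()
```

For a water molecule, `summed_self_energy([8, 1, 1])` adds one oxygen and two hydrogen contributions and returns the total together with the unit string `kilojoule_per_mole`.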
SPICE 1 OpenFF (spice1_openff): The full SPICE 1 OpenFF dataset is a subset of the SPICE 1 dataset and includes 18782 unique records for 1106949 total configurations covering 14 different elements (H, Li, C, N, O, F, Na, Mg, P, S, Cl, K, Ca, Br). All QM datapoints were generated using the B3LYP-D3BJ/DZVP level of theory, as this is the default theory used for force field development by the Open Force Field Initiative; the original SPICE 1 dataset was generated using the ωB97M-D3(BJ)/def2-TZVPPD level of theory.
dataset: spice1_openff
latest: full_dataset_v2.1
latest_test: nc_1000_v2.1
description: "Small-molecule/Protein Interaction Chemical Energies (SPICE), calculated at the default OpenFF level of theory.
The SPICE dataset contains conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids
for 14 different elements (H, Li, C, N, O, F, Na, Mg, P, S, Cl, K, Ca, Br) in both charged and uncharged molecules.
Note, the original SPICE 1 dataset also includes Iodine (I), but systems with this element are not included in this
dataset, as a small subset of the SPICE1 dataset was not included in the OpenFF dataset, namely:
-SPICE Ion Pairs Single Points Dataset v1.1
-SPICE DES370K Single Points Dataset Supplement v1.0
and a subset of calculations were not able to be converged fully.
The full SPICE 1 OpenFF dataset includes 18782 unique records for 1106949 total configurations while the original
SPICE 1 dataset includes 19238 unique records for 1110165 configurations
(both excluding those with forces > 1 hartree/bohr).
All QM datapoints retrieved were generated using B3LYP-D3BJ/DZVP level of theory as this is the default theory used
for force field development by the Open Force Field Initiative; the original SPICE 1 dataset
was generated using ωB97M-D3(BJ)/def2-TZVPPD level of theory.
Reference to the original SPICE 1 dataset publication:
Eastman, P., Behara, P.K., Dotson, D.L. et al. SPICE,
A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials.
Sci Data 10, 11 (2023). https://doi.org/10.1038/s41597-022-01882-6
DOI to original SPICE 1 dataset (not at the OpenFF level of theory):
https://doi.org/10.5281/zenodo.8222043"
atomic_self_energies:
H: -1581.5384137007973*kilojoule_per_mole
Li: -19322.614940687432*kilojoule_per_mole
C: -100058.83756708907*kilojoule_per_mole
N: -143747.52575867812*kilojoule_per_mole
O: -197522.95360021706*kilojoule_per_mole
F: -262187.61306363455*kilojoule_per_mole
Na: -425595.8497308719*kilojoule_per_mole
Mg: -523296.52031790506*kilojoule_per_mole
P: -896206.0276563794*kilojoule_per_mole
S: -1045356.0997863387*kilojoule_per_mole
Cl: -1208153.2961282134*kilojoule_per_mole
K: -1574540.1515238197*kilojoule_per_mole
Ca: -1777205.6941588672*kilojoule_per_mole
Br: -6757224.691339369*kilojoule_per_mole
full_dataset_v2.1:
about: 'This provides a curated hdf5 file for the SPICE1 openff dataset designed
to be compatible with modelforge. This dataset contains 18782 unique records for
1106949 total configurations.
This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_energy
- dispersion_correction_energy
- dft_total_energy
- dft_force
- dispersion_correction_force
- dft_total_force
- mbis_charges
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15475919
gz_data_file:
file_name: spice_1_openff_dataset_v2.1.hdf5.gz
length: 3373088790
md5: 25bc8d0bdf77a6667a26964a09e082c7
hdf5_data_file:
file_name: spice_1_openff_dataset_v2.1.hdf5
md5: 65ded4727fb49a1fba6f7224e5cf43ec
url: https://zenodo.org/records/15475919/files/spice_1_openff_dataset_v2.1.hdf5.gz
nc_1000_v2.1:
about: 'This provides a curated hdf5 file for the SPICE1 openff dataset designed
to be compatible with modelforge. This dataset contains 100 unique records for
1000 total configurations, with a maximum of 10 configurations per record.
This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_energy
- dispersion_correction_energy
- dft_total_energy
- dft_force
- dispersion_correction_force
- dft_total_force
- mbis_charges
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15448194
gz_data_file:
file_name: spice_1_openff_dataset_v2.1_ntc_1000.hdf5.gz
length: 5770925
md5: 8b2729a28aa947576e485566926498bb
hdf5_data_file:
file_name: spice_1_openff_dataset_v2.1_ntc_1000.hdf5
md5: 6187fcc7ff5d95e6608beecb09de9e77
url: https://zenodo.org/records/15448194/files/spice_1_openff_dataset_v2.1_ntc_1000.hdf5.gz
nc_1000_HCNOFClS_v2.1:
about: 'This provides a curated hdf5 file for a subset of the SPICE1 openff dataset
designed to be compatible with modelforge. This dataset contains 100 unique records
for 1000 total configurations, with a maximum of 10 configurations per record.
The dataset is limited to the elements that are compatible with ANI2x NNP: [H,
C, N, O, F, Cl, S]. This excludes any configurations where the magnitude of any forces
on the atoms are greater than 1 hartree/bohr.'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_energy
- dispersion_correction_energy
- dft_total_energy
- dft_force
- dispersion_correction_force
- dft_total_force
- mbis_charges
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15476628
gz_data_file:
file_name: spice_1_openff_dataset_v2.1_ntc_1000_HCNOFClS.hdf5.gz
length: 5770934
md5: 7d9f626252e30dd6902da62b58505799
hdf5_data_file:
file_name: spice_1_openff_dataset_v2.1_ntc_1000_HCNOFClS.hdf5
md5: 6187fcc7ff5d95e6608beecb09de9e77
url: https://zenodo.org/records/15476628/files/spice_1_openff_dataset_v2.1_ntc_1000_HCNOFClS.hdf5.gz
full_dataset_HCNOFClS_v2.1:
about: 'This provides a curated hdf5 file for the SPICE1 openff dataset designed
to be compatible with modelforge. This dataset contains 16560 unique records for
996941 total configurations. The dataset is limited to the elements that are compatible with ANI2x NNP:
[H, C, N, O, F, Cl, S]. This excludes any configurations where the magnitude of any forces
on the atoms are greater than 1 hartree/bohr.'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_energy
- dispersion_correction_energy
- dft_total_energy
- dft_force
- dispersion_correction_force
- dft_total_force
- mbis_charges
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15476646
gz_data_file:
file_name: spice_1_openff_dataset_v2.1_HCNOFClS.hdf5.gz
length: 3056052148
md5: 172fa98f0abdaaf1a9b64812dc70cd81
hdf5_data_file:
file_name: spice_1_openff_dataset_v2.1_HCNOFClS.hdf5
md5: 5d8a9e7b005f14627eea779630446430
url: https://zenodo.org/records/15476646/files/spice_1_openff_dataset_v2.1_HCNOFClS.hdf5.gz
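Each metadata file defines `latest` and `latest_test` keys that point to a concrete version name. The sketch below shows one way a `version_select` value could be resolved against such metadata once it has been loaded into a dictionary. The function `resolve_version` is hypothetical, and the nesting of the example dictionary is inferred from the listing above; this is not a description of how modelforge performs the lookup internally.

```python
# Trimmed copy of the spice1_openff metadata shown above (only a few keys kept).
SPICE1_OPENFF = {
    "dataset": "spice1_openff",
    "latest": "full_dataset_v2.1",
    "latest_test": "nc_1000_v2.1",
    "full_dataset_v2.1": {
        "hdf5_schema": 2,
        "remote_dataset": {"doi": "10.5281/zenodo.15475919"},
    },
    "nc_1000_v2.1": {
        "hdf5_schema": 2,
        "remote_dataset": {"doi": "10.5281/zenodo.15448194"},
    },
}

ALIASES = ("latest", "latest_test")


def resolve_version(metadata, version_select):
    """Map an alias to its concrete version name and return (name, entry)."""
    key = metadata[version_select] if version_select in ALIASES else version_select
    entry = metadata.get(key)
    # Top-level strings such as "dataset" are not version entries.
    if not isinstance(entry, dict):
        raise KeyError(f"unknown dataset version: {key!r}")
    return key, entry
```

With this metadata, `resolve_version(SPICE1_OPENFF, "latest")` yields the `full_dataset_v2.1` entry, while an explicit name such as `"nc_1000_v2.1"` is looked up directly.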
SPICE 2 (spice2): The SPICE2 dataset contains conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 17 elements (H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br, I), charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory, using Psi4 along with other useful quantities such as multipole moments and bond orders.
Eastman, P., Pritchard, B. P., Chodera, J. D., & Markland, T. E. Nutmeg and SPICE: models and data for biomolecular machine learning. Journal of chemical theory and computation, 20(19), 8583-8593 (2024). https://doi.org/10.1021/acs.jctc.4c00794
dataset: spice2
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1
description: "The SPICE2 dataset contains conformations for a diverse set of small molecules,
dimers, dipeptides, and solvated amino acids. It includes 17 elements, charged and
uncharged molecules, and a wide range of covalent and non-covalent interactions.
It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory,
using Psi4 along with other useful quantities such as multipole moments and bond orders.
SPICE 2.0.1 zenodo release:
https://zenodo.org/records/10835749
SPICE 2 github repository:
https://github.com/openmm/spice-dataset
Reference to SPICE 2 publication:
Eastman, P., Pritchard, B. P., Chodera, J. D., & Markland, T. E.
Nutmeg and SPICE: models and data for biomolecular machine learning.
Journal of chemical theory and computation, 20(19), 8583-8593 (2024).
https://doi.org/10.1021/acs.jctc.4c00794
Reference to original SPICE publication:
Eastman, P., Behara, P.K., Dotson, D.L. et al.
SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials.
Sci Data 10, 11 (2023). https://doi.org/10.1038/s41597-022-01882-6
"
atomic_self_energies:
H: -1579.292447611522*kilojoule_per_mole
Li: -19206.917958817387*kilojoule_per_mole
B: -65569.16994509766*kilojoule_per_mole
C: -100112.36246928677*kilojoule_per_mole
N: -143837.09401820396*kilojoule_per_mole
O: -197640.6741826767*kilojoule_per_mole
F: -262292.74039535556*kilojoule_per_mole
Na: -425700.2376810506*kilojoule_per_mole
Mg: -523428.7340381498*kilojoule_per_mole
Si: -760410.2547546176*kilojoule_per_mole
P: -896470.875967255*kilojoule_per_mole
S: -1045588.2785379887*kilojoule_per_mole
Cl: -1208421.9382829564*kilojoule_per_mole
K: -1574833.7856186344*kilojoule_per_mole
Ca: -1777525.2197214952*kilojoule_per_mole
Br: -6758450.475943951*kilojoule_per_mole
I: -781827.0797396759*kilojoule_per_mole
full_dataset_v1.1:
about: This provides a curated hdf5 file for the SPICE2 dataset designed to be compatible
with modelforge. This dataset contains 113985 unique records for 2008126 total
configurations.
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_total_energy
- dft_total_force
- formation_energy
- mbis_charges
- mbis_dipoles
- mbis_quadrupoles
- mbis_octupoles
- scf_dipole
- scf_quadrupole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15419983
url: https://zenodo.org/records/15419983/files/spice_2_dataset_v1.1.hdf5.gz
gz_data_file:
file_name: spice_2_dataset_v1.1.hdf5.gz
length: 25532396063
md5: 5ccaba21944da5e3f86f19389767b48f
hdf5_data_file:
file_name: spice_2_dataset_v1.1.hdf5
md5: e2628a66e1f8f428151bcdbbfb1c41a7
nc_1000_v1.1:
about: This provides a curated hdf5 file for a subset of the SPICE2 dataset designed
to be compatible with modelforge. This dataset contains 403 unique records for
1000 total configurations, with a maximum of 10 configurations per record.
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_total_energy
- dft_total_force
- formation_energy
- mbis_charges
- mbis_dipoles
- mbis_quadrupoles
- mbis_octupoles
- scf_dipole
- scf_quadrupole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15421048
url: https://zenodo.org/records/15421048/files/spice_2_dataset_v1.1_ntc_1000.hdf5.gz
gz_data_file:
file_name: spice_2_dataset_v1.1_ntc_1000.hdf5.gz
length: 26204494
md5: 6b62d3410bba634bd0709ef750c78ca4
hdf5_data_file:
file_name: spice_2_dataset_v1.1_ntc_1000.hdf5
md5: bcba4cdbfb9225306a6bfed01c28b364
full_dataset_HCNOFClS_v1.1:
about: 'This provides a curated hdf5 file for the SPICE2 dataset designed to be
compatible with modelforge. This dataset contains 97279 unique records for 1620239
total configurations. The dataset is limited to the elements that are compatible
with the ANI2x NNP architecture: [H, C, N, O, F, Cl, S].'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_total_energy
- dft_total_force
- formation_energy
- mbis_charges
- mbis_dipoles
- mbis_quadrupoles
- mbis_octupoles
- scf_dipole
- scf_quadrupole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15428145
url: https://zenodo.org/records/15428145/files/spice_2_dataset_v1.1_HCNOFClS.hdf5.gz
gz_data_file:
file_name: spice_2_dataset_v1.1_HCNOFClS.hdf5.gz
length: 21093537996
md5: 37c512f923e8b7ba62b4a59740d190c7
hdf5_data_file:
file_name: spice_2_dataset_v1.1_HCNOFClS.hdf5
md5: 178e4b6f1d812895ae23e3823f032ea9
nc_1000_HCNOFClS_v1.1:
about: 'This provides a curated hdf5 file for a subset of the SPICE2 dataset designed
to be compatible with modelforge. This dataset contains 374 unique records for
1000 total configurations, with a maximum of 10 configurations per record. The
dataset is limited to the elements that are compatible with ANI2x NNP architecture:
[H, C, N, O, F, Cl, S].'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_total_energy
- dft_total_force
- formation_energy
- mbis_charges
- mbis_dipoles
- mbis_quadrupoles
- mbis_octupoles
- scf_dipole
- scf_quadrupole
hdf5_schema: 2
remote_dataset:
url: https://zenodo.org/records/15429129/files/spice_2_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz
doi: 10.5281/zenodo.15429129
gz_data_file:
file_name: spice_2_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz
length: 26256317
md5: 3e0d774ffe05c2f6653d4249bdf6de98
hdf5_data_file:
file_name: spice_2_dataset_v1.1_ntc_1000_HCNOFClS.hdf5
md5: 1e9ef56d71de23223ed73d2e4db510f5
full_dataset_HCNOF_v1.1:
about: This provides a curated hdf5 file for the SPICE2 dataset designed to be compatible
with modelforge. This dataset contains 57037 unique records for 928073 total configurations.
The dataset is limited to the elements ['H', 'C', 'N', 'O', 'F'].
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_total_energy
- dft_total_force
- formation_energy
- mbis_charges
- mbis_dipoles
- mbis_quadrupoles
- mbis_octupoles
- scf_dipole
- scf_quadrupole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15579090
url: https://zenodo.org/records/15579090/files/spice_2_dataset_v1.1_HCNOF.hdf5.gz
gz_data_file:
file_name: spice_2_dataset_v1.1_HCNOF.hdf5.gz
length: 11712091901
md5: 56d109fdb026f8a1224dfff76abbd79a
hdf5_data_file:
file_name: spice_2_dataset_v1.1_HCNOF.hdf5
md5: afa15240f07082517ec6155cc35093e7
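Several of the versions above are limited to a fixed set of elements, for example [H, C, N, O, F, Cl, S] for the ANI2x-compatible subsets. A minimal sketch of such a restriction, assuming each record exposes its `atomic_numbers` as a list of integers, could look as follows. The record names and helper functions are invented for illustration and do not reflect how the curated subsets were actually produced.

```python
# Atomic numbers for H, C, N, O, F, S, Cl (the ANI2x-compatible element set).
ANI2X_ELEMENTS = frozenset({1, 6, 7, 8, 9, 16, 17})


def within_element_subset(atomic_numbers, allowed=ANI2X_ELEMENTS):
    """Return True if every atom in the record belongs to the allowed set."""
    return set(atomic_numbers) <= set(allowed)


def filter_records(records, allowed=ANI2X_ELEMENTS):
    """Keep only the records whose atoms all belong to the allowed set.

    `records` maps a record name to its list of atomic numbers.
    """
    return {
        name: numbers
        for name, numbers in records.items()
        if within_element_subset(numbers, allowed)
    }


# Hypothetical records: water, chloromethane, and lithium fluoride.
example_records = {
    "water": [8, 1, 1],
    "chloromethane": [6, 1, 1, 1, 17],
    "lithium_fluoride": [3, 9],
}
kept = filter_records(example_records)
```

Here `kept` retains the water and chloromethane records and drops lithium fluoride, because lithium is outside the allowed set.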
SPICE 2 OpenFF (spice2_openff): The SPICE 2 OpenFF dataset is a subset of the SPICE 2 dataset, and includes 112628 unique records for 1971769 total configurations for 16 elements (H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br) in both charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. All QM datapoints were generated using B3LYP-D3BJ/DZVP level of theory as this is the default theory used for force field development by the Open Force Field Initiative.
dataset: spice2_openff
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1
description: "Small-molecule/Protein Interaction Chemical Energies (SPICE), calculated at the default OpenFF level of theory.
SPICE 2 Openff includes 16 elements (H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br) in both
charged and uncharged molecules, and a wide range of covalent and non-covalent interactions.
Note, the original SPICE 2 dataset also includes Iodine (I), but systems with this element are not included in this
dataset, as a small subset of the SPICE2 dataset was not included in the OpenFF dataset, namely:
-SPICE Ion Pairs Single Points Dataset v1.1
-SPICE DES370K Single Points Dataset Supplement v1.0
and a subset of calculations were not able to be converged fully.
The full SPICE 2 OpenFF dataset includes 112628 unique records for 1971769 total configurations while the
original SPICE 2 dataset includes 113985 unique records for 2008126 total configurations
(both excluding those with forces > 1 hartree/bohr).
All datapoints in the SPICE 2 OpenFF dataset were generated using B3LYP-D3BJ/DZVP level of theory, as this is
the default theory used for force field development by the Open Force Field Initiative; the original SPICE 2 dataset
was generated using ωB97M-D3(BJ)/def2-TZVPPD level of theory.
Reference to original SPICE 2 publication:
Eastman, P., Pritchard, B. P., Chodera, J. D., & Markland, T. E.
Nutmeg and SPICE: models and data for biomolecular machine learning.
Journal of chemical theory and computation, 20(19), 8583-8593 (2024).
https://doi.org/10.1021/acs.jctc.4c00794
Reference to the original SPICE 1 publication:
Eastman, P., Behara, P.K., Dotson, D.L. et al. SPICE,
A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials.
Sci Data 10, 11 (2023).
https://doi.org/10.1038/s41597-022-01882-6
DOI to original SPICE 1 and 2 datasets (not at the OpenFF level of theory):
10.5281/zenodo.7258939"
atomic_self_energies: # these need to be replaced
H: -1583.7235381559833*kilojoule_per_mole
Li: -19347.515361076046*kilojoule_per_mole
B: -65529.65768442646*kilojoule_per_mole
C: -100057.69285084814*kilojoule_per_mole
N: -143754.50034055635*kilojoule_per_mole
O: -197534.06499229133*kilojoule_per_mole
F: -262187.95257544337*kilojoule_per_mole
Na: -425581.9250584926*kilojoule_per_mole
Mg: -523280.8560382089*kilojoule_per_mole
Si: -760216.8701576393*kilojoule_per_mole
P: -896215.2333703283*kilojoule_per_mole
S: -1045334.9682307948*kilojoule_per_mole
Cl: -1208159.74701762*kilojoule_per_mole
K: -1574526.5435217225*kilojoule_per_mole
Ca: -1777190.5165390158*kilojoule_per_mole
Br: -6757220.35723592*kilojoule_per_mole
nc_1000_v1.1:
about: 'This provides a curated hdf5 file for the SPICE2 openff dataset designed
to be compatible with modelforge. This dataset contains 100 unique records for
1000 total configurations, with a maximum of 10 configurations per record.
This excludes any configurations where the magnitude of any forces on the
atoms are greater than 1 hartree/bohr.'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_energy
- dispersion_correction_energy
- dft_total_energy
- dft_force
- dispersion_correction_force
- dft_total_force
- mbis_charges
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15476920
gz_data_file:
file_name: spice_2_openff_dataset_v1.1_ntc_1000.hdf5.gz
length: 5717857
md5: 957c1d89fb698d8e195eaf2b7bca2362
hdf5_data_file:
file_name: spice_2_openff_dataset_v1.1_ntc_1000.hdf5
md5: 51f0f237be809c764db2585d9a541be6
url: https://zenodo.org/records/15476920/files/spice_2_openff_dataset_v1.1_ntc_1000.hdf5.gz
nc_1000_HCNOFClS_v1.1:
about: 'This provides a curated hdf5 file for a subset of the SPICE2 openff dataset
designed to be compatible with modelforge. This dataset contains 100 unique records
for 1000 total configurations, with a maximum of 10 configurations per record.
This excludes any configurations where the magnitude of any forces on the atoms are greater than 1 hartree/bohr.
The dataset is limited to the elements that are compatible with ANI2x NNP: [H, C, N, O, F, Cl, S]'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_energy
- dispersion_correction_energy
- dft_total_energy
- dft_force
- dispersion_correction_force
- dft_total_force
- mbis_charges
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15477012
gz_data_file:
file_name: spice_2_openff_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz
length: 5717866
md5: ae8cc23e52692ff90b90378d6a7ea226
hdf5_data_file:
file_name: spice_2_openff_dataset_v1.1_ntc_1000_HCNOFClS.hdf5
md5: 51f0f237be809c764db2585d9a541be6
url: https://zenodo.org/records/15477012/files/spice_2_openff_dataset_v1.1_ntc_1000_HCNOFClS.hdf5.gz
full_dataset_HCNOFClS_v1.1:
about: 'This provides a curated hdf5 file for the SPICE2 openff dataset designed
to be compatible with modelforge. This dataset contains 97274 unique records for
1620018 total configurations. This excludes any configurations where the
magnitude of any forces on the atoms are greater than 1 hartree/bohr.
The dataset is limited to the elements that are compatible with ANI2x NNP: [H, C, N, O, F, Cl, S]'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_energy
- dispersion_correction_energy
- dft_total_energy
- dft_force
- dispersion_correction_force
- dft_total_force
- mbis_charges
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15477069
gz_data_file:
file_name: spice_2_openff_dataset_v1.1_HCNOFClS.hdf5.gz
length: 5912909189
md5: dcae9a6853e964ec4d0d3f0201babbce
hdf5_data_file:
file_name: spice_2_openff_dataset_v1.1_HCNOFClS.hdf5
md5: 75839044d1d5e846bd92c8035323ef69
url: https://zenodo.org/records/15477069/files/spice_2_openff_dataset_v1.1_HCNOFClS.hdf5.gz
full_dataset_v1.1:
about: 'This provides a curated hdf5 file for the SPICE2 openff dataset designed
to be compatible with modelforge. This dataset contains 112628 unique records
for 1971769 total configurations. This excludes any configurations where
the magnitude of any forces on the atoms are greater than 1 hartree/bohr.
'
available_properties:
- atomic_numbers
- positions
- total_charge
- dft_energy
- dispersion_correction_energy
- dft_total_energy
- dft_force
- dispersion_correction_force
- dft_total_force
- mbis_charges
- scf_dipole
hdf5_schema: 2
remote_dataset:
doi: 10.5281/zenodo.15477437
gz_data_file:
file_name: spice_2_openff_dataset_v1.1.hdf5.gz
length: 7133416023
md5: 5896c787315a6473df14db219b7ce5ef
hdf5_data_file:
file_name: spice_2_openff_dataset_v1.1.hdf5
md5: 7cb903f659447fb838083fd523d14e0a
url: https://zenodo.org/records/15477437/files/spice_2_openff_dataset_v1.1.hdf5.gz
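The OpenFF versions above exclude configurations in which the magnitude of any force on an atom exceeds 1 hartree/bohr. The sketch below illustrates one reading of that criterion. It assumes forces are given per atom as (x, y, z) components in hartree/bohr and that "magnitude" means the Euclidean norm of each atom's force vector; both points are assumptions, and the function names are hypothetical.

```python
import math


def max_force_magnitude(forces):
    """Return the largest per-atom force norm in one configuration."""
    return max(math.sqrt(fx * fx + fy * fy + fz * fz) for fx, fy, fz in forces)


def keep_configuration(forces, threshold=1.0):
    """Keep a configuration unless some atom's force norm exceeds the threshold.

    Assumption: `forces` and `threshold` are both in hartree/bohr.
    """
    return max_force_magnitude(forces) <= threshold


def filter_configurations(configurations, threshold=1.0):
    """Return only the configurations that pass the force threshold."""
    return [forces for forces in configurations if keep_configuration(forces, threshold)]


# Two hypothetical two-atom configurations; the second has one large force.
configurations = [
    [(0.1, 0.0, 0.0), (0.0, 0.2, 0.0)],
    [(0.9, 0.9, 0.0), (0.0, 0.0, 0.1)],
]
kept_configurations = filter_configurations(configurations)
```

Only the first configuration is kept, since the first atom of the second configuration has a force norm above 1 hartree/bohr.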
tmQM (tmqm): The tmQM dataset contains the geometries and properties of 108,541 mononuclear complexes extracted from the Cambridge Structural Database, including Werner, bioinorganic, and organometallic complexes based on a large variety of organic ligands and 30 transition metals (the 3d, 4d, and 5d from groups 3 to 12). All complexes are closed-shell, with a formal charge in the range {+1, 0, −1}e.
David Balcells and Bastian Bjerkem Skjelstad, tmQM Dataset—Quantum Geometries and Properties of 86k Transition Metal Complexes. Journal of Chemical Information and Modeling 2020, 60 (12), 6135-6146. https://dx.doi.org/10.1021/acs.jcim.0c01041
dataset: tmqm
latest: full_dataset_v1.1
latest_test: nc_1000_v1.1
description: "The tmQM dataset contains the geometries and properties of 108,541 (86,665 in the original published set)
mononuclear complexes extracted from the Cambridge Structural Database, including Werner, bioinorganic,
and organometallic complexes based on a large variety of organic ligands and 30 transition metals
(the 3d, 4d, and 5d from groups 3 to 12).
All complexes are closed-shell, with a formal charge in the range {+1, 0, −1}e
Original Citation:
David Balcells and Bastian Bjerkem Skjelstad,
tmQM Dataset—Quantum Geometries and Properties of 86k Transition Metal Complexes
Journal of Chemical Information and Modeling 2020 60 (12), 6135-6146
DOI: 10.1021/acs.jcim.0c01041"
atomic_self_energies:
H: -1588.690123425219 * kilojoule_per_mole
B: -65302.33351128112 * kilojoule_per_mole
C: -100005.01654855655 * kilojoule_per_mole
N: -143654.56892638578 * kilojoule_per_mole
O: -197361.76171021158 * kilojoule_per_mole
F: -261926.20424903592 * kilojoule_per_mole
Si: -760035.7764038445 * kilojoule_per_mole
P: -896075.6280215026 * kilojoule_per_mole
S: -1045229.0663264447 * kilojoule_per_mole
Cl: -1208038.6914349555 * kilojoule_per_mole
Sc: -1997181.018901612 * kilojoule_per_mole
Ti: -2230278.4864245243 * kilojoule_per_mole
V: -2478389.354471244 * kilojoule_per_mole
Cr: -2741967.2994972193 * kilojoule_per_mole
Mn: -3021546.098466564 * kilojoule_per_mole
Fe: -3317395.5973328506 * kilojoule_per_mole
Co: -3629935.0938135427 * kilojoule_per_mole
Ni: -3959571.3270608196 * kilojoule_per_mole
Cu: -4306402.576897981 * kilojoule_per_mole
Zn: -4671113.922983311 * kilojoule_per_mole
As: -5869526.931994888 * kilojoule_per_mole
Se: -6304454.897949699 * kilojoule_per_mole
Br: -6757541.6132786125 * kilojoule_per_mole
Y: -100773.37555590154 * kilojoule_per_mole
Zr: -123709.71011983423 * kilojoule_per_mole
Nb: -149762.5718722473 * kilojoule_per_mole
Mo: -179149.19860244964 * kilojoule_per_mole
Tc: -212135.93903845942 * kilojoule_per_mole
Ru: -248990.05884762504 * kilojoule_per_mole
Rh: -290061.85478664236 * kilojoule_per_mole
Pd: -335541.5978772224 * kilojoule_per_mole
Ag: -385322.4473000328 * kilojoule_per_mole
Cd: -440036.922094555 * kilojoule_per_mole
I: -781538.4859057926 * kilojoule_per_mole
La: -82991.16536291114 * kilojoule_per_mole
Hf: -126278.27589562583 * kilojoule_per_mole
Ta: -149779.8577882084 * kilojoule_per_mole
W: -176248.83619057274 * kilojoule_per_mole
Re: -205660.78711122595 * kilojoule_per_mole
Os: -237930.364964837 * kilojoule_per_mole
Ir: -273916.6051998535 * kilojoule_per_mole
Pt: -313193.3625088413 * kilojoule_per_mole
Au: -356039.1883931112 * kilojoule_per_mole
Hg: -402551.5785347049 * kilojoule_per_mole
full_dataset_v1.1:
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- partial_charges
- total_Charge
- spin_multiplicities
- electronic_energy
- dispersion_energy
- total_energy
- dipole_moment_magnitude
- dipole_moment_computed
- dipole_moment_computed_scaled
- energy_of_lumo
- energy_of_homo
- homo_lumo_gap
about:
"This dataset contains 108541 unique systems with 108541 total configurations (1 configuration per system).
The original tmQM repository (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds
to the data committed on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13)."
remote_dataset:
doi: 10.5281/zenodo.15331686
url: https://zenodo.org/records/15331686/files/tmqm_dataset_v1.1.hdf5.gz
gz_data_file:
length: 328729880
md5: 16332938d98f71023a963b36e7ae2191
file_name: tmqm_dataset_v1.1.hdf5.gz
hdf5_data_file:
md5: b9454280213b488081955d46ebb378eb
file_name: tmqm_dataset_v1.1.hdf5
nc_1000_v1.1:
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- partial_charges
- total_Charge
- spin_multiplicities
- electronic_energy
- dispersion_energy
- total_energy
- dipole_moment_magnitude
- dipole_moment_computed
- dipole_moment_computed_scaled
- energy_of_lumo
- energy_of_homo
- homo_lumo_gap
about:
"This dataset contains 1000 unique systems with 1000 total configurations (1 configuration per system).
The original tmQM repository (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds
to the data committed on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13)."
remote_dataset:
doi: 10.5281/zenodo.15331808
url: https://zenodo.org/records/15331808/files/tmqm_dataset_v1.1_ntc_1000.hdf5.gz
gz_data_file:
length: 3605547
md5: d9bfdccaa16b440d5c95c90be6480dab
file_name: tmqm_dataset_nc_1000_v1.1.hdf5.gz
hdf5_data_file:
md5: d7fe366ecf8fee0264fcfac17e0b0b87
file_name: tmqm_dataset_nc_1000_v1.1.hdf5
PdZnFeCu_CHPSONFClBr_nc_1000_v1.1:
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- partial_charges
- total_Charge
- spin_multiplicities
- electronic_energy
- dispersion_energy
- total_energy
- dipole_moment_magnitude
- dipole_moment_computed
- dipole_moment_computed_scaled
- energy_of_lumo
- energy_of_homo
- homo_lumo_gap
about:
"This dataset contains 1000 unique systems with 1000 total configurations (1 configuration per system).
This is a 1000 conformer test system.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu,
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
The original tmQM repository (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds
to the data committed on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13)."
remote_dataset:
doi: 10.5281/zenodo.15345571
url: https://zenodo.org/records/15345571/files/tmqm_dataset_PdZnFeCu_CHPSONFClBr_ntc_1000_v1.1.hdf5.gz
gz_data_file:
length: 3137141
md5: 672025a093b4adf3233186be1c8393f4
file_name: tmqm_dataset_PdZnFeCu_CHPSONFClBr_nc_1000_v1.1.hdf5.gz
hdf5_data_file:
md5: 531977896b6074912098a0ec36baa665
file_name: tmqm_dataset_PdZnFeCu_CHPSONFClBr_nc_1000_v1.1.hdf5
PdZnFeCu_CHPSONFClBr_v1.1:
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- partial_charges
- total_Charge
- spin_multiplicities
- electronic_energy
- dispersion_energy
- total_energy
- dipole_moment_magnitude
- dipole_moment_computed
- dipole_moment_computed_scaled
- energy_of_lumo
- energy_of_homo
- homo_lumo_gap
about:
"This dataset contains 23183 unique systems with 23183 total configurations (1 configuration per system).
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu,
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
The original tmQM repository (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds
to the data committed on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13)."
remote_dataset:
doi: 10.5281/zenodo.15345127
url: https://zenodo.org/records/15345127/files/tmqm_dataset_PdZnFeCu_CHPSONFClBr_v1.1.hdf5.gz
gz_data_file:
length: 69495572
md5: ebd0ab2f6ab569980e2a2ce5c273146f
file_name: tmqm_dataset_PdZnFeCu_CHPSONFClBr_v1.1.hdf5.gz
hdf5_data_file:
md5: 4a0127a5c12c8c8a88c87eba058821ed
file_name: tmqm_dataset_PdZnFeCu_CHPSONFClBr_v1.1.hdf5
PdZnFeCuNiPtIrRhCrAg_CHPSONFClBr_v1.1:
hdf5_schema: 2
available_properties:
- atomic_numbers
- positions
- partial_charges
- total_Charge
- spin_multiplicities
- electronic_energy
- dispersion_energy
- total_energy
- dipole_moment_magnitude
- dipole_moment_computed
- dipole_moment_computed_scaled
- energy_of_lumo
- energy_of_homo
- homo_lumo_gap
about:
"This dataset contains 51258 unique systems with 51258 total configurations (1 configuration per system).
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, Cu, Ni, Pt, Ir, Rh, Cr, or Ag
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
The original tmQM repository (https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds
to the data committed on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13)."
remote_dataset:
doi: 10.5281/zenodo.15345149
url: https://zenodo.org/records/15345149/files/tmqm_dataset_PdZnFeCuNiPtIrRhCrAg_CHPSONFClBr_v1.1.hdf5.gz
gz_data_file:
length: 153375735
md5: 4add21b5612b9132d7ce4b7d232dfe3e
file_name: tmqm_dataset_PdZnFeCuNiPtIrRhCrAg_CHPSONFClBr_v1.1.hdf5.gz
hdf5_data_file:
md5: b0702a256be4f5cf581441ac35c7d7af
file_name: tmqm_dataset_PdZnFeCuNiPtIrRhCrAg_CHPSONFClBr_v1.1.hdf5
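Each entry above records an md5 checksum (and, for the gzipped file, a length) that can be used to confirm that a downloaded or cached file is intact. The following is a minimal sketch of such a check using only the Python standard library. It is not the modelforge implementation: the function names md5_of_file and matches_metadata are hypothetical, and treating length as the file size in bytes is an assumption.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_metadata(path, expected_md5, expected_length=None):
    """Compare a local file against the md5 (and optional length) of a metadata entry.

    Assumption: `expected_length` is the file size in bytes.
    """
    path = Path(path)
    if not path.is_file():
        return False
    # The size check is cheap, so do it before hashing the whole file.
    if expected_length is not None and path.stat().st_size != expected_length:
        return False
    return md5_of_file(path) == expected_md5
```

For example, matches_metadata("tmqm_dataset_v1.1.hdf5.gz", "16332938d98f71023a963b36e7ae2191", 328729880) would return True only if a local copy of the full tmQM archive matches the values listed in its entry.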
tmQM-xtb (tmqm_xtb): The tmQM-xtb dataset includes configurations generated using GFN2-xTB-based MD simulations starting from the energy-minimized geometries in the tmQM dataset. Energies, forces, charges, and dipole moments were calculated using the GFN2-xTB method. Several variants of the dataset are available, generated using different temperatures for MD sampling.
dataset: tmqm_xtb
latest: PdZnFeCuNiPtIrRhCrAg_T100K_v1.1
latest_test: nc_1000_v1.1
description: "The tmQM-xtb dataset was generated by running GFN2-xTB-based MD simulations starting from the energy-minimized geometries
in the tmQM dataset.
The original tmQM dataset contains the geometries and properties of mononuclear complexes extracted from the
Cambridge Structural Database, including Werner, bioinorganic, and organometallic complexes based on a large
variety of organic ligands and 30 transition metals (the 3d, 4d, and 5d from groups 3 to 12).
All complexes are closed-shell, with a formal charge in the range {+1, 0, −1}e.
Original Citation:
David Balcells and Bastian Bjerkem Skjelstad,
tmQM Dataset—Quantum Geometries and Properties of 86k Transition Metal Complexes
Journal of Chemical Information and Modeling 2020 60 (12), 6135-6146
DOI: 10.1021/acs.jcim.0c01041 "
atomic_self_energies:
H: -1346.9991827591664 * kilojoule_per_mole
C: -5617.968751828634 * kilojoule_per_mole
N: -7672.109298341974 * kilojoule_per_mole
O: -10704.649544039614 * kilojoule_per_mole
F: -12450.413867238472 * kilojoule_per_mole
Ir: -6598.040049917221 * kilojoule_per_mole
Pt: -8576.086025878865 * kilojoule_per_mole
P: -12100.053458428218 * kilojoule_per_mole
S: -4944.219007863149 * kilojoule_per_mole
Cl: -7938.35372876674 * kilojoule_per_mole
Cr: -12369.173271985948 * kilojoule_per_mole
Fe: -9663.693466916478 * kilojoule_per_mole
Ni: -1252.3530347274261 * kilojoule_per_mole
Cu: -10894.410447334463 * kilojoule_per_mole
Zn: -10182.310751929233 * kilojoule_per_mole
Br: -11739.997032286365 * kilojoule_per_mole
Rh: -9590.608153082434 * kilojoule_per_mole
Pd: -9713.417530536652 * kilojoule_per_mole
Ag: -11641.150291664564 * kilojoule_per_mole
PdZnFeCu_T100K_single_config_v1.0:
hdf5_schema: 2
available_properties:
- positions
- atomic_numbers
- total_charge
- forces
- dipole_moment_per_system
- energies
- partial_charges
about: "This dataset contains 23134 unique systems with 23134 total configurations (1 configuration per system).
Each configuration corresponds to the geometry distributed as part of the original tmQM dataset, with
no MD sampling applied.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu,
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
This dataset was generated starting from the tmQM dataset; the original tmQM repository
(https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism,
using the calculator as part of the Atomic Simulation Environment (ASE), calculated at accuracy level 1.
Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
remote_dataset:
doi: 10.5281/zenodo.15021819
url: https://zenodo.org/records/15021819/files/tmqm_xtb_dataset_PdZnFeCu_T100_first_v1.0.hdf5.gz
gz_data_file:
length: 96544047
md5: cb86823c62d2127c209cded323c03eef
file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_single_config_v1.hdf5.gz
hdf5_data_file:
md5: 96811817c3d65fdbe1c3691125ff0664
file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_single_config_v1.hdf5
nc_1000_v1.1:
hdf5_schema: 2
available_properties:
- positions
- atomic_numbers
- total_charge
- forces
- dipole_moment_per_system
- energies
- partial_charges
about: "This dataset contains 103 unique systems with 1000 total configurations (max of 10 configurations per system),
where MD sampling was performed at T=100K.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu, and also only contain
elements C, H, P, S, O, N, F, Cl, or Br.
Potentially problematic configurations (i.e., unstable or those with structural changes) were removed.
Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was
excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial,
energy minimized state.
This dataset was generated starting from the tmQM dataset; the original tmQM repository
(https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to
provide additional configurations of the systems.
- The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
- MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
- Simulations were performed at 100K with a 1 fs timestep and 0.01 1/fs friction damping factor.
- In all trajectories, the first configuration corresponds to the energy minimized configuration reported
in the original tmQM dataset.
- 100 steps were taken between snapshots (100 fs), with 10 total snapshots per system.
- During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at
gfn2-xtb accuracy level 1.
Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
remote_dataset:
doi: 10.5281/zenodo.15059379
gz_data_file:
length: 3425268
md5: 43e80a303a9e02c47cc679ee8502cd11
file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_ntc_1000_v1.1.hdf5.gz
hdf5_data_file:
md5: 6c8676c119a4f0028b3cf9c7de5d577c
file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_ntc_1000_v1.1.hdf5
url: https://zenodo.org/records/15059379/files/tmqm_xtb_dataset_PdZnFeCu_T100_ntc_1000_v1.1.hdf5.gz
PdZnFeCu_T100K_v1.1:
hdf5_schema: 2
available_properties:
- positions
- atomic_numbers
- total_charge
- forces
- dipole_moment_per_system
- energies
- partial_charges
about: "This dataset contains 23134 unique systems with 225068 total configurations, where MD sampling was performed
at T=100K.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu, and also only contain
elements C, H, P, S, O, N, F, Cl, or Br.
Potentially problematic configurations (i.e., unstable or those with structural changes) were removed.
Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was
excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial,
energy minimized state.
This dataset was generated starting from the tmQM dataset; the original tmQM repository
(https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to
provide additional configurations of the systems.
- The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
- MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
- Simulations were performed at 100K with a 1 fs timestep and 0.01 1/fs friction damping factor.
- In all trajectories, the first configuration corresponds to the energy minimized configuration reported
in the original tmQM dataset.
- 100 steps were taken between snapshots (100 fs), with 10 total snapshots per system.
- During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at
gfn2-xtb accuracy level 1.
Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
remote_dataset:
doi: 10.5281/zenodo.15059433
gz_data_file:
length: 828124531
md5: c7c8d48d7077dfbd10635a17ffa38848
file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_v1.1.hdf5.gz
hdf5_data_file:
md5: e121c9182a2c6621d9f92f8d4b4a8188
file_name: tmqm_xtb_dataset_PdZnFeCu_T100K_v1.1.hdf5
url: https://zenodo.org/records/15059433/files/tmqm_xtb_dataset_PdZnFeCu_T100_v1.1.hdf5.gz
PdZnFeCuNiPtIrRhCrAg_T100K_v1.1:
hdf5_schema: 2
available_properties:
- positions
- atomic_numbers
- total_charge
- forces
- dipole_moment_per_system
- energies
- partial_charges
about: "This dataset contains 51160 unique systems with 499087 total configurations, with MD sampling at T=100K.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, Cu, Ni, Pt, Ir, Rh, Cr, or Ag
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
Potentially problematic configurations (i.e., unstable or those with structural changes) were removed.
Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was
excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial,
energy minimized state.
This dataset was generated starting from the tmQM dataset; the original tmQM repository
(https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to
provide additional configurations of the systems.
- The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
- MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
- Simulations were performed at 100K with a 1 fs timestep and 0.01 1/fs friction damping factor.
- In all trajectories, the first configuration corresponds to the energy minimized configuration reported
in the original tmQM dataset.
- 100 steps were taken between snapshots (100 fs), with 10 total snapshots per system.
- During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at
gfn2-xtb accuracy level 1.
Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
remote_dataset:
doi: 10.5281/zenodo.15059465
gz_data_file:
length: 1829694005
md5: 9efd03d7c18901b5618489db6209d0a0
file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T100K_v1.1.hdf5.gz
hdf5_data_file:
md5: 16fa0b45afb7ff3ca9568cca54d89de0
file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T100K_v1.1.hdf5
url: https://zenodo.org/records/15059465/files/tmqm_xtb_dataset_PdZnFeCuNiPtIrCrAg_T100_v1.1.hdf5.gz
PdZnFeCuNiPtIrRhCrAg_T200K_v1.1:
hdf5_schema: 2
available_properties:
- positions
- atomic_numbers
- total_charge
- forces
- dipole_moment_per_system
- energies
- partial_charges
about: "This dataset contains 51249 unique systems with 1317625 total configurations, sampled at T=200K.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, Cu, Ni, Pt, Ir, Rh, Cr, or Ag
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
Potentially problematic configurations (i.e., unstable or those with structural changes) were removed.
Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was
excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial,
energy minimized state.
This dataset was generated starting from the tmQM dataset; the original tmQM repository
(https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to
provide additional configurations of the systems.
- The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
- MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
- Simulations were performed at 200K with a 1 fs timestep and 0.01 1/fs friction damping factor.
- In all trajectories, the first configuration corresponds to the energy minimized configuration reported
in the original tmQM dataset.
- 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
- During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at
gfn2-xtb accuracy level 1.
Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
remote_dataset:
doi: 10.5281/zenodo.15226046
gz_data_file:
length: 4749276362
md5: a1d03a025ecfd48d7dc286b3d71cb900
file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T200K_v1.1.hdf5.gz
hdf5_data_file:
md5: 19203071e1ff743d3402a36750f74b86
file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T200K_v1.1.hdf5
url: https://zenodo.org/records/15226046/files/tmqm_xtb_dataset_PdZnFeCuNiPtIrCrAg_T200_v1.1.hdf5.gz
PdZnFeCuNiPtIrRhCrAg_T300K_v1.1:
hdf5_schema: 2
available_properties:
- positions
- atomic_numbers
- total_charge
- forces
- dipole_moment_per_system
- energies
- partial_charges
about: "This dataset contains 51252 unique systems with 1118541 total configurations, sampled at T=300K.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, Cu, Ni, Pt, Ir, Rh, Cr, or Ag
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
Potentially problematic configurations (i.e., unstable or those with structural changes) were removed.
Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was
excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial,
energy minimized state.
This dataset was generated starting from the tmQM dataset; the original tmQM repository
(https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to
provide additional configurations of the systems.
- The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
- MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
- Simulations were performed at 300K with a 1 fs timestep and 0.01 1/fs friction damping factor.
- In all trajectories, the first configuration corresponds to the energy minimized configuration reported
in the original tmQM dataset.
- 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
- During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at
gfn2-xtb accuracy level 1.
Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
remote_dataset:
doi: 10.5281/zenodo.15226639
gz_data_file:
length: 4062452149
md5: 5005e4b8c329031b14ceeef67cb67644
file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T300K_v1.1.hdf5.gz
hdf5_data_file:
md5: 999454490fe077a88c409970504f7f41
file_name: tmqm_xtb_dataset_PdZnFeCuNiPtIrRhCrAg_T300K_v1.1.hdf5
url: https://zenodo.org/records/15226639/files/tmqm_xtb_dataset_PdZnFeCuNiPtIrCrAg_T300_v1.1.hdf5.gz
PdZnFeCu_T200K_v1.1:
hdf5_schema: 2
available_properties:
- positions
- atomic_numbers
- total_charge
- forces
- dipole_moment_per_system
- energies
- partial_charges
about: "This dataset contains 23175 unique systems with 584,935 total configurations, sampled at T=200K.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu,
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
Potentially problematic configurations (i.e., unstable or those with structural changes) were removed.
Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was
excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial,
energy minimized state.
This dataset was generated starting from the tmQM dataset; the original tmQM repository
(https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to
provide additional configurations of the systems.
- The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
- MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
- Simulations were performed at 200K with a 1 fs timestep and 0.01 1/fs friction damping factor.
- In all trajectories, the first configuration corresponds to the energy minimized configuration reported
in the original tmQM dataset.
- 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
- During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at
gfn2-xtb accuracy level 1.
Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
remote_dataset:
doi: 10.5281/zenodo.15227023
gz_data_file:
length: 2118955545
md5: 834ec7ed3670dfaaacc78beccc4b8a8d
file_name: tmqm_xtb_dataset_PdZnFeCu_T200K_v1.1.hdf5.gz
hdf5_data_file:
md5: 4cb6d3e170e5cb9c63e2cac58b84a33f
file_name: tmqm_xtb_dataset_PdZnFeCu_T200K_v1.1.hdf5
url: https://zenodo.org/records/15227023/files/tmqm_xtb_dataset_PdZnFeCu_T200_v1.1.hdf5.gz
PdZnFeCu_T200K_ncr10_v1.1:
hdf5_schema: 2
available_properties:
- positions
- atomic_numbers
- total_charge
- forces
- dipole_moment_per_system
- energies
- partial_charges
about: "This dataset contains 23175 unique systems with 230,030 total configurations (maximum of 10 per system),
sampled at T=200K. While 30 configurations were generated per system during sampling, this dataset limits
this to be a maximum of 10 configurations per system, to allow for more direct comparison with T=100K data.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu,
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
Potentially problematic configurations (i.e., unstable or those with structural changes) were removed.
Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was
excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial,
energy minimized state.
This dataset was generated starting from the tmQM dataset; the original tmQM repository
(https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to
provide additional configurations of the systems.
- The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
- MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
- Simulations were performed at 200K with a 1 fs timestep and 0.01 1/fs friction damping factor.
- In all trajectories, the first configuration corresponds to the energy minimized configuration reported
in the original tmQM dataset.
- 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
- During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at
gfn2-xtb accuracy level 1.
Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
remote_dataset:
doi: 10.5281/zenodo.15227086
gz_data_file:
length: 846498137
md5: 9bf52e1a6ce2fa0a72c93600fb7c7431
file_name: tmqm_xtb_dataset_PdZnFeCu_T200K_ncr_10_v1.1.hdf5.gz
hdf5_data_file:
md5: 624979457c74cb472bef4bbbba77920b
file_name: tmqm_xtb_dataset_PdZnFeCu_T200K_ncr_10_v1.1.hdf5
url: https://zenodo.org/records/15227086/files/tmqm_xtb_dataset_PdZnFeCu_T200_first10_v1.1.hdf5.gz
PdZnFeCu_T300K_v1.1:
hdf5_schema: 2
available_properties:
- positions
- atomic_numbers
- total_charge
- forces
- dipole_moment_per_system
- energies
- partial_charges
about: "This dataset contains 23177 unique systems with 490,861 total configurations, sampled at T=300K.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu,
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
Potentially problematic configurations (i.e., unstable or those with structural changes) were removed.
Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was
excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial,
energy minimized state.
This dataset was generated starting from the tmQM dataset; the original tmQM repository
(https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to
provide additional configurations of the systems.
- The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
- MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
- Simulations were performed at 300K with a 1 fs timestep and 0.01 1/fs friction damping factor.
- In all trajectories, the first configuration corresponds to the energy minimized configuration reported
in the original tmQM dataset.
- 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
- During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at
gfn2-xtb accuracy level 1.
Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
remote_dataset:
doi: 10.5281/zenodo.15227144
gz_data_file:
length: 1793203012
md5: c133327bcf73182efecccaea51a34fdf
file_name: tmqm_xtb_dataset_PdZnFeCu_T300K_v1.1.hdf5.gz
hdf5_data_file:
md5: 0bbee004a633654963b57811b690b128
file_name: tmqm_xtb_dataset_PdZnFeCu_T300K_v1.1.hdf5
url: https://zenodo.org/records/15227144/files/tmqm_xtb_dataset_PdZnFeCu_T300_v1.1.hdf5.gz
PdZnFeCu_T300K_ncr10_v1.1:
hdf5_schema: 2
available_properties:
- positions
- atomic_numbers
- total_charge
- forces
- dipole_moment_per_system
- energies
- partial_charges
about: "This dataset contains 23177 unique systems with 225,571 total configurations with a maximum number
of 10 configurations per system, sampled at T=300K. While 30 configurations were generated,
this was restricted to only be 10 maximum per system for comparison to the T=100K data where
only 10 configurations were generated.
This dataset is limited to systems that contain transition metals Pd, Zn, Fe, or Cu,
and also only contain elements C, H, P, S, O, N, F, Cl, or Br.
Potentially problematic configurations (i.e., unstable or those with structural changes) were removed.
Briefly, bond inference was performed on the initial configuration using RDKit and a configuration was
excluded if any of those bond distances changed by more than 0.15 angstroms compared to the initial,
energy minimized state.
This dataset was generated starting from the tmQM dataset; the original tmQM repository
(https://github.com/uiocompcat/tmQM) was forked and a release made that corresponds to the data committed
on 13 August 2024 (https://github.com/chrisiacovella/tmQM/releases/tag/2024Aug13).
Each system in the tmQM database was evaluated using gfn2-xtb, and then a short MD simulation performed to
provide additional configurations of the systems.
- The tblite package was used to evaluate the energetics of the system using the gfn2-xtb formalism.
- MD simulations were performed using the Atomic Simulation Environment (ASE), using the Langevin integrator.
- Simulations were performed at 300K with a 1 fs timestep and 0.01 1/fs friction damping factor.
- In all trajectories, the first configuration corresponds to the energy minimized configuration reported
in the original tmQM dataset.
- 100 steps were taken between snapshots (100 fs), with 30 total snapshots per system.
- During MD sampling, gfn2-xtb accuracy was set to 2; all reported properties were calculated at
gfn2-xtb accuracy level 1.
Scripts used to perform the sampling can be found at https://github.com/chrisiacovella/xtb_config_gen"
remote_dataset:
doi: 10.5281/zenodo.15227237
gz_data_file:
length: 831615242
md5: f7c6dca18f52d99253cbdd74fe540032
file_name: tmqm_xtb_dataset_PdZnFeCu_T300K_ncr_10_v1.1.hdf5.gz
hdf5_data_file:
md5: 60219cdee1c975f3eef2a193d16a3dcc
file_name: tmqm_xtb_dataset_PdZnFeCu_T300K_ncr_10_v1.1.hdf5
url: https://zenodo.org/records/15227237/files/tmqm_xtb_dataset_PdZnFeCu_T300_first10_v1.1.hdf5.gz
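The tmQM-xtb descriptions above state that a configuration was excluded if any bond distance changed by more than 0.15 angstroms relative to the initial, energy-minimized state. The sketch below illustrates that criterion in plain Python. It is an illustration under stated assumptions, not the script used to build the datasets: the bond list is assumed to be supplied by the caller (the original workflow inferred bonds with RDKit, which is not reproduced here), positions are assumed to be in angstroms, and the function names are hypothetical.

```python
import math

# Threshold quoted in the dataset descriptions, in angstroms.
MAX_BOND_CHANGE = 0.15


def bond_lengths(positions, bonds):
    """Return the length of each (i, j) bond for a list of (x, y, z) positions."""
    return [math.dist(positions[i], positions[j]) for i, j in bonds]


def is_stable(reference, candidate, bonds, tolerance=MAX_BOND_CHANGE):
    """True if no bond length differs from the reference by more than `tolerance`."""
    ref = bond_lengths(reference, bonds)
    new = bond_lengths(candidate, bonds)
    return all(abs(n - r) <= tolerance for r, n in zip(ref, new))


def filter_trajectory(trajectory, bonds, tolerance=MAX_BOND_CHANGE):
    """Keep the snapshots whose bonds stay within tolerance of the first snapshot.

    Assumption: the first snapshot is the energy-minimized reference geometry.
    """
    reference = trajectory[0]
    return [snap for snap in trajectory if is_stable(reference, snap, bonds, tolerance)]
```

Because snapshots are dropped individually, a system sampled with 10 or 30 snapshots can end up with fewer configurations, which is consistent with the total configuration counts listed above being lower than the number of systems multiplied by the snapshots per system.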