{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2b1e109f-b2f5-42a8-bcc3-0acf2d7af9a0",
   "metadata": {},
   "source": [
    "# modelforge.curate : Basic Usage\n",
    "\n",
    "This notebook will demonstrate basic usage of the `curate` module in modelforge, developed to make it easier to create modelforge compatible HDF5 datasets with a uniform structure.  This module puts an emphasis of validation at the time of construction. \n",
    "\n",
    "In the curate module, we have 3 levels of hierarchy: \n",
    "\n",
    "- At the top most level, we have a dataset (i.e., an instance of `SourceDataset`)\n",
    "- A dataset contains records (instances of the `Record` class)\n",
    "- Each record contains properties (each property is defined as a Pydantic model that is a child of the `PropertyBaseModel` class)\n",
    "\n",
    "To start, let us import the packages we need\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6cf4ef4a-d9b9-49bc-bd8a-3f5e62953c5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from modelforge.curate import Record, SourceDataset\n",
    "from modelforge.utils.units import GlobalUnitSystem\n",
    "from modelforge.curate import AtomicNumbers, Positions, Energies, Forces, MetaData\n",
    "\n",
    "from openff.units import unit\n",
    "\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcfb4ef7-18f9-41dd-9b08-14bb8afd88d7",
   "metadata": {},
   "source": [
    "## Set up a new dataset\n",
    "Next, we will create an instance of the `SourceDataset` class to store the dataset, providing a `name` for the dataset as a string. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bad426d0-eb06-44b3-8ec5-7bb7cb016e0a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m2025-07-16 15:27:42.598\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.sourcedataset\u001b[0m:\u001b[36m__init__\u001b[0m:\u001b[36m66\u001b[0m - \u001b[33m\u001b[1mDatabase file test_dataset.sqlite already exists in ./. Removing it.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "new_dataset = SourceDataset(name=\"test_dataset\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aae624b3-f5f1-4543-990d-be2a79697b04",
   "metadata": {},
   "source": [
    "## Create a record\n",
    "\n",
    "To create a record, we  instantiate the `Record` class providing a unique `name` as a string; this `name` will be used within the dataset to access/update records and thus needs to be unique."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "58f90876-3bc8-4e07-b2f7-b7e65242067b",
   "metadata": {},
   "outputs": [],
   "source": [
    "record_mol1 = Record(name='mol1')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e37ac0df-ff9d-4a73-ae5a-eef94ba7f17e",
   "metadata": {},
   "source": [
    "## Define properties\n",
    "The curate packages provides pydantic models for many common properties reported in datasets. \n",
    "\n",
    "Each record must include a few basic elements to be considered complete, namely:\n",
    "- **atomic numbers**\n",
    "- **positions**\n",
    "- **energies**\n",
    "  \n",
    "Records may of course contain other properties/metadata, but this is the minimal set of information required by modelforge during training. \n",
    "\n",
    "### Property classifications\n",
    "Properties are classified into four categories. These categories are used to validate the inputs (including the shape of the underlying arrays).  These classifications are also used within modelforge to know how to parse information from the dataset (as the shape of the associated array itself may not be sufficient).  \n",
    "\n",
    "The four categories as as follows: \n",
    "- **atomic_numbers** -- array must have a shape (n_atoms,1). Regardless of the number of configurations in a property, the atomic numbers must be consistent, and thus do not need to be defined separately for each configuration.\n",
    "- **per_system** -- array must have a at least 2 dimensions, where the first dimension corresponds to the configuration, i.e., (n_configs, X). Energy is an example of a per_system property with shape (n_configs, 1)\n",
    "- **per_atom** -- array must have at least 3 dimensions, where the first two dimensions correspond to the configuration, and the second the atom, i.e., (n_configs, n_atoms, X). Partial charge is an example of a per_atom property with shape (n_config, n_atoms, 1)\n",
    "- **meta_data** -- there are no shape requirements for meta_data, however, input is limited to the following types: string, float, int, list, numpy array\n",
    "\n",
    "Users do not need to set the classification of a property for the pre-defined models within the module; the appropriate value is defined by default. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b381c6d-c740-48bb-af75-1a4c712bba8a",
   "metadata": {},
   "source": [
    "### Defining atomic numbers\n",
    "Let us consider how to initialize atomic numbers, in this case for a methane CH4:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "587ce24b-4d72-40fb-a850-d0f85079178b",
   "metadata": {},
   "outputs": [],
   "source": [
    "atomic_numbers = AtomicNumbers(value=np.array([[6], [1], [1], [1], [1]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00f90255-9b73-44bc-a24b-bd21eabf6e32",
   "metadata": {},
   "source": [
    "The array that stores the atomic numbers must have the shape (n_atoms, 1).  An error will be raised if `len(value.shape) != 2` or `value.shape[1] != 1`. \n",
    "\n",
    "Properties can accept either a numpy array or a python list as input; the python list it will be converted to a numpy array automatically.  For example, the following syntax will produce an equivalent instance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3e19eb07-cb8d-4d5e-b54e-c18df911c697",
   "metadata": {},
   "outputs": [],
   "source": [
    "atomic_numbers = AtomicNumbers(value=[[6], [1], [1], [1], [1]])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebefc0a9-9cde-4488-9f05-780c7856564d",
   "metadata": {},
   "source": [
    "### Defining positions\n",
    "\n",
    "To define positions, we will use the `Positions` pydantic model.  Since positions must have units associated with them, they must be set at the time of initialization.  The property models do not include a default unit, unless the property does not require units (e.g., such as `AtomicNumbers`). \n",
    "\n",
    "Units can be passed as an openff.units `Unit` or a string that can be understood by openff.units. An error will be raised if units are not defined, or if the units passed are not compatible (i.e., not a length measurement for `Positions`).\n",
    "\n",
    "Positions are a \"per_atom\" property storing the x,y, and z positions and thus must be 3d array with shape (n_configs, n_atoms, 3).\n",
    "If `value.shape[2] !=3` or `len(value.shape) != 3`, this will raise an error.  \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a5bcd450-3d52-4a45-ab1a-f107e11df202",
   "metadata": {},
   "outputs": [],
   "source": [
    "positions = Positions(\n",
    "    value=np.array([[[0.0, 0.0, 0.0], [0.109, 0.0, 0.0], [-0.054, 0.094, 0.0], [-0.054, -0.094, 0.0], [0.0, 0.0, 0.109]]]), \n",
    "    units=\"nanometer\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9267df83-67d6-4a12-8ac3-d9a72a67b147",
   "metadata": {},
   "source": [
    "We can easily examine the positions, where we can see the `value`, `units`, `classification` and `property_type` (used to ensure unit compatibility); this also will determine the `n_configs` and `n_atoms` based on the shape of the underlying array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "711bec10-63ec-44a8-a368-f38bd3bb3577",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name='positions' value=array([[[ 0.   ,  0.   ,  0.   ],\n",
      "        [ 0.109,  0.   ,  0.   ],\n",
      "        [-0.054,  0.094,  0.   ],\n",
      "        [-0.054, -0.094,  0.   ],\n",
      "        [ 0.   ,  0.   ,  0.109]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=5\n"
     ]
    }
   ],
   "source": [
    "print(positions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "144b4f9e-62b8-42b5-8e79-429e133574b4",
   "metadata": {},
   "source": [
    "### Defining energies \n",
    "To define energies, we will use the `Energies` pydantic model; as with positions, units must also be set.  \n",
    "\n",
    "Note, energy is a \"per_system\" property and thus the shape of the input array must be (n_configs, 1); an error will be raised if `value.shape[1] !=1` or `len(value.shape) != 2`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e03c8a4f-a49f-444f-8b3c-252d7373e77a",
   "metadata": {},
   "outputs": [],
   "source": [
    "energies = Energies(\n",
    "    value=np.array([[0.1]]), \n",
    "    units=unit.hartree\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9261ac4-ba1a-4b61-8ec6-14e6be08860c",
   "metadata": {},
   "source": [
    "### Definiting meta data\n",
    "\n",
    "We can also provide meta data in the form of int, float, str, list, or numpy arrays.  These properties do not necessarily undergo any significant validation as this information is not used directly by modelforge. \n",
    "\n",
    "Below is an example of using the MetaData class to define the smiles of the molecule, passed as a string. \n",
    "\n",
    "Note, a SMILES property class could be defined that includes validation, e.g., passing the string to RDKit to ensure it is valid, however this has not been implemented at the current time. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "49c9e150-a07f-4b31-bd77-654003983912",
   "metadata": {},
   "outputs": [],
   "source": [
    "smiles = MetaData(name='smiles', value='C')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d21b6dc0-7fa4-4a72-aa01-57326899a7b2",
   "metadata": {},
   "source": [
    "### Other properties\n",
    "\n",
    "Pydantic models have been defined for the following properties at the time of creating this tutorial:\n",
    "- `atomic_numbers`\n",
    "- `Energies`\n",
    "- `Positions`\n",
    "- `Forces`\n",
    "- `PartialCharges`\n",
    "- `TotalCharge`\n",
    "- `SpinMultiplicitiesPerSystem`\n",
    "- `DipoleMomentPerSystem`\n",
    "- `DipoleMomentPerAtom`\n",
    "- `DipoleMomentScalarPerSystem`\n",
    "- `QuadrupoleMomentPerSystem`\n",
    "- `QuadrupoleMomentPerAtom`\n",
    "- `OctupoleMomentPerAtom`\n",
    "- `Polarizability`\n",
    "- `BondOrders`\n",
    "- `MetaData`\n",
    "\n",
    "Note, each of these models inherits from a more general `PropertyBaseClass` pydantic model; this model can be used to define any additional properties, but requires the user to provide the classification (e.g., per_atom, per_system) and the property_type (for the purposes of unit conversion, e.g., length, energy, force, charge, etc.). \n",
    "\n",
    "Classes for additional properties can be added to the module as well; this set was generated based on what was encountered within the current datasets supported by `modelforge`.\n",
    "\n",
    "**More information on defining properties is provided in the \"defining_properties.ipynb\" notebook.**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d42293f-4a96-42ef-adb5-8fe12c6ef045",
   "metadata": {},
   "source": [
    "## Add properties to a record\n",
    "\n",
    "Having defined properties we can now add them to the record. Properties can be added individually to the record or provided as a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "72291451-a62f-4780-b981-8fcd4127a9f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "record_mol1.add_property(atomic_numbers)\n",
    "record_mol1.add_properties([positions,energies, smiles])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99ee9344-8ba6-4db8-aaa2-5a2811bde1bc",
   "metadata": {},
   "source": [
    "By default when instantiating a new `Record` instance, `append_property = False`.\n",
    "If `append_property == False`, an error will be raised if you try to add a property with the same name more than once to the same record. This ensures we do not accidentally overwrite data in a record.\n",
    "\n",
    "They following will produce a ValueError because \"energies\" have already been set for the record.  Note, in all cases, atomic_numbers can only be set once, regardless of the state of append_property, as it is a unique case. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "caf4f58d-502a-48ee-baec-e1fb99e094b0",
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Property with name energies already exists in the record mol1.Set append_property=True to append to the existing property.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[11], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mrecord_mol1\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43madd_property\u001b[49m\u001b[43m(\u001b[49m\u001b[43menergies\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/PycharmProjects/modelforge/modelforge-curate/modelforge/curate/record.py:433\u001b[0m, in \u001b[0;36mRecord.add_property\u001b[0;34m(self, property)\u001b[0m\n\u001b[1;32m    429\u001b[0m     error_msg \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mProperty with name \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mname\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m already exists in the record \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mname\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    430\u001b[0m     error_msg \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m (\n\u001b[1;32m    431\u001b[0m         \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSet append_property=True to append to the existing property.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    432\u001b[0m     )\n\u001b[0;32m--> 433\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(error_msg)\n\u001b[1;32m    435\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m (\n\u001b[1;32m    436\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mper_system[\u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mname]\u001b[38;5;241m.\u001b[39mvalue\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m1\u001b[39m]\n\u001b[1;32m    437\u001b[0m     \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mvalue\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m1\u001b[39m]\n\u001b[1;32m    438\u001b[0m )\n\u001b[1;32m    439\u001b[0m temp_array \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mvalue\n",
      "\u001b[0;31mValueError\u001b[0m: Property with name energies already exists in the record mol1.Set append_property=True to append to the existing property."
     ]
    }
   ],
   "source": [
    "record_mol1.add_property(energies)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81920b1c-1101-4148-9051-2a7c40f8b5ef",
   "metadata": {},
   "source": [
    "### Validating a record\n",
    "\n",
    "The use of pydantic allows for considerable validation at the time of initialization of the properties, e.g., ensuring units have been set, compatibility of units, and minimal examination of the shape of the input array.  However, since each property is defined separately, we are unable to cross validate n_atoms and n_configs until those properties are grouped into a record. \n",
    "\n",
    "An individual record can be validated to ensure consistency of n_configs and n_atoms.  Since this minimal only has 3 properties, this checks that n_atoms in `atomic_numbers` matches `positions` and n_configs matches in `energies` and `positions`. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "a00a9191-1345-4923-8a9a-f052b8a518db",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "record_mol1.validate()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "962b873e-587b-47c9-be6c-3af3ced73cea",
   "metadata": {},
   "source": [
    "### Viewing a record\n",
    "Printing a record provides a summary of the contents.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "cf96ffd4-de45-427c-9937-19a3ab60bf41",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name: mol1\n",
      "* n_atoms: 5\n",
      "* n_configs: 1\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[6],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=5\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[ 0.   ,  0.   ,  0.   ],\n",
      "        [ 0.109,  0.   ,  0.   ],\n",
      "        [-0.054,  0.094,  0.   ],\n",
      "        [-0.054, -0.094,  0.   ],\n",
      "        [ 0.   ,  0.   ,  0.109]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=5\n",
      "* per-system properties: (['energies']):\n",
      " -  name='energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='C' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(record_mol1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84a3cbc9-c4c0-4990-8b0a-adb675a20e0c",
   "metadata": {},
   "source": [
    "The record can also be exported to a dict. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "38a2bed3-e521-49ae-a2ef-2ecd58d15b84",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'name': 'mol1',\n",
       " 'n_atoms': 5,\n",
       " 'n_configs': 1,\n",
       " 'atomic_numbers': AtomicNumbers(name='atomic_numbers', value=array([[6],\n",
       "        [1],\n",
       "        [1],\n",
       "        [1],\n",
       "        [1]]), units=<Unit('dimensionless')>, classification='atomic_numbers', property_type='atomic_numbers', n_configs=None, n_atoms=5),\n",
       " 'per_atom': {'positions': Positions(name='positions', value=array([[[ 0.   ,  0.   ,  0.   ],\n",
       "          [ 0.109,  0.   ,  0.   ],\n",
       "          [-0.054,  0.094,  0.   ],\n",
       "          [-0.054, -0.094,  0.   ],\n",
       "          [ 0.   ,  0.   ,  0.109]]]), units=<Unit('nanometer')>, classification='per_atom', property_type='length', n_configs=1, n_atoms=5)},\n",
       " 'per_system': {'energies': Energies(name='energies', value=array([[0.1]]), units=<Unit('hartree')>, classification='per_system', property_type='energy', n_configs=1, n_atoms=None)},\n",
       " 'meta_data': {'smiles': MetaData(name='smiles', value='C', units=<Unit('dimensionless')>, classification='meta_data', property_type='meta_data', n_configs=None, n_atoms=None)}}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "record_mol1.to_dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3dd8692-cec0-4ef8-80b8-bd4999615637",
   "metadata": {},
   "source": [
    "Records can be converted to RDKit molecules. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "d18eaf81-3596-4be9-83fa-aafd514a9016",
   "metadata": {},
   "outputs": [],
   "source": [
    "rd_mol = record_mol1.to_rdkit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "af51c524-45c6-4d63-b366-7f1eea0b1d16",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAABmJLR0QA/wD/AP+gvaeTAAAK9klEQVR4nO3de0zV9R/H8c/3cDBA0QgHx0U/NsNz6NiFVGquQxdzMqytdVmSZenstvxDECO0FZkZtFzSlnNJXnDpVmORf5jdNHMt10Y7hqntoOYEdgYIEnA4cjl8f3+czcrKA3y+3+/n4Hk+xhzg+4/XHOfF+3PO93vUdF0XAICxsqkOAADjGzUKAFKoUQCQQo0CgBRqFACkUKMAIMWuOgAU+OGHH7799lshRHJycklJScT5o0ePfv7550IITdPKy8tNzweMKxrXjcagioqKtWvXCiEcDoff7484v3PnzmXLloU/5wcGuAyHegCQQo0CgBRqFACkUKMAIIUaBQAp1CgASKFGAUAKNQoAUqhRAJDCzaAxrbW1derUqRHH+vv7LQgDjFPUaEzTdb2jo0N1CmB8o0Zjms1mmzZtWsSxvr6+CxcuWJAHGI+o0ZiWlpbW3Nwcceyvb00C4DK8xAQAUqhRAJBCjQKAFGoUAKRQowAghRoFACnUKABIoUYBQAo1CgBSqFEAkMLNoLHogQceSE9PF0IkJSWNZN7j8Wzbtk0IoWmaucmAcUjTdV11BgAYxzjUY0QGBwfff//9goICfu8Cl2EbxYj09vY6nU6/37979+7FixerjgNEEbZRjMikSZM2bNgghCgtLQ0EAqrjAFGEGsVIPfPMM7m5uS0tLe+++67qLEAU4VCPUThy5Mhdd92VkJBw8uTJzMxM1XGAqMA2ilGYO3fuokWLgsFgWVmZ6ixAtGAbxeg0NzdnZ2cHAoHDhw/n5eWpjgOoxzaK0cnIyCgpKRFCrFy5cnh4WHUcQD22UYxaMBjMzs4+d+7cjh07li5dqjoOoBg1irHYvXv3U089lZ6e7vP5Jk+erDoOoBKHeozF4sWLPR5Pa2trZWWl6iyAYmyjGKOff/75jjvusNvtx48fz8rKUh0HUIZtFGM0e/bsJ598cmBggIufEOPYRjF2LS0t2dnZvb2933zzzfz581XHAdRgG8XYXX/99a+88ooQYvXq1aFQSHUcQA22UUi5ePHiTTfddPbs2a1btz733HOq4wAKUKOQ9emnny5atCgtLc3n802ZMkV1HMBqHOoh6/HHH7/77rvb2treeust1VkABdhGYQCv1ztnzhy73X7s2DGn06k6DmAptlEY4Pbbb1+6dOnAwMDLL7+sOgtgNbZRGKOtrc3pdP7xxx9ffvllfn6+6jiAddhGYYy0tLQ1a9YIIVatWjU0NKQ6DmAdahSGKS4unjFjxokTJz788EPVWQDrcKiHkerq6h555JGUlJTGxsbU1FTVcQArsI3CSA8//PCCBQsuXLjw5ptvqs4CWIRtFAY7ceLEbbfdJoTwer0333yz6jiA6dhGYTC32/3ss88ODQ0VFxerzgJYgW0Uxuvs7JwxY0ZnZ+e+ffsWLlyoOg5gLmoUxqupqamrq9u7d29GRsbp06cnTJhw5fna2tqmpiYhhNvtjoZrTquqqsKPi/z8fLfbHXG+urq6t7dXCOHxeHJzc03Ph2ijA0ZzuVyXfsA2bdoUcf7ee+8NDz/99NMWxIvoUvgdO3aMZN7hcITn3377bZOjIRrx3CjMtW7duvPnz6tOAZiIGoWJpk+f3tXV9frrr6sOApiIGoWJ5s2bFx8fv3Xr1oaGBtVZALNQozBRamrqiy++GAqFuPgJVzG76gC4ym36+uuSuLjQwYMBh2PixIn/MpGYeGNi4iGrcwGGoUZholXbtsV1dGSGX/tubf2vsRUpKdusCwUYjEM9TJTS1SV0PdTQsNDlulGI6rIycfr03z727BFCaFy8jPGMGoXp4rKySjZvPiNEyebN/sREMX36nx/TpqlOB8iiRmGF+++//8EHH+zp6eHiJ1x9eG4UJgqFQvFCzJw586KmDQ4Oapr20Ucf7d+/
/5prrgkP3BkM7hGiu7tbbc7/Ulpaun79+ohj7e3tFoRB1KJGYSJd14UQv//+e/Av32xpabn0+f+EEEIMDw+Hv9y1a9euXbusy/d3DofD7/f/9Tvt7e1UJCKiRmG6SZMmxQkhhOjv7x8cHFScZjQSEhLs9siPkUAgoPMqWQzjuVGYKNxBbW1tPT09Xq/XZrPZbLb6+vpL7+nw3XffCSGuvfba8Lzatya5bBUVQmzZsqVnBNLT0y3+h0VUoUZhkaKiov7+/uXLl8+ePVt1FsBI1CiscODAgX379iUnJ/N/NOHqw3OjMF3o1KmNL700XYiyFSscfX3izJk//+4f52hg3KFGYSLdZhNCxN166/7w15WVorLyX8Y0zdJYgKGoUZiotqCg8JdfmpqaQqGQw+FISkr6l6HExM2JiaK+3vJ0gDGoUZioITv7yA03fPDBB/PmzTtw4MB/jZ2+7z4rUwHGokZhoo6Ojpqamri4uKqqKtVZALPwSj1MdPDgwcHBwRdeeOGWW25RnQUwCzUKE505cyYlJWXdunWqgwAmokZhrvLy8qlTp6pOAZhI07kXGEarqampq6vbu3dvZmZmY2NjfHz8ledra2ubmpqEEG63Oz8/35KMV1JVVRV+XOTn57vd7ojz1dXVvb29QgiPx5Obm2t6PkQZahTGa29vdzqdXV1dX3zxRUFBgeo4gLk41MN4r732WldX1/z58+lQxAK2URjs+PHjOTk5QoijR4/OnDlTdRzAdGyjMFhxcfHQ0NCKFSvoUMQItlEY6bPPPnv00Uevu+46n8+XmpqqOg5gBbZRGGZgYKCsrEwIsX79ejoUsYMahWHee++9xsZGt9v9/PPPq84CWIdDPYzR2trqdDq7u7u/+uqrBQsWqI4DWIdtFMZYs2ZNd3f3Qw89RIci1rCNwgBer3fOnDl2u/3YsWNOp1N1HMBSbKMwQFFR0fDw8MqVK+lQxCC2Ucj65JNPCgsL09LSfD7flClTVMcBrMY2CinBYDB8kdOGDRvoUMQmahRSNm7cePbs2ZycnGXLlqnOAqjBoR5j19LS4nK5AoHAoUOH7rnnHtVxADXYRjF2ZWVlgUDgscceo0MRy9hGMUY//fTT3LlzJ0yY8Ouvv2ZlZamOAyjDNoqx0HW9pKQk/CcdihjHNoqx+Pjjj5csWZKenu7z+SZPnqw6DqAS2yhGLRgMvvrqq0KId955hw4FqFGMWkVFxblz52bNmrVkyRLVWQD1ONRjdJqbm10uVzAY/P777/Py8lTHAdRjG8XorF69uq+vr7CwkA4FwthGMQo//vijx+NJSEg4efJkZmam6jhAVGAbxUgNDw8XFRXpul5aWkqHApewjWKktm/fvnz58oyMjN9++23ixImq4wDRghrFiPT09LhcLr/fv2fPnieeeEJ1HCCKxL3xxhuqM2Ac0DTNbrfbbLaKigpN01THAaII2ygASLGrDgAFGhoa6uvrhRBJSUmFhYUR50+dOnX48GEhhKZpvK8ocBm20VhUUVGxdu1aIYTD4fD7/RHnd+7ceak9+YEBLsMFTwAghRoFACnUKABIoUYBQAo1CgBSqFEAkEKNAoAUahQApFCjACCFm0Fj2vnz52fNmhVxrLOz04IwwDhFjca0oaEhr9erOgUwvnGoBwAp1GhMczgcoRHYvn276qRA9OJQH+tstsi/SnmfZuAK2EYBQAo1CgBSqFEAkEKNAoAUahQApFCjACCFGgUAKdQoAEihRgFACncxxaK8vLzy8nIhRHJy8kjmc3JywvPczgT8k6bruuoMADCOcagHACnUKABIoUYBQAo1CgBSqFEAkEKNAoCU/wPxDy1LWvfXLAAAAGp6VFh0cmRraXRQS0wgcmRraXQgMjAyNC4wMy41AAB4nHu/b+09BiAQAGJGBghgBWIWIG5gZFPQAAmwMEJoRlw0N1ArEwMDM1ifE0hI3A1qICPMRDiQb+22h7ELbbn2X19cYI/E3o+uVgwAXR8Qu7bn
LOQAAAC8elRYdE1PTCByZGtpdCAyMDI0LjAzLjUAAHichZBBDsIgEEX3nOJfoM2MhQXLAo0aU5ooegf33j8OKtJGrQOLYXjM/I9CjmM4XG94RxeUAmhlW2tx6YhIjcgJ3LDdR/jUu1Lx0zmmEwy0vJC1JPs0jaXC8KCWHvEtKdwGO3BL9h/XCddQa/Tr3uofoF6AzQpphPycWMUUbohhYe1p1k0xVLMsRrh6YtHLVTqLKq4CWUbzvPu8Vz6XL5dc3QFG9FbwxxAzYQAAAE56VFh0U01JTEVTIHJka2l0IDIwMjQuMDMuNQAAeJyL9oh11oj2iNUEE0CsUKNhqGdgqWOgY20AInQN9ExNdAz0LE1gbF0IByQLUqlZAwAtZg8eJTQRdQAAAABJRU5ErkJggg==",
      "text/plain": [
       "<rdkit.Chem.rdchem.RWMol at 0x7fcb2e70f420>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rd_mol"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da729abe-5559-466a-8c1a-54de47957969",
   "metadata": {},
   "source": [
    "## Add a record to a dataset\n",
    "\n",
    "To add a record to the dataset, we use the `add_record` function of `SourceDataset`.\n",
    "\n",
    "Note, the `name` field of the record is used as a unique identifier.  You cannot add two records with the same `name`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "56329f06-7440-44e9-82aa-4a1a38e1b874",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "new_dataset.add_record(record_mol1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c58e523c-68bf-4271-a431-ea7e7532dae5",
   "metadata": {},
   "source": [
    "The entire dataset can validated.  This essentially just calls the validate function on the individual records, as well as ensure that the minimal set of properties exist (atomic_numbers, energies, and positions). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "bac9ccc5-867a-4f54-a2ea-4f74cebb6cf0",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 357.88it/s]\n",
      "\u001b[32m2025-07-16 15:30:15.940\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36mmodelforge.curate.sourcedataset\u001b[0m:\u001b[36mvalidate_records\u001b[0m:\u001b[36m857\u001b[0m - \u001b[1mAll records validated successfully.\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_dataset.validate_records()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14815ba2-8dac-4d0b-95c6-013f9403fe99",
   "metadata": {},
   "source": [
    "## Saving to an HDF5 file\n",
    "\n",
    "To save ths to an hdf5 file, we call the `to_hdf5` function of the `SourceDataset` class, passing the output path and filename. This will automatically perform the validation discussed above before we write to the file. \n",
    "\n",
    "Additionally, when writing the file, it will convert records to a consistent unit system (by default, `kilojoules_per_mole` and `nanometers` are the base unit system for energy and distance), as defined by the `GlobalUnitSystem` class (discussed below).\n",
    "\n",
    "Note this returns the md5 checksum of the file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "5f32cd4f-330d-4ebf-a27f-1a36b6751103",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m2025-07-16 15:30:18.502\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36mmodelforge.curate.sourcedataset\u001b[0m:\u001b[36mto_hdf5\u001b[0m:\u001b[36m990\u001b[0m - \u001b[1mValidating records\u001b[0m\n",
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 360.68it/s]\n",
      "\u001b[32m2025-07-16 15:30:18.507\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36mmodelforge.curate.sourcedataset\u001b[0m:\u001b[36mvalidate_records\u001b[0m:\u001b[36m857\u001b[0m - \u001b[1mAll records validated successfully.\u001b[0m\n",
      "\u001b[32m2025-07-16 15:30:18.507\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36mmodelforge.curate.sourcedataset\u001b[0m:\u001b[36mto_hdf5\u001b[0m:\u001b[36m993\u001b[0m - \u001b[1mWriting records to .//test_dataset.hdf5\u001b[0m\n",
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 439.65it/s]\n"
     ]
    }
   ],
   "source": [
    "checksum = new_dataset.to_hdf5(file_path=\"./\", file_name=\"test_dataset.hdf5\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49fe5937-552e-4948-a3f1-559c65181184",
   "metadata": {},
   "source": [
    "## Reading from an HDF5 file\n",
    "\n",
    "Any hdf5 files generated with the `modelforge.curate` module can also be loaded into an instance of `SourceDataset` using the `create_dataset_from_hdf5` function. \n",
    "\n",
    "Note, the HDF5 files do not retain information about the class of the property, i.e., it does not know that position data came from an instance of the `Positions` pydantic class; the name given to a property has no restrictions and thus one may not be able to infer class type. \n",
    "\n",
    "To recreate this information, we can provide a simple dictionary that maps a given property name to the property class.  If this property mapping is not provided, all data will be loaded into instances of the parent class `PropertyBaseModel`.  While the `PropertyBaseModel` contains the same information (i.e., name, value, units, classification, property_type), it does not contain any validators specific to a given property and may raise warnings if the dataset is written back to file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "098624a0-c6a3-40b1-9f32-da3799fdfc01",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m2025-07-16 15:30:20.642\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.sourcedataset\u001b[0m:\u001b[36m__init__\u001b[0m:\u001b[36m66\u001b[0m - \u001b[33m\u001b[1mDatabase file test_dataset2.sqlite already exists in ./. Removing it.\u001b[0m\n",
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 118.00it/s]\n"
     ]
    }
   ],
   "source": [
    "from modelforge.curate import create_dataset_from_hdf5\n",
    "\n",
    "property_map = {\"atomic_numbers\": AtomicNumbers, \"positions\": Positions, \"energies\": Energies}\n",
    "new_dataset_from_hdf5 = create_dataset_from_hdf5(hdf5_filename=\"test_dataset.hdf5\", dataset_name=\"test_dataset2\", property_map=property_map)"
   ]
  },
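  {
   "cell_type": "markdown",
   "id": "sketch-property-map-fallback",
   "metadata": {},
   "source": [
    "The fallback behavior described above can be pictured as a dictionary lookup with a default.  The following is a minimal sketch in plain Python, not the `modelforge` implementation: `BaseSketch`, `PositionsSketch`, `EnergiesSketch`, and `rebuild_property` are hypothetical names introduced only for illustration. \n",
    "\n",
    "```python\n",
    "# Hypothetical stand-ins for property classes (not the modelforge classes)\n",
    "class BaseSketch:\n",
    "    def __init__(self, name, value):\n",
    "        self.name = name\n",
    "        self.value = value\n",
    "\n",
    "\n",
    "class PositionsSketch(BaseSketch):\n",
    "    pass\n",
    "\n",
    "\n",
    "class EnergiesSketch(BaseSketch):\n",
    "    pass\n",
    "\n",
    "\n",
    "def rebuild_property(name, value, property_map=None):\n",
    "    # look up the class by property name; fall back to the base class if unmapped\n",
    "    cls = (property_map or {}).get(name, BaseSketch)\n",
    "    return cls(name, value)\n",
    "\n",
    "\n",
    "sketch_map = {'positions': PositionsSketch, 'energies': EnergiesSketch}\n",
    "mapped = rebuild_property('positions', [[0.0, 0.0, 0.0]], sketch_map)\n",
    "unmapped = rebuild_property('dipole_moment', [0.1], sketch_map)\n",
    "\n",
    "print(type(mapped).__name__)    # PositionsSketch\n",
    "print(type(unmapped).__name__)  # BaseSketch\n",
    "```\n",
    "\n",
    "Here `dipole_moment` has no entry in the map, so it falls back to the base class, mirroring how unmapped properties are described as loading into `PropertyBaseModel`. "
   ]
  },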
  {
   "cell_type": "markdown",
   "id": "4803c448-8e18-485d-bd7a-813a3e6fe436",
   "metadata": {},
   "source": [
    "If we examine the original dataset and the one loaded from the hdf5 file we will see they contain the same information. \n",
    "\n",
    "Note that, because we convert units when we write to file (as discussed above), the \"energies\" property is in units of `kilojoules_per_mole` for the dataset read in from the file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "1b23c1b6-6fda-49b9-8144-f8de08ca42c3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "original dataset\n",
      "\n",
      " -records: ['mol1']\n",
      " -total records:  1\n",
      " -total configs:  1\n",
      " -record mol1: \n",
      "\n",
      " name: mol1\n",
      "* n_atoms: 5\n",
      "* n_configs: 1\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[6],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=5\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[ 0.   ,  0.   ,  0.   ],\n",
      "        [ 0.109,  0.   ,  0.   ],\n",
      "        [-0.054,  0.094,  0.   ],\n",
      "        [-0.054, -0.094,  0.   ],\n",
      "        [ 0.   ,  0.   ,  0.109]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=5\n",
      "* per-system properties: (['energies']):\n",
      " -  name='energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='C' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n",
      "***************\n",
      "\n",
      "dataset generated from hdf5\n",
      "\n",
      " -records: ['mol1']\n",
      " -total records:  1\n",
      " -total configs:  1\n",
      " -record mol1: \n",
      "\n",
      " name: mol1\n",
      "* n_atoms: 5\n",
      "* n_configs: 1\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[6],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=5\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[ 0.   ,  0.   ,  0.   ],\n",
      "        [ 0.109,  0.   ,  0.   ],\n",
      "        [-0.054,  0.094,  0.   ],\n",
      "        [-0.054, -0.094,  0.   ],\n",
      "        [ 0.   ,  0.   ,  0.109]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=5\n",
      "* per-system properties: (['energies']):\n",
      " -  name='energies' value=array([[262.54996395]]) units=<Unit('kilojoule_per_mole')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='C' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"original dataset\\n\")\n",
    "print(\" -records:\", new_dataset.keys())\n",
    "print(\" -total records: \",new_dataset.total_records())\n",
    "print(\" -total configs: \", new_dataset.total_configs())\n",
    "print(\" -record mol1: \\n\\n\", new_dataset.get_record(\"mol1\"))\n",
    "\n",
    "print(\"***************\\n\")\n",
    "print(\"dataset generated from hdf5\\n\")\n",
    "print(\" -records:\", new_dataset_from_hdf5.keys())\n",
    "print(\" -total records: \",new_dataset_from_hdf5.total_records())\n",
    "print(\" -total configs: \", new_dataset_from_hdf5.total_configs())\n",
    "print(\" -record mol1: \\n\\n\", new_dataset_from_hdf5.get_record(\"mol1\"))\n"
   ]
  },
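  {
   "cell_type": "markdown",
   "id": "sketch-hartree-to-kj-per-mol",
   "metadata": {},
   "source": [
    "The converted energy shown above can be checked by hand.  The short sketch below assumes a conversion factor of roughly 2625.4996395 kJ/mol per hartree (a value supplied here for illustration, not read from `modelforge`) and reproduces the stored value to within rounding. \n",
    "\n",
    "```python\n",
    "# assumed conversion factor: 1 hartree expressed in kJ/mol\n",
    "HARTREE_TO_KJ_PER_MOL = 2625.4996395\n",
    "\n",
    "energy_hartree = 0.1\n",
    "energy_kj_per_mol = energy_hartree * HARTREE_TO_KJ_PER_MOL\n",
    "\n",
    "print(f'{energy_kj_per_mol:.4f} kJ/mol')  # 262.5500 kJ/mol\n",
    "```"
   ]
  },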
  {
   "cell_type": "markdown",
   "id": "70d3c511-e525-4eee-bd5c-3aa00b38847d",
   "metadata": {},
   "source": [
    "## Global Unit system and unit validation\n",
    "\n",
    "When defining individual properties, units are also validated.  When defining a property, users can specify any unit that is:\n",
    "- (1) supported by openff.units\n",
    "- (2) compatible with the parameter type (i.e., Positions expect a unit of length).\n",
    "\n",
    "Bullet 2 is assessed by comparing to the default values in the `GlobalUnitSystem` class (note, we are not making any unit conversions at the point of initializing a record, just checking for compatibility). \n",
    "\n",
    "The following will fail validation because we expect positions to be defined in distance units. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "08fe229e-be4c-4648-80d8-8f0143aa8021",
   "metadata": {},
   "outputs": [
    {
     "ename": "ValidationError",
     "evalue": "1 validation error for Positions\n  Value error, Unit angstrom ** 2 of positions are not compatible with the property type length.\n [type=value_error, input_value={'value': [[[1.0, 1.0, 1....<Unit('angstrom ** 2')>}, input_type=dict]\n    For further information visit https://errors.pydantic.dev/2.8/v/value_error",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValidationError\u001b[0m                           Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[22], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m pos \u001b[38;5;241m=\u001b[39m \u001b[43mPositions\u001b[49m\u001b[43m(\u001b[49m\u001b[43mvalue\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m[\u001b[49m\u001b[43m[\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m1.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m1.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m1.0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m2.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m2.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m2.0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m3.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m3.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m3.0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m]\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43munits\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43munit\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mangstrom\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43munit\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mangstrom\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/anaconda3/envs/modelforge311/lib/python3.11/site-packages/pydantic/main.py:193\u001b[0m, in \u001b[0;36mBaseModel.__init__\u001b[0;34m(self, **data)\u001b[0m\n\u001b[1;32m    191\u001b[0m \u001b[38;5;66;03m# `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks\u001b[39;00m\n\u001b[1;32m    192\u001b[0m __tracebackhide__ \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[0;32m--> 193\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m__pydantic_validator__\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mvalidate_python\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mself_instance\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m)\u001b[49m\n",
      "\u001b[0;31mValidationError\u001b[0m: 1 validation error for Positions\n  Value error, Unit angstrom ** 2 of positions are not compatible with the property type length.\n [type=value_error, input_value={'value': [[[1.0, 1.0, 1....<Unit('angstrom ** 2')>}, input_type=dict]\n    For further information visit https://errors.pydantic.dev/2.8/v/value_error"
     ]
    }
   ],
   "source": [
    "pos = Positions(value=[[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]], units=unit.angstrom*unit.angstrom)\n"
   ]
  },
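  {
   "cell_type": "markdown",
   "id": "sketch-dimension-compatibility",
   "metadata": {},
   "source": [
    "Conceptually, two units are compatible when they share the same physical dimensions, regardless of scale.  The sketch below illustrates that idea in plain Python; the `DIMENSIONS` table and the `is_compatible` helper are hypothetical and only stand in for what a unit library such as `openff.units` works out internally. \n",
    "\n",
    "```python\n",
    "# Hypothetical table of base-dimension exponents for a few units\n",
    "DIMENSIONS = {\n",
    "    'nanometer': {'length': 1},\n",
    "    'angstrom': {'length': 1},\n",
    "    'angstrom ** 2': {'length': 2},\n",
    "    'hartree': {'mass': 1, 'length': 2, 'time': -2},\n",
    "}\n",
    "\n",
    "\n",
    "def is_compatible(candidate, reference):\n",
    "    # compatible means identical dimensions; the scale (nm vs angstrom) is irrelevant\n",
    "    return DIMENSIONS[candidate] == DIMENSIONS[reference]\n",
    "\n",
    "\n",
    "default_length_unit = 'nanometer'\n",
    "print(is_compatible('angstrom', default_length_unit))       # True\n",
    "print(is_compatible('angstrom ** 2', default_length_unit))  # False\n",
    "print(is_compatible('angstrom', 'hartree'))                 # False\n",
    "```\n",
    "\n",
    "No value is converted at any point; only the dimensions are compared, which matches the check-only behavior described above. "
   ]
  },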
  {
   "cell_type": "markdown",
   "id": "4085f29a-b005-4e2b-8dcb-86c58e1fca19",
   "metadata": {},
   "source": [
    "Units are stored as class attributes within the `GlobalUnitSystem` class. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "4684a979-8564-4639-96c4-806c0ab55a8e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "area : nanometer ** 2\n",
      "atomic_numbers : dimensionless\n",
      "charge : elementary_charge\n",
      "dimensionless : dimensionless\n",
      "dipole_moment : elementary_charge * nanometer\n",
      "energy : kilojoule_per_mole\n",
      "force : kilojoule_per_mole / nanometer\n",
      "frequency : gigahertz\n",
      "heat_capacity : kilojoule_per_mole / kelvin\n",
      "length : nanometer\n",
      "name : default\n",
      "octupole_moment : elementary_charge * nanometer ** 3\n",
      "polarizability : nanometer ** 3\n",
      "quadrupole_moment : elementary_charge * nanometer ** 2\n",
      "wavenumber : 1 / centimeter\n"
     ]
    }
   ],
   "source": [
    "from modelforge.utils.units import GlobalUnitSystem\n",
    "print(GlobalUnitSystem())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fbcb58d-ac94-48ae-b234-dbf07b644f33",
   "metadata": {},
   "source": [
    "Since these are class attributes, not instance variables, any changes or additions to the `GlobalUnitSystem `will apply to all usages within the script. For example, the following will change the units for length to angstroms. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "13a1259d-3b2c-46df-b7ce-0d26ea572282",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "angstrom\n"
     ]
    }
   ],
   "source": [
    "GlobalUnitSystem.set_global_units('length', unit.angstrom)\n",
    "\n",
    "print(GlobalUnitSystem.get_units('length'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7891f20e-2a77-431f-9c2d-53ba18725d35",
   "metadata": {},
   "source": [
    "The `set_global_units` function can also be used to add in a new property_type and associated units.  For example, the following would add pressure as a possible property_type. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "97b5878b-9ad7-4bef-97af-79031c5ea269",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "standard_atmosphere\n"
     ]
    }
   ],
   "source": [
    "GlobalUnitSystem.set_global_units('pressure', unit.atmosphere)\n",
    "\n",
    "print(GlobalUnitSystem.get_units('pressure'))"
   ]
  },
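  {
   "cell_type": "markdown",
   "id": "sketch-class-attribute-pattern",
   "metadata": {},
   "source": [
    "Why a change made through the class is visible everywhere can be illustrated with a small stand-alone class.  This is only a sketch of the class-attribute pattern; `UnitDefaultsSketch` and its methods are hypothetical, and unit names are stored as plain strings rather than `openff.units` objects. \n",
    "\n",
    "```python\n",
    "class UnitDefaultsSketch:\n",
    "    # class attributes: a single shared copy for the whole Python session\n",
    "    length = 'nanometer'\n",
    "    energy = 'kilojoule_per_mole'\n",
    "\n",
    "    @classmethod\n",
    "    def set_units(cls, property_type, units):\n",
    "        # overwrites an existing entry or adds a new one on the class itself\n",
    "        setattr(cls, property_type, units)\n",
    "\n",
    "    @classmethod\n",
    "    def get_units(cls, property_type):\n",
    "        return getattr(cls, property_type)\n",
    "\n",
    "\n",
    "first = UnitDefaultsSketch()\n",
    "second = UnitDefaultsSketch()\n",
    "\n",
    "UnitDefaultsSketch.set_units('length', 'angstrom')\n",
    "print(first.length, second.length)  # angstrom angstrom\n",
    "\n",
    "UnitDefaultsSketch.set_units('pressure', 'atmosphere')\n",
    "print(second.get_units('pressure'))  # atmosphere\n",
    "```\n",
    "\n",
    "Both instances see the new length unit because neither holds its own copy of the attribute; the lookup falls through to the class. "
   ]
  },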
  {
   "cell_type": "markdown",
   "id": "2269126d-10c9-482b-bf45-91fa906b2207",
   "metadata": {},
   "source": [
    "Changing the global unit system, e.g., making the nonsensical choice to set length to an energy unit, results in the validation to fail when defining positions with the units of angstrom. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "a1fcc5e4-350c-4b37-bbcf-49d77f9c017a",
   "metadata": {},
   "outputs": [
    {
     "ename": "ValidationError",
     "evalue": "1 validation error for Positions\n  Value error, Unit angstrom of positions are not compatible with the property type length.\n [type=value_error, input_value={'value': [[[1.0, 1.0, 1....ts': <Unit('angstrom')>}, input_type=dict]\n    For further information visit https://errors.pydantic.dev/2.8/v/value_error",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValidationError\u001b[0m                           Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[26], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m GlobalUnitSystem\u001b[38;5;241m.\u001b[39mset_global_units(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlength\u001b[39m\u001b[38;5;124m'\u001b[39m, unit\u001b[38;5;241m.\u001b[39mhartree)\n\u001b[0;32m----> 2\u001b[0m pos \u001b[38;5;241m=\u001b[39m \u001b[43mPositions\u001b[49m\u001b[43m(\u001b[49m\u001b[43mvalue\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m[\u001b[49m\u001b[43m[\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m1.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m1.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m1.0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m2.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m2.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m2.0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m3.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m3.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m3.0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m]\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43munits\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43munit\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mangstrom\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/anaconda3/envs/modelforge311/lib/python3.11/site-packages/pydantic/main.py:193\u001b[0m, in \u001b[0;36mBaseModel.__init__\u001b[0;34m(self, **data)\u001b[0m\n\u001b[1;32m    191\u001b[0m \u001b[38;5;66;03m# `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks\u001b[39;00m\n\u001b[1;32m    192\u001b[0m __tracebackhide__ \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[0;32m--> 193\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m__pydantic_validator__\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mvalidate_python\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mself_instance\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m)\u001b[49m\n",
      "\u001b[0;31mValidationError\u001b[0m: 1 validation error for Positions\n  Value error, Unit angstrom of positions are not compatible with the property type length.\n [type=value_error, input_value={'value': [[[1.0, 1.0, 1....ts': <Unit('angstrom')>}, input_type=dict]\n    For further information visit https://errors.pydantic.dev/2.8/v/value_error"
     ]
    }
   ],
   "source": [
    "GlobalUnitSystem.set_global_units('length', unit.hartree)\n",
    "pos = Positions(value=[[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]], units=unit.angstrom)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "29ebda4c-cb79-42ed-bb34-05a264427d08",
   "metadata": {},
   "outputs": [],
   "source": [
    "GlobalUnitSystem.set_global_units('length', unit.nanometer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "7309871d-48ae-4719-b684-777697a7f82e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 nanometer\n"
     ]
    }
   ],
   "source": [
    "print(10*GlobalUnitSystem.get_units('length'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f23d946a-668c-4cbe-9e9e-1ff43742f066",
   "metadata": {},
   "source": [
    "When hdf5 files are generated, quantities are automatically convert to the units specified in the `GlobalUnitSystem`. Note, this is not an inplace transformation, it does not change the values of the underlying properties. \n",
    "\n",
    "\n",
    "### Converting units of properties stored in the dataset\n",
    "\n",
    "The following will perform an inplace unit conversion of a property (i.e., that updates the values stored in the property).  This will check for unit compatibility. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "3b63ca3b-b33d-4918-94e8-48116d9e80d9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name='energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      "name='energies' value=array([[62.75094741]]) units=<Unit('kilocalorie_per_mole')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n"
     ]
    }
   ],
   "source": [
    "# print out initial energies property\n",
    "\n",
    "print(energies)\n",
    "\n",
    "# perform unit conversion \n",
    "energies.convert_units(unit.kilocalorie_per_mole)\n",
    "\n",
    "print(energies)"
   ]
  },
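  {
   "cell_type": "markdown",
   "id": "sketch-inplace-vs-copy-conversion",
   "metadata": {},
   "source": [
    "The difference between an in-place conversion and a conversion that leaves the original untouched (as when writing to file) can be sketched as follows.  `EnergySketch`, its methods, and the conversion factors are assumptions made for illustration; they are not part of `modelforge`. \n",
    "\n",
    "```python\n",
    "import copy\n",
    "\n",
    "# assumed conversion factors from each unit to kJ/mol\n",
    "TO_KJ_PER_MOL = {\n",
    "    'hartree': 2625.4996395,\n",
    "    'kilocalorie_per_mole': 4.184,\n",
    "    'kilojoule_per_mole': 1.0,\n",
    "}\n",
    "\n",
    "\n",
    "class EnergySketch:\n",
    "    def __init__(self, value, units):\n",
    "        self.value = value\n",
    "        self.units = units\n",
    "\n",
    "    def convert_units(self, new_units):\n",
    "        # in-place: overwrite the stored value and units\n",
    "        self.value = self.value * TO_KJ_PER_MOL[self.units] / TO_KJ_PER_MOL[new_units]\n",
    "        self.units = new_units\n",
    "\n",
    "    def converted_copy(self, new_units):\n",
    "        # not in-place: convert a copy and leave the original untouched\n",
    "        clone = copy.copy(self)\n",
    "        clone.convert_units(new_units)\n",
    "        return clone\n",
    "\n",
    "\n",
    "e = EnergySketch(0.1, 'hartree')\n",
    "written = e.converted_copy('kilojoule_per_mole')\n",
    "print(e.units, written.units)  # hartree kilojoule_per_mole\n",
    "\n",
    "e.convert_units('kilocalorie_per_mole')\n",
    "print(round(e.value, 4), e.units)  # 62.7509 kilocalorie_per_mole\n",
    "```"
   ]
  },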
  {
   "cell_type": "markdown",
   "id": "41e00671-4f81-4f63-ad28-b55283edd859",
   "metadata": {},
   "source": [
    "The following will convert all properties in a record to the `GlobalUnitSystem`.  note this will not perform unit conversion for meta_data. Note, since `positions` are already in units of `nanometer` these will not be changed, but `energies`  will be converted from `hartrees` to `kilojoule_per_mole`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "0c3b71f8-be34-4e92-87d0-d6646b0b326f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name: mol1\n",
      "* n_atoms: 5\n",
      "* n_configs: 1\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[6],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=5\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[ 0.   ,  0.   ,  0.   ],\n",
      "        [ 0.109,  0.   ,  0.   ],\n",
      "        [-0.054,  0.094,  0.   ],\n",
      "        [-0.054, -0.094,  0.   ],\n",
      "        [ 0.   ,  0.   ,  0.109]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=5\n",
      "* per-system properties: (['energies']):\n",
      " -  name='energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='C' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n",
      "name: mol1\n",
      "* n_atoms: 5\n",
      "* n_configs: 1\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[6],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=5\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[ 0.   ,  0.   ,  0.   ],\n",
      "        [ 0.109,  0.   ,  0.   ],\n",
      "        [-0.054,  0.094,  0.   ],\n",
      "        [-0.054, -0.094,  0.   ],\n",
      "        [ 0.   ,  0.   ,  0.109]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=5\n",
      "* per-system properties: (['energies']):\n",
      " -  name='energies' value=array([[262.54996395]]) units=<Unit('kilojoule_per_mole')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='C' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n"
     ]
    }
   ],
   "source": [
    "#first print out the record\n",
    "print(record_mol1)\n",
    "\n",
    "# convert units \n",
    "\n",
    "record_mol1.convert_to_global_unit_system()\n",
    "\n",
    "print(record_mol1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00c82a10-12ff-420e-99e8-3d17d00761a7",
   "metadata": {},
   "source": [
    "All records and properties in a dataset can be converted as well using the `convert_to_global_unit_system`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "3123e13c-9406-4bfd-9f0b-122732d3d567",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name: mol1\n",
      "* n_atoms: 5\n",
      "* n_configs: 1\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[6],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=5\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[ 0.   ,  0.   ,  0.   ],\n",
      "        [ 0.109,  0.   ,  0.   ],\n",
      "        [-0.054,  0.094,  0.   ],\n",
      "        [-0.054, -0.094,  0.   ],\n",
      "        [ 0.   ,  0.   ,  0.109]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=5\n",
      "* per-system properties: (['energies']):\n",
      " -  name='energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='C' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n",
      "name: mol1\n",
      "* n_atoms: 5\n",
      "* n_configs: 1\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[6],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1],\n",
      "       [1]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=5\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[ 0.   ,  0.   ,  0.   ],\n",
      "        [ 0.109,  0.   ,  0.   ],\n",
      "        [-0.054,  0.094,  0.   ],\n",
      "        [-0.054, -0.094,  0.   ],\n",
      "        [ 0.   ,  0.   ,  0.109]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=5\n",
      "* per-system properties: (['energies']):\n",
      " -  name='energies' value=array([[262.54996395]]) units=<Unit('kilojoule_per_mole')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='C' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# print the initial dataset \n",
    "print(new_dataset.get_record('mol1'))\n",
    "\n",
    "new_dataset.convert_to_global_unit_system()\n",
    "\n",
    "#print out after conversion\n",
    "print(new_dataset.get_record('mol1'))\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}