{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2b1e109f-b2f5-42a8-bcc3-0acf2d7af9a0",
   "metadata": {},
   "source": [
    "# modelforge.curate : properties\n",
    "\n",
    "This notebook will focus on a more thorough examination of defining properties."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6cf4ef4a-d9b9-49bc-bd8a-3f5e62953c5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from modelforge.curate import Record, SourceDataset\n",
    "from modelforge.utils.units import GlobalUnitSystem\n",
    "from modelforge.curate.properties import AtomicNumbers, Positions, Energies, Forces, MetaData\n",
    "\n",
    "from openff.units import unit\n",
    "\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afa88c4a-8428-4904-b77f-84732ce8ea66",
   "metadata": {},
   "source": [
    "## Properties\n",
    "\n",
    "Each property inherits from the `PropertyBaseClass` pydantic model and has the following fields:\n",
    "\n",
    "- `name` : str : unique identifier for the property\n",
    "- `value` : ndarray : array containing the values (note, the `MetaData` property allows this to be  set to a str, int, float, and list in addition to a numpy array) \n",
    "- `units` : unit.Unit : OpenFF.units \n",
    "- `classification` : PropertyClassification enum : specifies if the property is \"atomic_numbers\", \"per_atom\", \"per_system\", or \"meta_data\"\n",
    "- `property_type` : PropertyType enum: specifies the type of property (e.g., length, energy, force, etc.) used for validating the specified `units`\n",
    "\n",
    "`classification` and `property_type` are inherent to the property and do not need to be modified when a property is instantiated.  \n",
    "\n",
    "While a default value is set for `name` field for each property (e.g., \"energies\" for the `Energies` property), this value typically should be set at the time of instantiation to a unique and appropriate key. Setting the `name` field will be essentialy for records that contain, e.g., multiple energy entries (e.g., total_energy, dispersion_energy, electronic_energy, etc.). "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56ed64ae-5940-4f84-91bf-8fdfce6e13f0",
   "metadata": {},
   "source": [
    "The following demonstrates defining a record with properties \"atomic_numbers\", \"positions\", \"total_energies\", \"dispersion_energies\", and \"smiles\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bc123b4c-9a29-4703-82ee-fd2bc8eaa970",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name: mol1\n",
      "* n_atoms: 2\n",
      "* n_configs: 1\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[1],\n",
      "       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[1., 1., 1.],\n",
      "        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2\n",
      "* per-system properties: (['total_energies', 'dispersion_energies']):\n",
      " -  name='total_energies' value=array([[1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      " -  name='dispersion_energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n"
     ]
    }
   ],
   "source": [
    "atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))\n",
    "\n",
    "positions = Positions(\n",
    "    value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]), \n",
    "    units=\"nanometer\"\n",
    ")\n",
    "\n",
    "total_energies = Energies(\n",
    "    name=\"total_energies\",\n",
    "    value=np.array([[1]]), \n",
    "    units=unit.hartree\n",
    ")\n",
    "\n",
    "dispersion_energies = Energies(\n",
    "    name=\"dispersion_energies\",\n",
    "    value=np.array([[0.1]]), \n",
    "    units=unit.hartree\n",
    ")   \n",
    "\n",
    "smiles = MetaData(name='smiles', value='[CH]')\n",
    "\n",
    "record_mol1 = Record(name='mol1')\n",
    "record_mol1.add_properties([total_energies, dispersion_energies, atomic_numbers, positions, smiles])\n",
    "\n",
    "print(record_mol1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "622b5927-5131-454d-bfd9-0f992b337b33",
   "metadata": {},
   "source": [
    "As noted in the \"basic_usage.ipynb\" notebook, the `name` field is used as a unique key.  An error will be raised if we try to add a property with the same key twice. E.g., the following will raise an error as we have already set the \"total_energies\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0b3504af-1a92-4539-89c6-a04691aea222",
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Property with name total_energies already exists in the record mol1.Set append_property=True to append to the existing property.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[3], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mrecord_mol1\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43madd_property\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtotal_energies\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/PycharmProjects/modelforge/modelforge-curate/modelforge/curate/record.py:433\u001b[0m, in \u001b[0;36mRecord.add_property\u001b[0;34m(self, property)\u001b[0m\n\u001b[1;32m    429\u001b[0m     error_msg \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mProperty with name \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mname\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m already exists in the record \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mname\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    430\u001b[0m     error_msg \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m (\n\u001b[1;32m    431\u001b[0m         \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSet append_property=True to append to the existing property.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    432\u001b[0m     )\n\u001b[0;32m--> 433\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(error_msg)\n\u001b[1;32m    435\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m (\n\u001b[1;32m    436\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mper_system[\u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mname]\u001b[38;5;241m.\u001b[39mvalue\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m1\u001b[39m]\n\u001b[1;32m    437\u001b[0m     \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mvalue\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m1\u001b[39m]\n\u001b[1;32m    438\u001b[0m )\n\u001b[1;32m    439\u001b[0m temp_array \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mvalue\n",
      "\u001b[0;31mValueError\u001b[0m: Property with name total_energies already exists in the record mol1.Set append_property=True to append to the existing property."
     ]
    }
   ],
   "source": [
    "record_mol1.add_property(total_energies)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00ce9b70-6f36-4b67-bfb0-f4156fe7f0f9",
   "metadata": {},
   "source": [
    "## Appending properties\n",
    "\n",
    "In some cases, we may not have data for all configurations available to use when instantiating a property.  For example, the positions for different configurations may exist in different .xyz files.  To handle these cases, the `Record` class can be instantiated with `append_property` set to `True`.  In such cases, adding a property a second time will append the new data to the existing array. \n",
    "\n",
    "For example, the following will use initialize the same `Record` as above, but allowing properties to be appended:abs\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "e4051c70-93ba-4111-97fd-fa09d4e67b6f",
   "metadata": {},
   "outputs": [],
   "source": [
    "record_mol1_append = Record(name='mol1', append_property=\"True\")\n",
    "record_mol1_append.add_properties([total_energies, dispersion_energies, atomic_numbers, positions, smiles])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15208d29-65c7-4255-b774-506234bced89",
   "metadata": {},
   "source": [
    "Now, if we add \"total_energies\" a second time, this will not raise an error, rather it will append the energy to the existing array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ed71aafa-e613-49ff-a943-5463e6ff806d",
   "metadata": {},
   "outputs": [],
   "source": [
    "record_mol1_append.add_property(total_energies)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb7d3d90-7bc1-4a70-9733-9c4edff76057",
   "metadata": {},
   "source": [
    "If print the record we will now see that the \"total_energies\" property now contains `value` = `[[1], [1]]` and reports n_configs = 2.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "43599eaa-8dd3-49ab-9fc3-feabb2360490",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m2025-05-28 16:08:14.963\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m265\u001b[0m - \u001b[33m\u001b[1mNumber of configurations for properties in record mol1 are not consistent.\u001b[0m\n",
      "\u001b[32m2025-05-28 16:08:14.965\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m269\u001b[0m - \u001b[33m\u001b[1m - positions : 1\u001b[0m\n",
      "\u001b[32m2025-05-28 16:08:14.966\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m271\u001b[0m - \u001b[33m\u001b[1m - total_energies : 2\u001b[0m\n",
      "\u001b[32m2025-05-28 16:08:14.967\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m271\u001b[0m - \u001b[33m\u001b[1m - dispersion_energies : 1\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name: mol1\n",
      "* n_atoms: 2\n",
      "* n_configs: cannot be determined, see warnings log\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[1],\n",
      "       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[1., 1., 1.],\n",
      "        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2\n",
      "* per-system properties: (['total_energies', 'dispersion_energies']):\n",
      " -  name='total_energies' value=array([[1],\n",
      "       [1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None\n",
      " -  name='dispersion_energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(record_mol1_append)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "357268d0-a008-436e-adfc-6fe0c1f5723f",
   "metadata": {},
   "source": [
    "Note, this produces several warnings because the number of configurations is now not consistent in the record (printing the record calls the validate function in the class)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "7abdccc2-e49a-455f-b8ea-9adad54aea7f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m2025-05-28 16:08:16.184\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m265\u001b[0m - \u001b[33m\u001b[1mNumber of configurations for properties in record mol1 are not consistent.\u001b[0m\n",
      "\u001b[32m2025-05-28 16:08:16.185\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m269\u001b[0m - \u001b[33m\u001b[1m - positions : 1\u001b[0m\n",
      "\u001b[32m2025-05-28 16:08:16.187\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m271\u001b[0m - \u001b[33m\u001b[1m - total_energies : 2\u001b[0m\n",
      "\u001b[32m2025-05-28 16:08:16.188\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m271\u001b[0m - \u001b[33m\u001b[1m - dispersion_energies : 1\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "record_mol1_append.validate()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fce8da8-5293-4004-8994-c01dc2f467d5",
   "metadata": {},
   "source": [
    "To resolve this we simply can add the \"positions\" and \"dispersion_energies\" a second time as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "92050288-0865-4c02-a588-c6e2b4e9d644",
   "metadata": {},
   "outputs": [],
   "source": [
    "record_mol1_append.add_properties([dispersion_energies, positions])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "eea76650-66e6-4cf9-b818-bc580794f1ff",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name: mol1\n",
      "* n_atoms: 2\n",
      "* n_configs: 2\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[1],\n",
      "       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[1., 1., 1.],\n",
      "        [2., 2., 2.]],\n",
      "\n",
      "       [[1., 1., 1.],\n",
      "        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2\n",
      "* per-system properties: (['total_energies', 'dispersion_energies']):\n",
      " -  name='total_energies' value=array([[1],\n",
      "       [1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None\n",
      " -  name='dispersion_energies' value=array([[0.1],\n",
      "       [0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(record_mol1_append)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba0eac30-6d69-4ff3-b660-57914949d873",
   "metadata": {},
   "source": [
    "When appending to an existing property, the code will first check to see if the shapes of the array are compatible.  For example, if we try to add positions for a molecule with a different number of atoms, this will produce an error, as the shapes of the arrays are not compatible. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "ee6e6140-5601-4c5f-bc07-1ac5e57943c7",
   "metadata": {},
   "outputs": [
    {
     "ename": "AssertionError",
     "evalue": "mol1: n_atoms of positions does not: 3 != 2.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mAssertionError\u001b[0m                            Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[10], line 3\u001b[0m\n\u001b[1;32m      1\u001b[0m positions2 \u001b[38;5;241m=\u001b[39m Positions(value\u001b[38;5;241m=\u001b[39m [[[\u001b[38;5;241m1\u001b[39m,\u001b[38;5;241m1\u001b[39m,\u001b[38;5;241m1\u001b[39m], [\u001b[38;5;241m2\u001b[39m,\u001b[38;5;241m2\u001b[39m,\u001b[38;5;241m2\u001b[39m], [\u001b[38;5;241m3\u001b[39m,\u001b[38;5;241m3\u001b[39m,\u001b[38;5;241m3\u001b[39m]]], units\u001b[38;5;241m=\u001b[39munit\u001b[38;5;241m.\u001b[39mnanometer)\n\u001b[0;32m----> 3\u001b[0m \u001b[43mrecord_mol1_append\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43madd_property\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpositions2\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/PycharmProjects/modelforge/modelforge-curate/modelforge/curate/record.py:385\u001b[0m, in \u001b[0;36mRecord.add_property\u001b[0;34m(self, property)\u001b[0m\n\u001b[1;32m    380\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(error_msg)\n\u001b[1;32m    381\u001b[0m \u001b[38;5;66;03m# if the property already exists, we will use vstack to add it to the existing array\u001b[39;00m\n\u001b[1;32m    382\u001b[0m \u001b[38;5;66;03m# after first checking that the dimensions are consistent\u001b[39;00m\n\u001b[1;32m    383\u001b[0m \u001b[38;5;66;03m# note we do not check shape[0], as that corresponds to the number of configurations\u001b[39;00m\n\u001b[1;32m    384\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m (\n\u001b[0;32m--> 385\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mper_atom[\u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mname]\u001b[38;5;241m.\u001b[39mvalue\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m1\u001b[39m]\n\u001b[1;32m    386\u001b[0m     \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mvalue\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m1\u001b[39m]\n\u001b[1;32m    387\u001b[0m ), \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mname\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m: n_atoms of \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mname\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m does not: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mvalue\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m1\u001b[39m]\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m != \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mper_atom[\u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mname]\u001b[38;5;241m.\u001b[39mvalue\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m1\u001b[39m]\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    388\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m (\n\u001b[1;32m    389\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mper_atom[\u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mname]\u001b[38;5;241m.\u001b[39mvalue\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m2\u001b[39m]\n\u001b[1;32m    390\u001b[0m     \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mproperty\u001b[39m\u001b[38;5;241m.\u001b[39mvalue\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m2\u001b[39m]\n\u001b[1;32m    391\u001b[0m )\n\u001b[1;32m    392\u001b[0m \u001b[38;5;66;03m# In order to append to the array, everything needs to have the same units\u001b[39;00m\n\u001b[1;32m    393\u001b[0m \u001b[38;5;66;03m# We will use the units of the first property that was added\u001b[39;00m\n",
      "\u001b[0;31mAssertionError\u001b[0m: mol1: n_atoms of positions does not: 3 != 2."
     ]
    }
   ],
   "source": [
    "positions2 = Positions(value= [[[1,1,1], [2,2,2], [3,3,3]]], units=unit.nanometer)\n",
    "\n",
    "record_mol1_append.add_property(positions2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66289212-3e55-4931-9f0d-a3e04b3739e5",
   "metadata": {},
   "source": [
    "The units are also compared and converted if necessary before appending.  For example, we defined energy in units of hartree above;  if we define energy in a different unit and append, it will automatically be converted to hartrees. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a11c8fdc-ba88-41f3-bef4-e5f8eb208abf",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m2025-05-28 16:08:20.506\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m265\u001b[0m - \u001b[33m\u001b[1mNumber of configurations for properties in record mol1 are not consistent.\u001b[0m\n",
      "\u001b[32m2025-05-28 16:08:20.508\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m269\u001b[0m - \u001b[33m\u001b[1m - positions : 2\u001b[0m\n",
      "\u001b[32m2025-05-28 16:08:20.509\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m271\u001b[0m - \u001b[33m\u001b[1m - total_energies : 3\u001b[0m\n",
      "\u001b[32m2025-05-28 16:08:20.509\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.record\u001b[0m:\u001b[36m_validate_n_configs\u001b[0m:\u001b[36m271\u001b[0m - \u001b[33m\u001b[1m - dispersion_energies : 2\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name: mol1\n",
      "* n_atoms: 2\n",
      "* n_configs: cannot be determined, see warnings log\n",
      "* atomic_numbers:\n",
      " -  name='atomic_numbers' value=array([[1],\n",
      "       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2\n",
      "* per-atom properties: (['positions']):\n",
      " -  name='positions' value=array([[[1., 1., 1.],\n",
      "        [2., 2., 2.]],\n",
      "\n",
      "       [[1., 1., 1.],\n",
      "        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2\n",
      "* per-system properties: (['total_energies', 'dispersion_energies']):\n",
      " -  name='total_energies' value=array([[1.       ],\n",
      "       [1.       ],\n",
      "       [0.0015936]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=3 n_atoms=None\n",
      " -  name='dispersion_energies' value=array([[0.1],\n",
      "       [0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None\n",
      "* meta_data: (['smiles'])\n",
      " -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None\n",
      "\n"
     ]
    }
   ],
   "source": [
    "total_energies2 = Energies(\n",
    "    name=\"total_energies\",\n",
    "    value=np.array([[1]]), \n",
    "    units=unit.kilocalories_per_mole\n",
    ")\n",
    "record_mol1_append.add_property(total_energies2)\n",
    "\n",
    "print(record_mol1_append)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca9a977c-0c6e-4656-9416-593636b13692",
   "metadata": {},
   "source": [
    "## Adding properties directly to a dataset\n",
    "\n",
    "Rather than creating an instance of the `Record` class and adding this to the dataset, we can use the `SourceDataset` class directly. The functions in `SourceDataset` effectively just provide wrappers to the functions that exist within the `Record` class. As such, both approaches are equivalent but one may be more convenient depending on the structure of the original dataset that is being curated. \n",
    "\n",
    "The following code performs the same functionality in the two ways. First we will define the common elements (i.e., properties):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "dd6c66ff-58b1-4445-965b-a7042190ec26",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m2025-05-28 16:08:21.307\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[36mmodelforge.curate.sourcedataset\u001b[0m:\u001b[36m__init__\u001b[0m:\u001b[36m66\u001b[0m - \u001b[33m\u001b[1mDatabase file test_dataset.sqlite already exists in ./. Removing it.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "#define the datset\n",
    "new_dataset = SourceDataset('test_dataset')\n",
    "\n",
    "# define the properties\n",
    "atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))\n",
    "positions = Positions(\n",
    "    value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]), \n",
    "    units=\"nanometer\"\n",
    ")\n",
    "\n",
    "total_energies = Energies(\n",
    "    name=\"total_energies\",\n",
    "    value=np.array([[1]]), \n",
    "    units=unit.hartree\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adab4462-cfa5-40f7-9457-1cd0582cf399",
   "metadata": {},
   "source": [
    "Approach 1: Create a Record, add properties to the Record, add Record to the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "246280c6-19af-485d-be08-95d05066b9b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "record_mol1 = Record(\"mol1\")\n",
    "record_mol1.add_properties([atomic_numbers, positions, total_energies])\n",
    "\n",
    "new_dataset.add_record(record_mol1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10d613a6-3340-4f76-8cd4-552129daf0cc",
   "metadata": {},
   "source": [
    "Approach 2: Create a Record within the dataset, add properties to this record within the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "e2c8d973-61eb-4618-9183-d2a5f4ab7dd1",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_dataset.create_record('mol2')\n",
    "new_dataset.add_properties(\"mol2\", [atomic_numbers, positions, total_energies])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "899af36c-f4b3-4712-8962-82d7b5fe34fa",
   "metadata": {},
   "source": [
    "The dataset can also be instantiated with `append_property` set to `True`; the wrapper function within the dataset provides the same functionality as when interacting directly with a record. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "fdfedfdc-ed1a-4c08-9229-e59f2d9765ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "appendable_dataset = SourceDataset(name=\"appendable\", append_property=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e9067d8-31ab-42aa-8f0f-6f3c46c0b3cd",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}