
Added new API support for nextgen Materials Project #692

Merged · 21 commits · Feb 7, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -36,6 +36,8 @@ wheels/
*.egg-info/
.installed.cfg
*.egg
.vscode
.github

# PyInstaller
# Usually these files are written by a python script from a template
@@ -99,6 +101,7 @@ celerybeat-schedule
.venv
venv/
ENV/
env/

# Spyder project settings
.spyderproject
2 changes: 1 addition & 1 deletion README.md
@@ -122,7 +122,7 @@ spktrain experiment=qm9_atomwise run.data_dir=<path> model/representation=painn
```

For more details on config groups, have a look at the
[Hydra docs](https://hydra.cc/docs/next/tutorials/basic/your_first_app/config_groups).
[Hydra docs](https://hydra.cc/docs/tutorials/basic/your_first_app/config_groups/).


### Example 2: Potential energy surfaces
Binary file added builder.docx
stefaanhessmann marked this conversation as resolved.
Binary file not shown.
6 changes: 3 additions & 3 deletions docs/getstarted.rst
@@ -76,9 +76,9 @@ All values of the config can be changed from the command line, including the dir
By default, the model is stored in a directory with a unique run id hash as a subdirectory of ``spk_workdir/runs``.
This can be changed as follows::

$ spktrain experiment=qm9 run.data_dir=/my/data/dir run.path=~/all_my_runs run.id=this_run
$ spktrain experiment=qm9_atomwise run.data_dir=/my/data/dir run.path=~/all_my_runs run.id=this_run

If you call ``spktrain experiment=qm9 --help``, you can see the full config with all the parameters
If you call ``spktrain experiment=qm9_atomwise --help``, you can see the full config with all the parameters
that can be changed.
Nested parameters can be changed as follows::

@@ -114,7 +114,7 @@ If you would want to additionally change some value of this group, you could use
$ spktrain experiment=qm9_atomwise data_dir=<path> model/representation=painn model.representation.n_interactions=5

For more details on config groups, have a look at the
`Hydra docs <https://hydra.cc/docs/next/tutorials/basic/your_first_app/config_groups>`_.
`Hydra docs <https://hydra.cc/docs/tutorials/basic/your_first_app/config_groups/>`_.


Example 2: Potential energy surfaces
58 changes: 29 additions & 29 deletions examples/tutorials/tutorial_01_preparing_data.ipynb
Collaborator: What is changed in this notebook? Is it on purpose?

@sundusaijaz (Collaborator, Author), Feb 6, 2025: "no, It was not mine"

@@ -16,7 +16,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -348,18 +348,21 @@
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"To get a better initialization of the network and avoid numerical issues, we often want to make use of simple statistics of our target properties. The most simple approach is to subtract the mean value of our target property from the labels before training such that the neural networks only have to learn the difference from the mean prediction. A more sophisticated approach is to use so-called atomic reference values that provide basic statistics of our target property based on the atom types in a structure. This is especially useful for extensive properties such as the energy, where the single atom energies contribute a major part to the overall value. If your data comes with atomic reference values, you can add them to the metadata of your `ase` database. The statistics have to be stored in a dictionary with the property names as keys and the atomic reference values as lists where the list indices match the atomic numbers. For further explanation please have a look at the [QM9 tutorial](https://schnetpack.readthedocs.io/en/latest/tutorials/tutorial_02_qm9.html).\n",
"\n",
"Here is an example:"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# calculate this at the same level of theory as your data\n",
@@ -376,19 +379,16 @@
"# property_unit_dict={'energy':'kcal/mol'},\n",
"# atomref=atomref\n",
"# )"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "markdown",
"source": [
"In our concrete case, we only have an MD trajectory of a single system. Therefore, we don't need to specify an atomref, since removing the average energy will work as well."
],
"metadata": {
"collapsed": false
}
},
"source": [
"In our concrete case, we only have an MD trajectory of a single system. Therefore, we don't need to specify an atomref, since removing the average energy will work as well."
]
},
{
"cell_type": "markdown",
@@ -447,17 +447,21 @@
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Using your data for training\n",
"We have now used the class `ASEAtomsData` to create a new `ase` database for our custom data. `schnetpack.data.ASEAtomsData` is a subclass of `pytorch.data.Dataset` and could be utilized for training models with `pytorch`. However, we use `pytorch-lightning` to conveniently handle the training procedure for us. This requires us to wrap the dataset in a [LightningDataModule](https://lightning.ai/docs/pytorch/stable/data/datamodule.html). We provide a general purpose `AtomsDataModule` for atomic systems in `schnetpack.data.datamodule.AtomsDataModule`. The data module will handle the unit conversion, splitting, batching and the preprocessing of the data with `transforms`. We can instantiate the data module for our custom dataset with:"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"is_executing": true
},
"outputs": [],
"source": [
"import schnetpack as spk\n",
@@ -480,27 +484,23 @@
")\n",
"custom_data.prepare_data()\n",
"custom_data.setup()"
],
"metadata": {
"collapsed": false,
"is_executing": true
}
]
},
{
"cell_type": "markdown",
"source": [
"Please note that for the general case it makes sense to use your dataset within the command line interface (see: [here](https://schnetpack.readthedocs.io/en/latest/userguide/configs.html)). For some benchmark datasets we provide data modules with download functions and more utilities in `schnetpack.data.datasets`. Further examples on how to use the data modules are provided in the following sections.\n"
],
"metadata": {
"collapsed": false
}
},
"source": [
"Please note that for the general case it makes sense to use your dataset within the command line interface (see: [here](https://schnetpack.readthedocs.io/en/latest/userguide/configs.html)). For some benchmark datasets we provide data modules with download functions and more utilities in `schnetpack.data.datasets`. Further examples on how to use the data modules are provided in the following sections.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:spkdev] *",
"display_name": "Python 3",
"language": "python",
"name": "conda-env-spkdev-py"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -512,7 +512,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.11"
"version": "3.12.0"
},
"nbsphinx": {
"execute": "never"
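For the atomref convention that this notebook describes (a dict keyed by property name, with per-element reference values in a list indexed by atomic number), a minimal standalone sketch could look like the following; the numeric values are placeholders for illustration, not real reference energies:

```python
# Sketch of the atomref metadata layout described in the notebook.
# Values are illustrative placeholders, not real reference energies.

n_elements = 100  # list index == atomic number, so index 0 stays unused

# One list per property; list[Z] holds the reference value for element Z.
atomref = {"energy": [0.0] * n_elements}
atomref["energy"][1] = -313.5    # hypothetical single-atom energy for H (Z=1)
atomref["energy"][6] = -23622.0  # hypothetical single-atom energy for C (Z=6)

# The dict can then be passed as metadata when creating the database,
# e.g. ASEAtomsData.create(..., atomref=atomref) as in the commented cell above.
print(atomref["energy"][1])
```

Because the list index is the atomic number, the lists must be long enough to cover the heaviest element in the dataset.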
9 changes: 5 additions & 4 deletions pyproject.toml
@@ -19,15 +19,15 @@ authors = [
description = "SchNetPack - Deep Neural Networks for Atomistic Systems"
readme = "README.md"
license = { file="LICENSE" }
requires-python = ">=3.10"
requires-python = "==3.12"
dependencies = [
"numpy>=2.0.0",
"sympy<=1.12",
"sympy>=1.13",
"ase>=3.21",
"h5py",
"pyyaml",
"hydra-core>=1.1.0",
"torch>=1.9",
"torch>=2.5.0",
"pytorch_lightning>=2.0.0",
"torchmetrics",
"hydra-colorlog>=1.1.0",
@@ -41,7 +41,8 @@ dependencies = [
"tqdm",
"pre-commit",
"black",
"protobuf"
"protobuf",
"progressbar"
]

[project.optional-dependencies]
9 changes: 9 additions & 0 deletions src/schnetpack/configs/data/qm7x.yaml
@@ -0,0 +1,9 @@
defaults:
- custom

_target_: schnetpack.datasets.QM7X

datapath: ${run.data_dir}/qm7x.db # data_dir is specified in train.yaml
batch_size: 100
num_train: 5550
num_val: 700
45 changes: 35 additions & 10 deletions src/schnetpack/data/atoms.py
@@ -176,6 +176,7 @@ def add_systems(
self,
property_list: List[Dict[str, Any]],
atoms_list: Optional[List[Atoms]] = None,
key_value_list: Optional[List[Dict[str, Any]]] = None,
):
pass

@@ -463,6 +464,7 @@ def add_systems(
self,
property_list: List[Dict[str, Any]],
atoms_list: Optional[List[Atoms]] = None,
key_value_list: Optional[List[Dict[str, Any]]] = None,
):
"""
Add atoms data to the dataset.
@@ -475,14 +477,31 @@
order as corresponding list of `atoms`.
Keys have to match the `available_properties` of the dataset
plus additional structure properties, if atoms is None.
key_value_list: Properties as list of key-value pairs in the same
order as corresponding list of `atoms`.
Keys have to match the `available_properties` of the dataset
plus additional structure properties, if atoms is None.
"""
if atoms_list is None:
atoms_list = [None] * len(property_list)

for at, prop in zip(atoms_list, property_list):
self._add_system(self.conn, at, **prop)
# for at, prop in zip(atoms_list, property_list):
# self._add_system(self.conn, at, **prop)
for at, prop, key_val in zip(atoms_list, property_list, key_value_list):
self._add_system(
self.conn,
at,
key_val,
**prop,
)

def _add_system(self, conn, atoms: Optional[Atoms] = None, **properties):
def _add_system(
self,
conn,
atoms: Optional[Atoms] = None,
key_val: Optional[Dict[str, Any]] = None,
**properties,
):
"""Add systems to DB"""
if atoms is None:
try:
@@ -499,12 +518,7 @@ def _add_system(self, conn, atoms: Optional[Atoms] = None, **properties):
# add available properties to database
valid_props = set().union(
conn.metadata["_property_unit_dict"].keys(),
[
structure.Z,
structure.R,
structure.cell,
structure.pbc,
],
[structure.Z, structure.R, structure.cell, structure.pbc],
)
for prop in properties:
if prop not in valid_props:
@@ -514,11 +528,22 @@
+ f"provided together with its unit when calling "
+ f"AseAtomsData.create()."
)
for key in key_val:
if key not in valid_props:
logger.warning(
f"Property `{key}` is not a defined property for this dataset and "
+ f"will be ignored. If it should be included, it has to be "
+ f"provided together with its unit when calling "
+ f"AseAtomsData.create()."
)

data = {}
for pname in conn.metadata["_property_unit_dict"].keys():
try:
data[pname] = properties[pname]
if pname in properties:
data[pname] = properties[pname]
if pname in key_val:
data[pname] = key_val[pname]
except:
raise AtomsDataError("Required property missing:" + pname)

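The intent of the new `key_value_list` path in `_add_system` (merge explicit properties with an optional key-value dict, warning about and skipping keys the dataset does not define) can be sketched in isolation. `merge_system_data` and all sample values here are illustrative stand-ins, not part of the schnetpack API:

```python
import warnings
from typing import Any, Dict, Optional, Set


def merge_system_data(
    valid_props: Set[str],
    key_val: Optional[Dict[str, Any]] = None,
    **properties: Any,
) -> Dict[str, Any]:
    """Combine keyword properties with an optional key-value dict,
    warning about and dropping keys the dataset does not define."""
    key_val = key_val or {}  # guard against the None default
    data: Dict[str, Any] = {}
    for source in (properties, key_val):
        for name, value in source.items():
            if name not in valid_props:
                warnings.warn(f"Property `{name}` is not defined and will be ignored.")
                continue
            data[name] = value  # key_val entries overwrite duplicate keys, as in the diff
    return data


merged = merge_system_data(
    {"energy", "forces"},
    key_val={"energy": -1.0},
    forces=[[0.0, 0.0, 0.0]],
    charge=0,  # not a valid property -> warned about and dropped
)
print(merged)
```

Note that in the diff itself, `add_systems` zips `key_value_list` directly despite its `None` default, so the sketch substitutes an empty dict before iterating.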
4 changes: 2 additions & 2 deletions src/schnetpack/data/loader.py
@@ -3,7 +3,7 @@

from typing import Optional, Sequence
from torch.utils.data import Dataset, Sampler
from torch.utils.data.dataloader import _collate_fn_t, T_co
from torch.utils.data.dataloader import _collate_fn_t, _T_co

import schnetpack.properties as structure

@@ -63,7 +63,7 @@ class AtomsLoader(DataLoader):

def __init__(
self,
dataset: Dataset[T_co],
dataset: Dataset[_T_co],
batch_size: Optional[int] = 1,
shuffle: bool = False,
sampler: Optional[Sampler[int]] = None,
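The switch from `T_co` to `_T_co` tracks newer torch releases, where the covariant element TypeVar of `Dataset` became a private name. The typing pattern itself is plain standard-library machinery and can be sketched without torch; the `Dataset` class below is a stand-in for illustration, not the torch class:

```python
from typing import Generic, List, TypeVar

# Covariant: a Dataset[Subclass] can be used where Dataset[Baseclass] is expected.
_T_co = TypeVar("_T_co", covariant=True)


class Dataset(Generic[_T_co]):
    """Minimal stand-in for torch.utils.data.Dataset[_T_co]."""

    def __init__(self, items: List):
        self._items = list(items)

    def __getitem__(self, index: int) -> _T_co:
        # The covariant TypeVar may only appear in return positions like this one.
        return self._items[index]

    def __len__(self) -> int:
        return len(self._items)


ds: "Dataset[int]" = Dataset([1, 2, 3])
print(len(ds), ds[0])
```

A downside of the rename is that `_T_co` is a private torch symbol, so importing it couples this module to torch internals that may move again.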
2 changes: 1 addition & 1 deletion src/schnetpack/data/splitting.py
@@ -3,7 +3,7 @@
import torch
import numpy as np

__all__ = ["SplittingStrategy", "RandomSplit", "SubsamplePartitions"]
__all__ = ["SplittingStrategy", "RandomSplit", "SubsamplePartitions", "GroupSplit"]


def absolute_split_sizes(dsize: int, split_sizes: List[int]) -> List[int]:
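`GroupSplit` is newly exported here, but its implementation is not part of this hunk, so the following is only a guess at the general idea behind group-aware splitting (every sample of a group lands in the same partition); it is not the actual schnetpack class:

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_split(
    group_ids: Sequence[int],
    num_train_groups: int,
    num_val_groups: int,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Split sample indices so that no group is shared between partitions."""
    by_group: Dict[int, List[int]] = defaultdict(list)
    for idx, gid in enumerate(group_ids):
        by_group[gid].append(idx)

    groups = sorted(by_group)
    random.Random(seed).shuffle(groups)  # seeded shuffle for reproducibility

    train_g = groups[:num_train_groups]
    val_g = groups[num_train_groups:num_train_groups + num_val_groups]
    test_g = groups[num_train_groups + num_val_groups:]

    def collect(gs: List[int]) -> List[int]:
        return [i for g in gs for i in by_group[g]]

    return collect(train_g), collect(val_g), collect(test_g)


# Six samples in three groups; each partition receives whole groups only.
train, val, test = group_split([0, 0, 1, 1, 2, 2], num_train_groups=1, num_val_groups=1)
```

Grouping by, say, molecular formula prevents near-duplicate conformations of one system from leaking between train and test.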
1 change: 1 addition & 0 deletions src/schnetpack/datasets/__init__.py
@@ -7,3 +7,4 @@
from .materials_project import *
from .omdb import *
from .tmqm import *
from .qm7x import *