Update README.md

duartegroup · May 9, 2024 · 850a845
1 parent 48f5a49
commit 850a845
Showing 1 changed file with 4 additions and 14 deletions.
diff --git a/data/ring_dataset/README.md b/data/ring_dataset/README.md
@@ -1,18 +1,8 @@
 # Ring dataset
-# USPTO data
-The `cjhif_ring_formations.csv` file contains ring formation reactions extracted from the CJHIF dataset (https://github.com/jshmjs45/data_for_chem/tree/master).
 
-Each row corresponds to a separate reaction and includes:
-* Id
-* mapped_rxn - atom-mapped reaction (using rxnmapper: https://github.com/rxn4chemistry/rxnmapper)
-* confidence - confidence of the atom-mapping
-* Rxn - canonicalised reaction
-* Reactants (excluding reagents)
-* Product
+To retrain the models, split your reaction dataset into reactants and products, tokenize them (using the tokenization function from https://github.com/pschwllr/MolecularTransformer) and save them in this directory as separate files with one entry per line (following the format of the uspto and recent datasets).
+
+In our work, the Ring dataset contained ring formation reactions extracted from CJHIF, combined with heterocycle formation reactions from Pistachio (https://www.nextmovesoftware.com/pistachio.html).
+The full dataset was split into train, validation and test sets in an 80:10:10 ratio using the Fingerprint Splitter from DeepChem (https://deepchem.readthedocs.io/en/latest/api_reference/splitters.html#fingerprintsplitter), based on the reaction product.
 
-In our work the CJHIF-based dataset was combined with heterocycle formation reactions from Pistachio (https://www.nextmovesoftware.com/pistachio.html).
-To reproduce the results, combine `cjhif_ring_formations.csv` with canonicalised Pistachio reactions belonging to the "Heterocycle formations" superclass (class 4) and drop duplicates.
-The full dataset then needs to be split into train, validation and test sets with an 80:10:10 ratio. In our case this was done using the Fingerprint Splitter from DeepChem (https://deepchem.readthedocs.io/en/latest/api_reference/splitters.html#fingerprintsplitter) based on the reaction product.
-The reactions then need to be split into reactants and products, tokenized (using function from https://github.com/pschwllr/MolecularTransformer) and saved as separate files with one entry per line (following the format of uspto and recent datasets).
 
-To train the models using just the CJHIF dataset, follow the approach described above but starting with splitting the `cjhif_ring_formations.csv` dataset into train, validation and test sets.
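
The new README text describes the preparation step only in words, so here is a minimal sketch of it: splitting each reaction SMILES into reactants and product, tokenizing both, and writing one entry per line. The regex follows the style of the SMILES tokenizer published in the MolecularTransformer README and should be checked against the upstream function before use. The names `tokenize_smiles`, `reaction_to_lines` and `write_split`, and the `src-<split>.txt` / `tgt-<split>.txt` file names, are assumptions made for illustration and are not taken from this repository.

```python
import re
from pathlib import Path

# Assumption: regex modelled on the SMILES tokenizer from the
# MolecularTransformer README; verify against the upstream version.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> str:
    """Return the SMILES string as space-separated tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Refuse to continue if any character was silently dropped.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES: {smiles!r}")
    return " ".join(tokens)


def reaction_to_lines(reaction: str) -> tuple[str, str]:
    """Split 'reactants>reagents>product' into tokenized (source, target)."""
    reactants, _reagents, product = reaction.split(">")
    return tokenize_smiles(reactants), tokenize_smiles(product)


def write_split(reactions, out_dir, split):
    """Write one tokenized entry per line for a single split.

    The file names below are hypothetical; match them to the files
    already present in the uspto and recent dataset directories.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pairs = [reaction_to_lines(rxn) for rxn in reactions]
    (out_dir / f"src-{split}.txt").write_text(
        "".join(src + "\n" for src, _ in pairs)
    )
    (out_dir / f"tgt-{split}.txt").write_text(
        "".join(tgt + "\n" for _, tgt in pairs)
    )
```

Calling `write_split(train_reactions, "data/ring_dataset", "train")` once per split would produce a reactant file and a product file whose lines correspond one to one. Any reagents between the two `>` characters are discarded in this sketch.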