Commit
1 parent 48f5a49, commit 850a845
Showing 1 changed file with 4 additions and 14 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,18 +1,8 @@ | ||
# Ring dataset | ||
# USPTO data | ||
The `cjhif_ring_formations.csv` file contains ring formation reactions extracted from the CJHIF dataset (https://github.com/jshmjs45/data_for_chem/tree/master). | ||
|
||
Each row corresponds to a separate reaction and includes: | ||
* Id | ||
* mapped_rxn - atom-mapped reaction (using rxnmapper: https://github.com/rxn4chemistry/rxnmapper) | ||
* confidence - confidence of the atom-mapping | ||
* Rxn - canonicalised reaction | ||
* Reactants (excluding reagents) | ||
* Product | ||
To retrain the models, split your reaction dataset into reactants and products, tokenize them (using the tokenization function from https://github.com/pschwllr/MolecularTransformer) and save the results here as separate files with one entry per line (following the format of the uspto and recent datasets). | ||
|
||
In our work the Ring dataset contained ring formation reactions extracted from CJHIF combined with heterocycle formation reactions from Pistachio (https://www.nextmovesoftware.com/pistachio.html). | ||
The full dataset was split into train, validation and test sets in an 80:10:10 ratio using the Fingerprint Splitter from DeepChem (https://deepchem.readthedocs.io/en/latest/api_reference/splitters.html#fingerprintsplitter) based on the reaction product. | ||
|
||
In our work the CJHIF-based dataset was combined with heterocycle formation reactions from Pistachio (https://www.nextmovesoftware.com/pistachio.html). | ||
To reproduce the results, combine `cjhif_ring_formations.csv` with canonicalised Pistachio reactions belonging to the "Heterocycle formations" superclass (class 4) and drop duplicates. | ||
The full dataset then needs to be split into train, validation and test sets in an 80:10:10 ratio. In our case this was done using the Fingerprint Splitter from DeepChem (https://deepchem.readthedocs.io/en/latest/api_reference/splitters.html#fingerprintsplitter) based on the reaction product. | ||
The reactions then need to be split into reactants and products, tokenized (using the tokenization function from https://github.com/pschwllr/MolecularTransformer) and saved as separate files with one entry per line (following the format of the uspto and recent datasets). | ||
|
||
To train the models using just the CJHIF dataset, follow the approach described above, but start by splitting the `cjhif_ring_formations.csv` dataset into train, validation and test sets. |
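
The preparation steps described in the README text (drop duplicates, split each reaction into reactants and product, tokenize, write one entry per line) could be sketched roughly as follows. This is a minimal illustration, not the authors' script: the regular expression is modeled on the SMILES tokenizer distributed with MolecularTransformer and should be checked against that repository, and the function names (`tokenize_smiles`, `split_reaction`, `prepare_pairs`, `write_lines`) are hypothetical. It assumes reaction SMILES of the form `reactants>reagents>product`, and it does not perform the train/validation/test split.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: pattern modeled on the tokenizer shipped with MolecularTransformer;
# verify against that repository before relying on it.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> str:
    """Return the SMILES string as space-separated tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Guard against characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES losslessly: {smiles!r}")
    return " ".join(tokens)


def split_reaction(rxn: str) -> Tuple[str, str]:
    """Split 'reactants>reagents>product' into (reactants, product), dropping reagents."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError(f"Expected exactly two '>' separators in: {rxn!r}")
    reactants, _reagents, product = parts
    return reactants, product


def prepare_pairs(reactions: Iterable[str]) -> List[Tuple[str, str]]:
    """Drop duplicate reactions (keeping first occurrences) and tokenize both sides."""
    unique = list(dict.fromkeys(reactions))
    pairs = []
    for rxn in unique:
        reactants, product = split_reaction(rxn)
        pairs.append((tokenize_smiles(reactants), tokenize_smiles(product)))
    return pairs


def write_lines(path: str, lines: Iterable[str]) -> None:
    """Write one entry per line; the file name is up to the caller."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
```

For each split, the first elements of the pairs would go to one file and the second elements to another, for example `write_lines("src-train.txt", [src for src, _ in pairs])`, where the file name is a placeholder and should follow whatever naming the existing uspto and recent datasets use.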