Description
Currently, canonical SMILES are being used to ... (a simplified sketch of this shared keying pattern follows the list)
- split the training set into cross-validation folds (create_training_sets.py)
- tabulate the frequency with which unique molecules appear in any given train_seed/sample_seed output (inner_tabulate_molecules.py)
- tabulate the frequency with which unique molecules appear across train_seed/sample_seed outputs (inner_collect_tabulated_molecules.py)
- average molecules across CV folds (inner_process_tabulated_molecules.py)
- remove training set molecules from model output (inner_collect_tabulated_molecules.py, inner_process_tabulated_molecules.py)
- calculate top-k accuracy (inner_write_structural_prior_CV)
(Did I miss any places?)
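
For reference, the pattern these steps share today is roughly the sketch below (the function and variable names are hypothetical, not taken from the repo): everything is keyed on the RDKit canonical SMILES string.

```python
from collections import Counter

from rdkit import Chem

def tabulate_by_canonical_smiles(smiles_list):
    """Hypothetical sketch of the current keying: count unique molecules
    using their RDKit canonical SMILES as the dictionary key."""
    counts = Counter()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip unparseable SMILES
            continue
        counts[Chem.MolToSmiles(mol)] += 1  # canonical SMILES is the key
    return counts
```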
Occasionally, two or more distinct canonical SMILES can represent the same compound, due to a phenomenon called tautomerism. This can happen either in the training set (i.e. we might have two different canonical SMILES representing the same InChIKey in the training set) or in the model output (i.e. the model might generate multiple different SMILES that all map to the same InChIKey). The InChIKey is therefore a more robust key for all of the above operations.
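
As a concrete (hedged) illustration, a classic tautomer pair such as 2-hydroxypyridine / 2-pyridone should canonicalize to two different SMILES in RDKit, while the standard InChI's mobile-H layer is expected to fold both into a single InChIKey:

```python
from rdkit import Chem

# Two SMILES for what is chemically the same compound (a tautomer pair):
smiles_a = "Oc1ccccn1"     # 2-hydroxypyridine (hydroxy form)
smiles_b = "O=C1C=CC=CN1"  # 2-pyridone (keto form)

mol_a = Chem.MolFromSmiles(smiles_a)
mol_b = Chem.MolFromSmiles(smiles_b)

# The canonical SMILES differ, so SMILES-based keying sees two molecules ...
print(Chem.MolToSmiles(mol_a), Chem.MolToSmiles(mol_b))

# ... but the InChIKeys should match, so InChIKey-based keying sees one.
print(Chem.MolToInchiKey(mol_a), Chem.MolToInchiKey(mol_b))
```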
On the other hand, an InChIKey is a hashed identifier rather than a structure representation (i.e. we can't parse an InChIKey back into an RDKit molecule the way we can with a SMILES string), so we do need to keep track of one representative canonical SMILES per InChIKey in all of the above steps.
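
A minimal sketch of what that bookkeeping could look like (names are hypothetical; this is not meant as the final implementation): key the counts on InChIKey and record the first canonical SMILES seen for each key as its representative.

```python
from collections import Counter

from rdkit import Chem

def tabulate_by_inchikey(smiles_list):
    """Count unique molecules keyed by InChIKey, keeping one
    representative canonical SMILES per key for downstream use."""
    counts = Counter()
    representative = {}  # InChIKey -> representative canonical SMILES
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip unparseable SMILES
            continue
        key = Chem.MolToInchiKey(mol)
        counts[key] += 1
        # first canonical SMILES seen for this InChIKey is kept as its
        # representative (used when writing outputs, plotting, etc.)
        representative.setdefault(key, Chem.MolToSmiles(mol))
    return counts, representative

# Removing training-set molecules from model output would then become a set
# difference on InChIKeys rather than on canonical SMILES, e.g.:
# novel = {k: n for k, n in counts.items() if k not in train_inchikeys}
```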
@vineetbansal let me know if anything here is unclear.