Description
Currently, canonical SMILES are being used to ... (a simplified sketch of this shared keying pattern follows the list)
- split the training set into cross-validation folds (create_training_sets.py)
- tabulate the frequency with which unique molecules appear in any given train_seed/sample_seed output (inner_tabulate_molecules.py)
- tabulate the frequency with which unique molecules appear across train_seed/sample_seed outputs (inner_collect_tabulated_molecules.py)
- average molecules across CV folds (inner_process_tabulated_molecules.py)
- remove training set molecules from model output (inner_collect_tabulated_molecules.py, inner_process_tabulated_molecules.py)
- calculate top-k accuracy (inner_write_structural_prior_CV)
(Did I miss any places?)
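
For reference, the pattern these steps share today is roughly the sketch below (the function and variable names are hypothetical, not taken from the repo): everything is keyed on the RDKit canonical SMILES string.

```python
from collections import Counter

from rdkit import Chem

def tabulate_by_canonical_smiles(smiles_list):
    """Hypothetical sketch of the current keying: count unique molecules
    using their RDKit canonical SMILES as the dictionary key."""
    counts = Counter()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip unparseable SMILES
            continue
        counts[Chem.MolToSmiles(mol)] += 1  # canonical SMILES is the key
    return counts
```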
Occasionally, two or more distinct canonical SMILES can represent the same compound, due to a phenomenon called tautomerism. This can happen either in the training set (i.e. we might have two different canonical SMILES representing the same InChIKey in the training set) or in the model output (i.e. the model might generate multiple different SMILES that all map to the same InChIKey). The InChIKey is therefore a more robust key for all of the above operations.
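
As a concrete (hedged) illustration, a classic tautomer pair such as 2-hydroxypyridine / 2-pyridone should canonicalize to two different SMILES in RDKit, while the standard InChI's mobile-H layer is expected to fold both into a single InChIKey:

```python
from rdkit import Chem

# Two SMILES for what is chemically the same compound (a tautomer pair):
smiles_a = "Oc1ccccn1"     # 2-hydroxypyridine (hydroxy form)
smiles_b = "O=C1C=CC=CN1"  # 2-pyridone (keto form)

mol_a = Chem.MolFromSmiles(smiles_a)
mol_b = Chem.MolFromSmiles(smiles_b)

# The canonical SMILES differ, so SMILES-based keying sees two molecules ...
print(Chem.MolToSmiles(mol_a), Chem.MolToSmiles(mol_b))

# ... but the InChIKeys should match, so InChIKey-based keying sees one.
print(Chem.MolToInchiKey(mol_a), Chem.MolToInchiKey(mol_b))
```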
On the other hand, an InChIKey is a hashed identifier rather than a structure representation (i.e. we can't parse an InChIKey back into an RDKit molecule the way we can with a SMILES string), so we do need to keep track of one representative canonical SMILES per InChIKey in all of the above steps.
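
A minimal sketch of what that bookkeeping could look like (names are hypothetical; this is not meant as the final implementation): key the counts on InChIKey and record the first canonical SMILES seen for each key as its representative.

```python
from collections import Counter

from rdkit import Chem

def tabulate_by_inchikey(smiles_list):
    """Count unique molecules keyed by InChIKey, keeping one
    representative canonical SMILES per key for downstream use."""
    counts = Counter()
    representative = {}  # InChIKey -> representative canonical SMILES
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip unparseable SMILES
            continue
        key = Chem.MolToInchiKey(mol)
        counts[key] += 1
        # first canonical SMILES seen for this InChIKey is kept as its
        # representative (used when writing outputs, plotting, etc.)
        representative.setdefault(key, Chem.MolToSmiles(mol))
    return counts, representative

# Removing training-set molecules from model output would then become a set
# difference on InChIKeys rather than on canonical SMILES, e.g.:
# novel = {k: n for k, n in counts.items() if k not in train_inchikeys}
```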
@vineetbansal let me know if anything here is unclear.