compare by inchikey, not canonical smiles #131

@skinnider

Currently, canonical SMILES are being used to ...

  • split the training set into cross-validation folds (create_training_sets.py)
  • tabulate the frequency with which unique molecules appear in any given train_seed/sample_seed output (inner_tabulate_molecules.py)
  • tabulate the frequency with which unique molecules appear across train_seed/sample_seed outputs (inner_collect_tabulated_molecules.py)
  • average molecules across CV folds (inner_process_tabulated_molecules.py)
  • remove training set molecules from model output (inner_collect_tabulated_molecules.py, inner_process_tabulated_molecules.py)
  • calculate top-k accuracy (inner_write_structural_prior_CV)

(Did I miss any places?)

Occasionally, two or more canonical SMILES can represent the same compound, due to a phenomenon called tautomerism. This can happen either in the training set (i.e. we might have two different canonical SMILES with the same InChIKey in the training set) or in the model output (i.e. the model might generate multiple different SMILES that all have the same InChIKey). Therefore, the InChIKey is a more robust way to perform all of the above operations.
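
As a rough illustration of the tabulation steps, here is a minimal sketch that counts molecules by InChIKey rather than by SMILES string. The function names (`inchikey_from_smiles`, `tabulate_by_key`) and the `key_fn` parameter are hypothetical and not taken from the repository; the sketch also assumes an RDKit build with InChI support.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def inchikey_from_smiles(smiles: str) -> Optional[str]:
    """Return the InChIKey for a SMILES string, or None if it cannot be computed.

    Hypothetical helper; assumes RDKit was built with InChI support.
    """
    # Imported lazily so the rest of the sketch can run without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for unparseable SMILES
        return None
    return Chem.MolToInchiKey(mol) or None


def tabulate_by_key(
    smiles_list: Iterable[str],
    key_fn: Callable[[str], Optional[str]] = inchikey_from_smiles,
) -> Counter:
    """Count how often each molecule occurs, identifying molecules via key_fn."""
    counts: Counter = Counter()
    for smiles in smiles_list:
        key = key_fn(smiles)
        if key is not None:  # skip SMILES that could not be keyed
            counts[key] += 1
    return counts
```

With RDKit available, `tabulate_by_key(sampled_smiles)` would give one count per InChIKey, so different SMILES that map to the same key are pooled instead of being counted as separate molecules.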

On the other hand, the InChIKey is a hashed identifier rather than a structure we can read back (i.e. we can't parse an InChIKey into a molecule in RDKit in the same way that we can with SMILES), so we do need to keep track of one (representative) canonical SMILES per InChIKey in all of the above steps.
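
One possible way to carry a representative SMILES alongside each key is sketched below. Everything here is an assumption made for illustration: the function names, the "first SMILES seen wins" rule, and the round-robin fold assignment are not the logic in create_training_sets.py or the other scripts. `key_fn` stands for any function mapping a SMILES to its InChIKey (or None on failure).

```python
import random
from typing import Callable, Dict, Iterable, List, Optional, Set

KeyFn = Callable[[str], Optional[str]]


def representative_smiles(smiles_list: Iterable[str], key_fn: KeyFn) -> Dict[str, str]:
    """Map each key to the first SMILES seen for it (assumed tie-break rule)."""
    reps: Dict[str, str] = {}
    for smiles in smiles_list:
        key = key_fn(smiles)
        if key is not None:
            reps.setdefault(key, smiles)  # keep the earliest SMILES per key
    return reps


def split_into_folds(reps: Dict[str, str], n_folds: int, seed: int = 0) -> List[List[str]]:
    """Shuffle keys and deal them round-robin into folds.

    Splitting on keys means the same molecule cannot land in two folds.
    Each fold holds the representative SMILES for its keys.
    """
    keys = sorted(reps)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    for i, key in enumerate(keys):
        folds[i % n_folds].append(reps[key])
    return folds


def drop_training_molecules(sampled: Dict[str, str], train_keys: Set[str]) -> Dict[str, str]:
    """Remove sampled molecules whose key also occurs in the training set."""
    return {key: smi for key, smi in sampled.items() if key not in train_keys}
```

Keeping a dict of key to SMILES means comparisons happen on the key, while the SMILES stays available for anything that needs an actual molecule (e.g. parsing in RDKit).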

@vineetbansal let me know if anything here is unclear.
