SMILES Augmentation

Aditya Khedekar edited this page Dec 4, 2025 · 3 revisions

How the SMILES augmentation works

The SMILES augmentation procedure generates alternative but chemically equivalent representations by randomizing both atom order and fragment order. Each input SMILES is first converted to an RDKit molecule and split into its disconnected fragments (e.g., "CCO.CN" → "CCO" and "CN"). For each fragment, the atom indices are shuffled, and new SMILES strings are generated by repeatedly drawing a starting atom index from this shuffled list and letting RDKit produce a randomized traversal from that atom. When a molecule contains multiple fragments, the augmentation also permutes the fragment order and combines randomized variants across fragments (e.g., "CCO" + "CN" → "CCO.CN" or "CN.CCO"). The resulting augmented SMILES are therefore reordered string forms that still represent exactly the same underlying molecular structure. See the snippet of augmented ChEBI SMILES below.
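The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function names `fragment_variants` and `augment_smiles`, the `seed` parameter, and the bounded retry count are assumptions made for the example. Only the RDKit calls (`Chem.MolFromSmiles`, `Chem.GetMolFrags`, `Chem.MolToSmiles` with `rootedAtAtom` and `doRandom`) are real library functions.

```python
import random

from rdkit import Chem


def fragment_variants(frag, max_variants, rng):
    """Collect up to max_variants distinct randomized SMILES for one fragment."""
    atom_indices = list(range(frag.GetNumAtoms()))
    rng.shuffle(atom_indices)
    variants = []
    for root in atom_indices:
        # doRandom=True asks RDKit for a randomized traversal rooted at `root`.
        # RDKit uses its own random source here, so `rng` only controls
        # the order in which root atoms are tried.
        smi = Chem.MolToSmiles(frag, rootedAtAtom=root, canonical=False, doRandom=True)
        if smi not in variants:
            variants.append(smi)
        if len(variants) >= max_variants:
            break
    return variants


def augment_smiles(smiles, max_variants=5, seed=0):
    """Return up to max_variants distinct, chemically equivalent SMILES strings."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    rng = random.Random(seed)
    # One pool of randomized strings per disconnected fragment.
    pools = [fragment_variants(frag, max_variants, rng)
             for frag in Chem.GetMolFrags(mol, asMols=True)]
    augmented = []
    # Bounded number of attempts (an assumption of this sketch), because
    # small molecules have only a few distinct string forms.
    for _ in range(20 * max_variants):
        parts = [rng.choice(pool) for pool in pools]
        rng.shuffle(parts)  # permute the fragment order
        candidate = ".".join(parts)
        if candidate not in augmented:
            augmented.append(candidate)
        if len(augmented) >= max_variants:
            break
    return augmented


if __name__ == "__main__":
    for variant in augment_smiles("CCO.CN"):
        print(variant)
```

Because the traversal is randomized, the exact strings differ between runs, but each one should canonicalize back to the same molecule as the input.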

Example Input SMILES: CCO.CN

Possible augmentations:

  • Randomized fragments: "OCC", "CCO", "NC", "CN"
  • Permuted combinations: "OCC.CN", "NC.OCC", "CCO.NC", etc.

These variations preserve the chemistry but diversify the string forms the model sees.
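To make the example concrete, the sketch below enumerates every combination that the listed fragment variants could produce. It uses only the Python standard library; the helper name `combine_fragment_variants` is hypothetical. The actual pipeline may sample combinations rather than enumerate them all.

```python
from itertools import permutations, product


def combine_fragment_variants(variant_pools):
    """Enumerate every fragment ordering combined with every choice of variant."""
    combined = set()
    for order in permutations(range(len(variant_pools))):
        pools_in_order = [variant_pools[i] for i in order]
        for choice in product(*pools_in_order):
            combined.add(".".join(choice))
    return sorted(combined)


# Randomized fragment strings from the example above.
pools = [["OCC", "CCO"], ["NC", "CN"]]
all_forms = combine_fragment_variants(pools)
print(len(all_forms))
print(all_forms)
```

For this input there are eight distinct strings (two fragment orders times two variants per fragment), including "OCC.CN", "NC.OCC", and "CCO.NC" from the list above.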

How to Use SMILES Augmentation

To train your model with augmented SMILES, pass the following two additional arguments in the Lightning CLI config:

  1. --data.augment_smiles=True

    This flag tells the data pipeline to generate augmented SMILES if they haven’t been created yet and to use them during training.

  2. --data.aug_smiles_variations=5

    This argument specifies the maximum number of unique augmented SMILES to generate per original SMILES string. For example, if 5 is specified but only 3 unique augmentations are possible, only 3 will be generated for that SMILES.

Note: A separate file is created for the augmented SMILES data with the name aug_data_var{aug_smiles_variations}.pkl.
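The two behaviours described above, the cap on unique augmentations and the naming of the output file, can be illustrated with a short sketch. The helper names `cap_unique` and `augmented_file_name` are hypothetical and exist only for this example. Only the file name pattern is taken from the note above.

```python
def cap_unique(candidates, max_variations):
    """Keep at most max_variations distinct strings, preserving first-seen order."""
    unique = list(dict.fromkeys(candidates))
    return unique[:max_variations]


def augmented_file_name(aug_smiles_variations):
    """Build the file name following the pattern aug_data_var{aug_smiles_variations}.pkl."""
    return f"aug_data_var{aug_smiles_variations}.pkl"


# Only two distinct forms exist among these candidates, so asking for 5 yields 2.
candidates = ["F[Be]F", "[Be](F)F", "F[Be]F"]
print(cap_unique(candidates, 5))
print(augmented_file_name(5))
```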

Snippet of augmented SMILES

A small subset of the augmented ChEBI-241 SMILES data, generated with a maximum of 5 augmentations per SMILES:

ident  name                         SMILES
28741  sodium fluoride              [F-].[Na+]
28741  sodium fluoride              [Na+].[F-]
28741  sodium fluoride              [F-].[Na+]
32129  diamminesilver(1+) fluoride  [F-].[H][N]([H])([H])[Ag+][N]([H])([H])[H]
32129  diamminesilver(1+) fluoride  [F-].N->[Ag+]<-N
32129  diamminesilver(1+) fluoride  [Ag+](<-N)<-N.[F-]
32129  diamminesilver(1+) fluoride  [F-].[Ag+](<-N)<-N
32129  diamminesilver(1+) fluoride  N->[Ag+]<-N.[F-]
30340  silver monofluoride          [F-].[Ag+]
30340  silver monofluoride          [F-].[Ag+]
30340  silver monofluoride          [Ag+].[F-]
51990  tetrabutylammonium fluoride  [F-].CCCC[N+](CCCC)(CCCC)CCCC
51990  tetrabutylammonium fluoride  [F-].CCCC[N+](CCCC)(CCCC)CCCC
51990  tetrabutylammonium fluoride  CCCC[N+](CCCC)(CCCC)CCCC.[F-]
51990  tetrabutylammonium fluoride  C(C)CC[N+](CCCC)(CCCC)CCCC.[F-]
51990  tetrabutylammonium fluoride  [F-].C(C)CC[N+](CCCC)(CCCC)CCCC
49499  beryllium difluoride         F[Be]F
49499  beryllium difluoride         [Be](F)F
49499  beryllium difluoride         F[Be]F
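One way to sanity-check rows like these is to canonicalize each SMILES with RDKit and confirm that all rows sharing an ident collapse to a single canonical form. The sketch below does this for a few rows copied from the table. The helper name `canonical_forms_by_ident` is hypothetical, and this check is an illustration rather than part of the documented pipeline.

```python
from collections import defaultdict

from rdkit import Chem

# A few (ident, SMILES) rows copied from the table above.
rows = [
    (28741, "[F-].[Na+]"),
    (28741, "[Na+].[F-]"),
    (51990, "[F-].CCCC[N+](CCCC)(CCCC)CCCC"),
    (51990, "C(C)CC[N+](CCCC)(CCCC)CCCC.[F-]"),
    (49499, "F[Be]F"),
    (49499, "[Be](F)F"),
]


def canonical_forms_by_ident(rows):
    """Map each ident to the set of canonical SMILES its rows reduce to."""
    forms = defaultdict(set)
    for ident, smiles in rows:
        forms[ident].add(Chem.CanonSmiles(smiles))
    return dict(forms)


forms = canonical_forms_by_ident(rows)
for ident, canon in forms.items():
    # One canonical form per ident means the variants are equivalent.
    print(ident, len(canon), sorted(canon))
```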
