SMILES Augmentation
The SMILES augmentation procedure generates alternative but chemically equivalent representations by randomizing both atom order and fragment order. Each input SMILES is first converted to an RDKit molecule and split into its disconnected fragments (e.g., "CCO.CN" → "CCO" and "CN"). For each fragment, the atom indices are shuffled, and new SMILES strings are generated by repeatedly drawing a starting atom index from this shuffled list and letting RDKit produce a randomized traversal from that atom. When a molecule contains multiple fragments, the augmentation also permutes the fragment order and combines randomized variants across fragments (e.g., "CCO" + "CN" → "CCO.CN" or "CN.CCO"). The resulting augmented SMILES are reordered string forms that still represent exactly the same underlying molecular structure. See the table of augmented ChEBI SMILES below for a sample.
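The per-fragment step described above could look roughly like the following. This is a minimal sketch, not the project's actual implementation: the function name `randomized_fragments`, its parameters, and the retry limit are assumptions made for illustration. It relies only on standard RDKit calls (`Chem.MolFromSmiles`, `Chem.GetMolFrags`, `Chem.MolToSmiles` with `rootedAtAtom` and `doRandom`).

```python
import random


def randomized_fragments(smiles, max_variants, seed=None):
    """Return one list of unique randomized SMILES per disconnected fragment.

    Hypothetical helper for illustration; max_variants is assumed to be >= 1.
    """
    # Imported lazily so this module still loads where RDKit is not installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse {smiles!r}")

    # The seed only controls the shuffling and the choice of start atom;
    # RDKit's randomized traversal uses its own random source.
    rng = random.Random(seed)
    per_fragment = []
    for frag in Chem.GetMolFrags(mol, asMols=True):
        atom_indices = list(range(frag.GetNumAtoms()))
        rng.shuffle(atom_indices)

        variants = []
        # Bounded number of attempts (an arbitrary choice): small fragments
        # may have fewer unique strings than requested.
        for _ in range(max_variants * 10):
            if len(variants) == max_variants:
                break
            start = rng.choice(atom_indices)
            smi = Chem.MolToSmiles(
                frag, rootedAtAtom=start, doRandom=True, canonical=False
            )
            if smi not in variants:
                variants.append(smi)
        per_fragment.append(variants)
    return per_fragment
```

Calling `randomized_fragments("CCO.CN", max_variants=5)` would return two lists, one for each fragment. Every string in a list should canonicalize back to the same fragment, which is a convenient sanity check that the chemistry is unchanged.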
Example
Input SMILES:
CCO.CN
Possible augmentations:
- Randomized fragments: "OCC", "CCO", "NC", "CN"
- Permuted combinations: "OCC.CN", "NC.OCC", "CCO.NC", etc.
These variations preserve the chemistry but diversify the string forms the model sees.
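The combination step in this example can be sketched in plain Python. The helper name `combine_fragment_variants` is hypothetical, and the sketch enumerates fragment orders deterministically, whereas the actual pipeline may sample them at random. It assumes `max_variations` is at least 1.

```python
from itertools import permutations, product


def combine_fragment_variants(per_fragment, max_variations):
    """Join per-fragment variants into full SMILES over all fragment orders.

    Returns at most `max_variations` unique strings, and fewer when fewer
    unique combinations exist.
    """
    combined = []
    # Try every ordering of the fragments ...
    for order in permutations(range(len(per_fragment))):
        # ... and every choice of one variant per fragment in that order.
        for choice in product(*(per_fragment[i] for i in order)):
            smi = ".".join(choice)
            if smi not in combined:
                combined.append(smi)
            if len(combined) == max_variations:
                return combined
    return combined


variants = combine_fragment_variants(
    [["OCC", "CCO"], ["NC", "CN"]], max_variations=5
)
print(variants)
# ['OCC.NC', 'OCC.CN', 'CCO.NC', 'CCO.CN', 'NC.OCC']
```

With two variants for each of the two fragments there are eight unique combinations in total, so a larger cap would simply return all eight.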
To train your model with augmented SMILES, pass the following two additional arguments in the Lightning CLI config:
- `--data.augment_smiles=True`: This flag tells the data pipeline to generate augmented SMILES if they haven’t been created yet and to use them during training.
- `--data.aug_smiles_variations=5`: This argument specifies the maximum number of unique augmented SMILES to generate per original SMILES string. For example, if 5 is specified but only 3 unique augmentations are possible, only 3 will be generated for that SMILES.
Note: A separate file is created for the augmented SMILES data, named `aug_data_var{aug_smiles_variations}.pkl`.
A small subset of the augmented ChEBI-241 SMILES data, with at most 5 augmentations per SMILES:
| ident | name | SMILES |
|---|---|---|
| 28741 | sodium fluoride | [F-].[Na+] |
| 28741 | sodium fluoride | [Na+].[F-] |
| 28741 | sodium fluoride | [F-].[Na+] |
| 32129 | diamminesilver(1+) fluoride | [F-].[H][N]([H])([H])[Ag+][N]([H])([H])[H] |
| 32129 | diamminesilver(1+) fluoride | [F-].N->[Ag+]<-N |
| 32129 | diamminesilver(1+) fluoride | [Ag+](<-N)<-N.[F-] |
| 32129 | diamminesilver(1+) fluoride | [F-].[Ag+](<-N)<-N |
| 32129 | diamminesilver(1+) fluoride | N->[Ag+]<-N.[F-] |
| 30340 | silver monofluoride | [F-].[Ag+] |
| 30340 | silver monofluoride | [F-].[Ag+] |
| 30340 | silver monofluoride | [Ag+].[F-] |
| 51990 | tetrabutylammonium fluoride | [F-].CCCC[N+](CCCC)(CCCC)CCCC |
| 51990 | tetrabutylammonium fluoride | [F-].CCCC[N+](CCCC)(CCCC)CCCC |
| 51990 | tetrabutylammonium fluoride | CCCC[N+](CCCC)(CCCC)CCCC.[F-] |
| 51990 | tetrabutylammonium fluoride | C(C)CC[N+](CCCC)(CCCC)CCCC.[F-] |
| 51990 | tetrabutylammonium fluoride | [F-].C(C)CC[N+](CCCC)(CCCC)CCCC |
| 49499 | beryllium difluoride | F[Be]F |
| 49499 | beryllium difluoride | [Be](F)F |
| 49499 | beryllium difluoride | F[Be]F |