Description
Summary
When using SAFEConverter.encoder() with scaffolds containing attachment points [*], the attachment points are removed and replaced with ring closures, creating complete closed molecules. This breaks scaffold decoration because there's no attachment site left to decorate.
Environment
- SAFE version: (from finetune_safe fork based on datamol-io/safe)
- Python version: 3.10
- Usage context: RL fine-tuning with scaffold-constrained generation
Reproduction
from safe import SAFEConverter
import safe as sf
encoder = SAFEConverter()
# Purine scaffold with attachment point
scaffold_smiles = 'O=c1[nH]cnc2nc([*])ccc12'
# Encode with fragmentation disabled (as done in scaffold_decoration)
with sf.utils.attr_as(encoder, 'slicer', None):
    encoded = encoder.encoder(scaffold_smiles, allow_empty=True)
print(f'Original: {scaffold_smiles}') # O=c1[nH]cnc2nc([*])ccc12
print(f'Encoded: {encoded}') # O=c1[nH]cnc2nc3ccc12 ← [*] became 3!
print(f'Decoded: {encoder.decoder(encoded)}') # O=c1[nH]cnc2ncccc12
Output:
Original: O=c1[nH]cnc2nc([*])ccc12 ← Has [*] attachment point
Encoded: O=c1[nH]cnc2nc3ccc12 ← [*] replaced with ring closure 3
Decoded: O=c1[nH]cnc2ncccc12 ← Complete closed molecule
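One detail in the output above may be worth checking: the encoded string contains a digit (3) that appears only once, which in plain SMILES terms is a ring-closure label that is never closed. The sketch below uses only the standard library and is not part of the safe API; the helper name dangling_ring_labels is hypothetical. It assumes the digits in the encoded string follow ordinary SMILES ring-closure notation and simply lists the labels that occur an odd number of times.

```python
import re
from collections import Counter

# Bracket atoms such as [nH] or [*]; removed first so that digits inside
# brackets (isotopes, charges) are not mistaken for ring-closure labels.
_BRACKET_ATOM = re.compile(r"\[[^\]]*\]")
# Two-digit (%12) or single-digit ring-closure labels.
_RING_LABEL = re.compile(r"%\d{2}|\d")


def dangling_ring_labels(smiles_like: str) -> list[str]:
    """Return ring-closure labels that occur an odd number of times."""
    stripped = _BRACKET_ATOM.sub("", smiles_like)
    counts = Counter(_RING_LABEL.findall(stripped))
    return sorted(label for label, n in counts.items() if n % 2 == 1)


# Encoded strings copied from the outputs reported in this issue.
print(dangling_ring_labels("O=c1[nH]cnc2nc3ccc12"))  # ['3']
print(dangling_ring_labels("c1cc2ccc13"))            # ['2', '3']
# The original scaffold has no unpaired labels.
print(dangling_ring_labels("c1ccc([*])cc1"))         # []
```

If the number of unpaired labels matches the number of [*] atoms in the input, the attachment information may still be present in the encoded prefix and only be dropped when decoding. That reading is an interpretation of the outputs, not confirmed behavior of the library.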
Additional Test Cases
Simple benzene:
scaffold = 'c1ccc([*])cc1'
# Encoded: c1ccc2cc1 ← [*] removed
# Decoded: c1ccccc1 ← Closed ring
Multiple attachment points:
scaffold = 'c1cc([*])ccc1[*]' # 2 attachment points
# Encoded: c1cc2ccc13 ← Both [*] became ring closures
# Decoded: c1ccccc1 ← All attachment points lost
Impact
This issue affects the scaffold_decoration() method in safe/sample.py:
- scaffold_decoration() calls _completion() (line 662)
- _completion() encodes the scaffold using the same parameters (lines 325-332):
  with sf.utils.attr_as(self.safe_encoder, "slicer", None):
      encoded_fragment = self.safe_encoder.encoder(
          fragment,
          canonical=False,
          randomize=True,
          constraints=None,
          allow_empty=True,  # ← Same parameter
          seed=new_seed,
      )
- The encoded scaffold (with attachment points removed) is used as a generation prefix
- The model generates from a complete closed molecule instead of decorating attachment points
Result: When using scaffold decoration for RL fine-tuning, the model generates minimal molecules (e.g., methane) instead of scaffold-containing structures because the scaffold is already complete.
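A guard along the following lines could catch this before a scaffold reaches training. It is a minimal sketch, not part of safe: count_attachment_points and round_trip_report are hypothetical helpers, the encoder and decoder are passed in as plain callables, and the stand-in dictionaries only replay the strings reported in this issue. The regular expression counts bracketed dummy atoms such as [*], [1*] or [*:1]; a bare * is not counted.

```python
import re
from typing import Callable

# Bracketed dummy atoms such as [*], [1*] or [*:1]; a bare `*` is not counted.
_ATTACHMENT = re.compile(r"\[\d*\*[^\]]*\]")


def count_attachment_points(smiles: str) -> int:
    """Count bracketed dummy atoms in a SMILES string."""
    return len(_ATTACHMENT.findall(smiles))


def round_trip_report(
    scaffold: str,
    encode: Callable[[str], str],
    decode: Callable[[str], str],
) -> dict:
    """Encode then decode a scaffold and compare attachment-point counts."""
    encoded = encode(scaffold)
    decoded = decode(encoded)
    expected = count_attachment_points(scaffold)
    found = count_attachment_points(decoded)
    return {
        "scaffold": scaffold,
        "encoded": encoded,
        "decoded": decoded,
        "expected": expected,
        "found": found,
        "preserved": expected == found,
    }


# Stand-ins that replay the outputs reported in this issue (not real encoders).
REPORTED_ENCODED = {
    "c1ccc([*])cc1": "c1ccc2cc1",
    "c1cc([*])ccc1[*]": "c1cc2ccc13",
}
REPORTED_DECODED = {
    "c1ccc2cc1": "c1ccccc1",
    "c1cc2ccc13": "c1ccccc1",
}

report = round_trip_report(
    "c1cc([*])ccc1[*]",
    REPORTED_ENCODED.__getitem__,
    REPORTED_DECODED.__getitem__,
)
print(report["expected"], report["found"], report["preserved"])  # 2 0 False
```

With the real converter, the two callables would wrap encoder.encoder(..., allow_empty=True) inside the attr_as context and encoder.decoder, as in the reproduction above.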
Questions
- Is this the intended behavior for scaffold_decoration()?
- How should scaffolds with [*] be encoded for use as generation prefixes while preserving attachment points?
- Is there a way to preserve attachment points during SAFE encoding?
- Should we use a different approach for scaffold-constrained generation in RL settings?
Use Case
We're using SAFE for RL-based molecular optimization with scaffold constraints:
- User provides scaffold SMILES with [*] attachment points
- Scaffold should be SAFE-encoded for use as training prefix
- Model should learn to generate molecules that start with the scaffold and add decorations at attachment points
- Current behavior: Model generates from closed scaffolds, produces minimal molecules
Is scaffold decoration designed to work differently than we expect, or is this a bug in the encoding logic?
Test Script
I've created a comprehensive test script that demonstrates the issue:
#!/usr/bin/env python
"""Test SAFE scaffold encoding with attachment points."""
from safe import SAFEConverter
import safe as sf
encoder = SAFEConverter()
test_cases = [
'O=c1[nH]cnc2nc([*])ccc12', # Purine
'c1ccc([*])cc1', # Benzene
'c1cc([*])ccc1[*]', # Dual attachment
]
for scaffold in test_cases:
    with sf.utils.attr_as(encoder, 'slicer', None):
        encoded = encoder.encoder(scaffold, allow_empty=True)
    decoded = encoder.decoder(encoded)
    print(f"Original: {scaffold}")
    print(f"Encoded: {encoded}")
    print(f"Decoded: {decoded}")
    print(f"[*] preserved: {('[*]' in decoded)}")
    print()
All test cases lose their attachment points during encoding.