Scaffold encoding removes [*] attachment points, breaking scaffold decoration #67

@naglemi

Summary

When SAFEConverter.encoder() is called on a scaffold containing attachment points [*], each [*] is replaced with a ring-closure digit, and decoding the result yields a complete, closed molecule. This breaks scaffold decoration because no attachment site is left to decorate.

Environment

  • SAFE version: (from finetune_safe fork based on datamol-io/safe)
  • Python version: 3.10
  • Usage context: RL fine-tuning with scaffold-constrained generation

Reproduction

from safe import SAFEConverter
import safe as sf

encoder = SAFEConverter()

# Purine-like fused bicyclic scaffold with one attachment point
scaffold_smiles = 'O=c1[nH]cnc2nc([*])ccc12'

# Encode with fragmentation disabled (as done in scaffold_decoration)
with sf.utils.attr_as(encoder, 'slicer', None):
    encoded = encoder.encoder(scaffold_smiles, allow_empty=True)

print(f'Original:  {scaffold_smiles}')  # O=c1[nH]cnc2nc([*])ccc12
print(f'Encoded:   {encoded}')           # O=c1[nH]cnc2nc3ccc12  ← [*] became 3!
print(f'Decoded:   {encoder.decoder(encoded)}')  # O=c1[nH]cnc2ncccc12

Output:

Original:  O=c1[nH]cnc2nc([*])ccc12  ← Has [*] attachment point
Encoded:   O=c1[nH]cnc2nc3ccc12      ← [*] replaced with ring closure 3
Decoded:   O=c1[nH]cnc2ncccc12       ← Complete closed molecule

Additional Test Cases

Simple benzene:

scaffold = 'c1ccc([*])cc1'
# Encoded:  c1ccc2cc1    ← [*] removed
# Decoded:  c1ccccc1     ← Closed ring

Multiple attachment points:

scaffold = 'c1cc([*])ccc1[*]'  # 2 attachment points
# Encoded:  c1cc2ccc13         ← Both [*] became ring closures
# Decoded:  c1ccccc1           ← All attachment points lost
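
One detail worth checking in the encoded strings above: each one contains a ring-closure digit that is opened but never closed (3 in the purine-like case, 2 in benzene, 2 and 3 in the dual case). If SAFE uses open ring closures to mark attachment points (an assumption on my part, not verified against the library), the [*] information may survive encoding and only be dropped on decoding. The helper below is a hypothetical plain-Python check, not part of the safe API, that lists unpaired ring-closure labels in a SMILES-like string:

```python
from collections import Counter


def unmatched_ring_closures(smiles: str) -> list[str]:
    """Return ring-closure labels that occur an odd number of times.

    Hypothetical diagnostic helper; it is not part of the safe package.
    """
    counts = Counter()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms: digits inside are isotopes, H counts or charges.
            i = smiles.index("]", i) + 1
            continue
        if ch == "%":
            # Two-digit ring closure such as %12.
            counts[smiles[i + 1 : i + 3]] += 1
            i += 3
            continue
        if ch.isdigit():
            counts[ch] += 1
        i += 1
    return sorted(label for label, n in counts.items() if n % 2 == 1)


# Encoded strings copied from the outputs reported above.
for encoded in ["O=c1[nH]cnc2nc3ccc12", "c1ccc2cc1", "c1cc2ccc13"]:
    print(encoded, unmatched_ring_closures(encoded))
```

If the number of unpaired labels matches the number of [*] in the input, as it does for the three strings here, the question becomes how the prefix is consumed downstream rather than whether the encoder discards the sites.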

Impact

This issue affects the scaffold_decoration() method in safe/sample.py:

  1. scaffold_decoration() calls _completion() (line 662)
  2. _completion() encodes the scaffold using the same parameters (lines 325-332):
    with sf.utils.attr_as(self.safe_encoder, "slicer", None):
        encoded_fragment = self.safe_encoder.encoder(
            fragment,
            canonical=False,
            randomize=True,
            constraints=None,
            allow_empty=True,  # ← Same parameter
            seed=new_seed,
        )
  3. The encoded scaffold (with attachment points removed) is used as a generation prefix
  4. The model generates from a complete closed molecule instead of decorating attachment points

Result: When using scaffold decoration for RL fine-tuning, the model generates minimal molecules (e.g., methane) instead of scaffold-containing structures because the scaffold is already complete.

Questions

  1. Is this the intended behavior for scaffold_decoration()?
  2. How should scaffolds with [*] be encoded for use as generation prefixes while preserving attachment points?
  3. Is there a way to preserve attachment points during SAFE encoding?
  4. Should we use a different approach for scaffold-constrained generation in RL settings?

Use Case

We're using SAFE for RL-based molecular optimization with scaffold constraints:

  1. User provides scaffold SMILES with [*] attachment points
  2. Scaffold should be SAFE-encoded for use as training prefix
  3. Model should learn to generate molecules that start with the scaffold and add decorations at attachment points
  4. Current behavior: Model generates from closed scaffolds, produces minimal molecules

Is scaffold decoration designed to work differently than we expect, or is this a bug in the encoding logic?

Test Script

I've created a comprehensive test script that demonstrates the issue:

#!/usr/bin/env python
"""Test SAFE scaffold encoding with attachment points."""

from safe import SAFEConverter
import safe as sf

encoder = SAFEConverter()

test_cases = [
    'O=c1[nH]cnc2nc([*])ccc12',  # Purine-like bicyclic
    'c1ccc([*])cc1',              # Benzene
    'c1cc([*])ccc1[*]',           # Dual attachment
]

for scaffold in test_cases:
    with sf.utils.attr_as(encoder, 'slicer', None):
        encoded = encoder.encoder(scaffold, allow_empty=True)
    decoded = encoder.decoder(encoded)
    
    print(f"Original: {scaffold}")
    print(f"Encoded:  {encoded}")
    print(f"Decoded:  {decoded}")
    # RDKit-style output may write a dummy atom as a bare '*', so test for '*' rather than '[*]'.
    print(f"[*] preserved: {'*' in decoded}")
    print()

All test cases lose their attachment points during encoding.
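
To quantify the loss without depending on how a dummy atom happens to be written, here is a minimal sketch that counts attachment points in a SMILES string with a regular expression. The function name count_attachment_points and the pattern are my own (hypothetical, not from safe or RDKit); the pattern accepts a bare `*` as well as bracketed forms such as [*], [*:1] and [1*]. The decoded strings are the ones reported in this issue.

```python
import re

# Matches a bracketed dummy atom ([*], [*:1], [1*]) or a bare '*'.
DUMMY_ATOM = re.compile(r"\[\d*\*[^\]]*\]|\*")


def count_attachment_points(smiles: str) -> int:
    """Count dummy atoms in a SMILES string (hypothetical helper)."""
    return len(DUMMY_ATOM.findall(smiles))


# (original scaffold, decoded string) pairs copied from the outputs in this issue.
reported = [
    ("O=c1[nH]cnc2nc([*])ccc12", "O=c1[nH]cnc2ncccc12"),
    ("c1ccc([*])cc1", "c1ccccc1"),
    ("c1cc([*])ccc1[*]", "c1ccccc1"),
]

# Number of attachment points lost in each round trip.
lost = [
    count_attachment_points(original) - count_attachment_points(decoded)
    for original, decoded in reported
]

for (original, decoded), n_lost in zip(reported, lost):
    print(f"{original} -> {decoded}: lost {n_lost} attachment point(s)")
```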
