Scaffold encoding removes [*] attachment points, breaking scaffold decoration

## Summary

When `SAFEConverter.encoder()` is given a scaffold containing `[*]` attachment points, each `[*]` is replaced with an unpaired ring-closure digit, and decoding the result yields a complete closed molecule. This breaks scaffold decoration because no attachment site is left to decorate.

## Environment

- SAFE version: (from finetune_safe fork based on datamol-io/safe)
- Python version: 3.10
- Usage context: RL fine-tuning with scaffold-constrained generation

## Reproduction

```python
from safe import SAFEConverter
import safe as sf

encoder = SAFEConverter()

# Purine scaffold with attachment point
scaffold_smiles = 'O=c1[nH]cnc2nc([*])ccc12'

# Encode with fragmentation disabled (as done in scaffold_decoration)
with sf.utils.attr_as(encoder, 'slicer', None):
    encoded = encoder.encoder(scaffold_smiles, allow_empty=True)

print(f'Original:  {scaffold_smiles}')  # O=c1[nH]cnc2nc([*])ccc12
print(f'Encoded:   {encoded}')           # O=c1[nH]cnc2nc3ccc12  ← [*] became 3!
print(f'Decoded:   {encoder.decoder(encoded)}')  # O=c1[nH]cnc2ncccc12
```

**Output:**
```
Original:  O=c1[nH]cnc2nc([*])ccc12  ← Has [*] attachment point
Encoded:   O=c1[nH]cnc2nc3ccc12      ← [*] replaced with ring closure 3
Decoded:   O=c1[nH]cnc2ncccc12       ← Complete closed molecule
```

## Additional Test Cases

**Simple benzene:**
```python
scaffold = 'c1ccc([*])cc1'
# Encoded:  c1ccc2cc1    ← [*] removed
# Decoded:  c1ccccc1     ← Closed ring
```

**Multiple attachment points:**
```python
scaffold = 'c1cc([*])ccc1[*]'  # 2 attachment points
# Encoded:  c1cc2ccc13         ← Both [*] became ring closures
# Decoded:  c1ccccc1           ← All attachment points lost
```
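The pattern in these cases can be checked without SAFE installed. The sketch below is a stdlib-only diagnostic, not part of the SAFE API: `unpaired_ring_closures` and `REPORTED_ENCODINGS` are hypothetical names, and the encoded strings are copied from the outputs reported above. It counts the ring-closure labels that are opened but never closed in each encoded string and compares that count with the number of `[*]` in the original scaffold.

```python
import re


def unpaired_ring_closures(smiles: str) -> list[str]:
    """Return ring-closure labels that are opened but never closed."""
    # Replace bracket atoms first so digits inside them (isotopes, charges,
    # atom maps) are not mistaken for ring-closure labels.
    stripped = re.sub(r"\[[^\]]*\]", "A", smiles)
    open_labels: list[str] = []
    for token in re.findall(r"%\d{2}|\d", stripped):
        label = str(int(token.lstrip("%")))  # "%03" and "3" are the same label
        if label in open_labels:
            open_labels.remove(label)  # second occurrence closes the ring
        else:
            open_labels.append(label)  # first occurrence opens it
    return open_labels


# Original scaffold -> encoded string, as reported in this issue.
REPORTED_ENCODINGS = {
    "O=c1[nH]cnc2nc([*])ccc12": "O=c1[nH]cnc2nc3ccc12",
    "c1ccc([*])cc1": "c1ccc2cc1",
    "c1cc([*])ccc1[*]": "c1cc2ccc13",
}

for scaffold, encoded in REPORTED_ENCODINGS.items():
    dangling = unpaired_ring_closures(encoded)
    print(f"{scaffold}: {scaffold.count('[*]')} [*], unpaired closures {dangling}")
```

In all three reported cases the number of unpaired labels equals the number of `[*]` in the input. That suggests the attachment positions may still be recoverable from the encoded string and are only dropped when it is decoded, but this is an inference from the outputs above, not a statement about how SAFE is designed.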

## Impact

This issue affects the `scaffold_decoration()` method in `safe/sample.py`:

1. `scaffold_decoration()` calls `_completion()` (line 662)
2. `_completion()` encodes the scaffold using the same parameters (lines 325-332):
   ```python
   with sf.utils.attr_as(self.safe_encoder, "slicer", None):
       encoded_fragment = self.safe_encoder.encoder(
           fragment,
           canonical=False,
           randomize=True,
           constraints=None,
           allow_empty=True,  # ← Same parameter
           seed=new_seed,
       )
   ```
3. The encoded scaffold (with attachment points removed) is used as a generation prefix
4. The model generates from a complete closed molecule instead of decorating attachment points

**Result:** When using scaffold decoration for RL fine-tuning, the model generates minimal molecules (e.g., methane) instead of scaffold-containing structures because the scaffold is already complete.

## Questions

1. **Is this the intended behavior** for `scaffold_decoration()`?
2. **How should scaffolds with `[*]` be encoded** for use as generation prefixes while preserving attachment points?
3. **Is there a way to preserve attachment points** during SAFE encoding?
4. **Should we use a different approach** for scaffold-constrained generation in RL settings?

## Use Case

We're using SAFE for RL-based molecular optimization with scaffold constraints:

1. User provides scaffold SMILES with `[*]` attachment points
2. Scaffold should be SAFE-encoded for use as training prefix
3. Model should learn to generate molecules that **start with the scaffold** and add decorations at attachment points
4. Current behavior: Model generates from closed scaffolds, produces minimal molecules

Is scaffold decoration designed to work differently than we expect, or is this a bug in the encoding logic?

## Test Script

The following script reproduces the issue for all three scaffolds:

```python
#!/usr/bin/env python
"""Test SAFE scaffold encoding with attachment points."""

from safe import SAFEConverter
import safe as sf

encoder = SAFEConverter()

test_cases = [
    'O=c1[nH]cnc2nc([*])ccc12',  # Purine
    'c1ccc([*])cc1',              # Benzene
    'c1cc([*])ccc1[*]',           # Dual attachment
]

for scaffold in test_cases:
    with sf.utils.attr_as(encoder, 'slicer', None):
        encoded = encoder.encoder(scaffold, allow_empty=True)
    decoded = encoder.decoder(encoded)
    
    print(f"Original: {scaffold}")
    print(f"Encoded:  {encoded}")
    print(f"Decoded:  {decoded}")
    print(f"[*] preserved: {'[*]' in decoded}")
    print()
```

In all three test cases, the decoded molecule no longer contains any `[*]` attachment point.
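The `'[*]' in decoded` check in the script only tests whether at least one attachment point survives, so a dual-attachment scaffold that kept one of its two `[*]` would still pass. A stricter variant compares counts. The sketch below is a minimal, stdlib-only version under stated assumptions: `count_dummies`, `lost_attachment_points`, and `REPORTED_ROUND_TRIPS` are hypothetical names, and the round trip is stubbed with the decoded strings reported in this issue so the block runs without SAFE.

```python
import re
from typing import Callable, Iterable

# Matches bracketed dummy atoms such as [*], [1*] and [*:1].
DUMMY_ATOM = re.compile(r"\[\d*\*(?::\d+)?\]")


def count_dummies(smiles: str) -> int:
    """Count bracketed dummy atoms (attachment points) in a SMILES string."""
    return len(DUMMY_ATOM.findall(smiles))


def lost_attachment_points(
    scaffolds: Iterable[str], round_trip: Callable[[str], str]
) -> dict[str, int]:
    """Map each scaffold to the number of attachment points lost by round_trip."""
    return {s: count_dummies(s) - count_dummies(round_trip(s)) for s in scaffolds}


# Stub for encode + decode: original scaffold -> decoded string, as reported above.
REPORTED_ROUND_TRIPS = {
    "O=c1[nH]cnc2nc([*])ccc12": "O=c1[nH]cnc2ncccc12",
    "c1ccc([*])cc1": "c1ccccc1",
    "c1cc([*])ccc1[*]": "c1ccccc1",
}

lost = lost_attachment_points(REPORTED_ROUND_TRIPS, REPORTED_ROUND_TRIPS.__getitem__)
for scaffold, n_lost in lost.items():
    print(f"{scaffold}: lost {n_lost} of {count_dummies(scaffold)} attachment point(s)")
```

With SAFE installed, `round_trip` could instead be a function that wraps the `encoder.encoder(...)` and `encoder.decoder(...)` calls from the test script.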

