Context
We are evaluating ACE on a lending compliance review task (domain task) (50 train / 15 val / 25 test cases, Gemini 2.0 Flash, offline mode, 3 epochs). The framework works well — we achieved +13.3% validation improvement and +8% test improvement with an evolved playbook.
However, we observed significant validation accuracy oscillation during training that we traced to playbook bloat caused by the Curator only supporting ADD operations.
Observation: Validation Oscillation
Our validation trajectory over 3 epochs:
| Checkpoint | Val Accuracy | Playbook Size |
| --- | --- | --- |
| Epoch 1 Step 25 | 80.0% | ~2,000 tokens |
| Epoch 1 Step 50 | 60.0% | ~4,000 tokens |
| Epoch 2 Step 25 | 93.3% | ~10,000 tokens |
| Epoch 2 Step 50 | 53.3% | ~15,000 tokens |
| Epoch 3 Step 25 | 73.3% | ~17,000 tokens |
| Epoch 3 Step 50 | 66.7% | ~20,000 tokens |
Val accuracy peaks mid-epoch then drops sharply. The best playbook (93.3%) was correctly saved by the best-playbook mechanism, but subsequent training degrades performance rather than improving it.
Root Cause: Playbook Bloat
Our best playbook (epoch 2 step 25) has 102 bullets. Breakdown:
| Category | Count | Description |
| --- | --- | --- |
| High performing | 20 | helpful > 5, harmful < 2 — the valuable bullets |
| Unused | 45 | helpful = 0, harmful = 0 — never cited by Generator |
| Low signal | 35 | Cited a few times, unclear value |
| Problematic | 2 | harmful >= helpful — actively hurting performance |
44% of bullets are dead weight and 2% are actively harmful, but they remain in the playbook because the Curator can only ADD. The Generator must process all 102 bullets in its context window, which dilutes the signal from the 20 high-performing bullets.
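For illustration, this breakdown can be reproduced with a small audit pass over each bullet's citation counters. This is a sketch, not ACE's actual code; the counter names `helpful`/`harmful` and the category thresholds are taken from the table above:

```python
def categorize_bullet(helpful: int, harmful: int) -> str:
    """Classify a bullet by its citation counters (thresholds from our audit)."""
    if helpful == 0 and harmful == 0:
        return "unused"           # never cited by the Generator
    if harmful >= helpful:
        return "problematic"      # actively hurting performance
    if helpful > 5 and harmful < 2:
        return "high_performing"  # the valuable bullets
    return "low_signal"           # cited a few times, unclear value
```

Running this over our 102 bullets is what produced the 20 / 45 / 35 / 2 split in the table.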
Current State in Code
`playbook_utils.py:96-104` documents the gap:

```python
def apply_curator_operations(playbook_text, operations, next_id):
    """
    TODO: Future Operations (not implemented yet)
    - UPDATE: Rewrite existing bullets to be more accurate or comprehensive
    - MERGE: Combine related bullets into stronger ones
    - CREATE_META: Add high-level strategy sections
    - DELETE: Remove outdated or incorrect bullets (if needed)
    """
```
The Curator prompt only offers ADD as an available operation. Even if the LLM suggested DELETE/MERGE, `apply_curator_operations` would not execute it.
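To make the gap concrete, here is a minimal sketch of what DELETE dispatch could look like. It assumes bullets have been parsed into dicts with an `id` field and that operations arrive as `{"type": ..., ...}` dicts; the real `apply_curator_operations` works on playbook text, so an actual implementation would round-trip through the existing parser:

```python
def apply_operations(bullets, operations):
    """Apply Curator operations to a parsed bullet list (sketch, not ACE's API).

    Supports ADD and DELETE; each operation is a dict such as
    {"type": "ADD", "bullet": {...}} or {"type": "DELETE", "id": "b-17"}.
    """
    bullets = list(bullets)  # don't mutate the caller's list
    for op in operations:
        if op["type"] == "ADD":
            bullets.append(op["bullet"])
        elif op["type"] == "DELETE":
            bullets = [b for b in bullets if b["id"] != op["id"]]
        else:
            raise ValueError(f"unsupported operation: {op['type']}")
    return bullets
```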
Suggested Improvements
We think these would meaningfully improve multi-epoch training stability:
- **DELETE**: Remove bullets where `harmful > helpful` (the 2 problematic bullets in our case). The helpful/harmful scoring data already exists — this is low-hanging fruit.
- **MERGE**: Combine bullets with similar content in the same section. The BulletpointAnalyzer already does similarity detection — extending this to the Curator would give the LLM control over consolidation.
- **PRUNE unused**: After N steps, remove bullets with `helpful = 0, harmful = 0`. If the Generator never cites a bullet across 50+ training samples, it's not contributing.