Feature request: Implement UPDATE/MERGE/DELETE curator operations to address playbook bloat #26

@shuknk8s


Context

We are evaluating ACE on a domain-specific task, lending compliance review (50 train / 15 val / 25 test cases, Gemini 2.0 Flash, offline mode, 3 epochs). The framework works well: we achieved a +13.3% validation improvement and a +8% test improvement with an evolved playbook.

However, we observed significant validation accuracy oscillation during training that we traced to playbook bloat caused by the Curator only supporting ADD operations.

Observation: Validation Oscillation

Our validation trajectory over 3 epochs:

| Checkpoint | Val Accuracy | Playbook Size |
| --- | --- | --- |
| Epoch 1 Step 25 | 80.0% | ~2,000 tokens |
| Epoch 1 Step 50 | 60.0% | ~4,000 tokens |
| Epoch 2 Step 25 | 93.3% | ~10,000 tokens |
| Epoch 2 Step 50 | 53.3% | ~15,000 tokens |
| Epoch 3 Step 25 | 73.3% | ~17,000 tokens |
| Epoch 3 Step 50 | 66.7% | ~20,000 tokens |

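In every epoch, validation accuracy peaks at the mid-epoch checkpoint (step 25) and then drops by step 50, sharply in epochs 1 and 2, while the playbook keeps growing. The best playbook (93.3%) was correctly saved by the best-playbook mechanism, but subsequent training degrades performance rather than improving it.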

Root Cause: Playbook Bloat

Our best playbook (epoch 2 step 25) has 102 bullets. Breakdown:

| Category | Count | Description |
| --- | --- | --- |
| High performing | 20 | helpful > 5, harmful < 2; the valuable bullets |
| Unused | 45 | helpful = 0, harmful = 0; never cited by Generator |
| Low signal | 35 | Cited a few times, unclear value |
| Problematic | 2 | harmful >= helpful; actively hurting performance |

44% of bullets are dead weight and 2% are actively harmful, but they remain in the playbook because the Curator can only ADD. The Generator must process all 102 bullets in its context window, which dilutes the signal from the 20 high-performing bullets.
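The breakdown above can be derived mechanically from the per-bullet counters. Here is a minimal sketch, assuming each bullet exposes integer `helpful` and `harmful` counts. The `Bullet` class and the `categorize` and `summarize` functions are hypothetical names for illustration, not part of ACE, and "low signal" is simply whatever is left after the other three rules.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Bullet:
    # Hypothetical container; ACE's real playbook representation may differ.
    bullet_id: str
    helpful: int = 0
    harmful: int = 0


def categorize(bullet: Bullet) -> str:
    """Bucket a bullet using the thresholds from the breakdown above."""
    # Check "unused" first: 0 >= 0 would otherwise count as problematic.
    if bullet.helpful == 0 and bullet.harmful == 0:
        return "unused"
    if bullet.harmful >= bullet.helpful:
        return "problematic"
    if bullet.helpful > 5 and bullet.harmful < 2:
        return "high_performing"
    return "low_signal"


def summarize(bullets: list[Bullet]) -> Counter:
    """Count how many bullets fall into each category."""
    return Counter(categorize(b) for b in bullets)
```

The order of the checks matters: a never-cited bullet also satisfies harmful >= helpful, so it has to be classified as unused before the problematic rule is applied.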

Current State in Code

playbook_utils.py:96-104 documents the gap:

```python
def apply_curator_operations(playbook_text, operations, next_id):
    """
    TODO: Future Operations (not implemented yet)
    - UPDATE: Rewrite existing bullets to be more accurate or comprehensive
    - MERGE: Combine related bullets into stronger ones
    - CREATE_META: Add high-level strategy sections
    - DELETE: Remove outdated or incorrect bullets (if needed)
    """
```

The Curator prompt only offers ADD as an available operation. Even if the LLM suggested DELETE/MERGE, apply_curator_operations would not execute it.
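To make the gap concrete, here is a rough sketch of what a dispatcher handling more than ADD might look like. It is not the existing `apply_curator_operations`: it works on a plain dict of bullet id to text rather than on `playbook_text`, and the operation keys (`type`, `bullet_id`, `content`) and the `b-00001` id format are assumptions made for illustration only.

```python
def apply_operations(bullets, operations, next_id):
    """Apply ADD / UPDATE / DELETE operations to a copy of `bullets`.

    bullets:    dict mapping bullet id -> bullet text (hypothetical format)
    operations: list of dicts with assumed keys "type", "bullet_id", "content"
    next_id:    integer used to mint the id of the next added bullet

    Returns (updated_bullets, next_id, skipped_operations).
    """
    updated = dict(bullets)  # never mutate the caller's playbook
    skipped = []
    for op in operations:
        op_type = str(op.get("type", "")).upper()
        bullet_id = op.get("bullet_id")
        if op_type == "ADD":
            updated[f"b-{next_id:05d}"] = op["content"]
            next_id += 1
        elif op_type == "UPDATE" and bullet_id in updated:
            updated[bullet_id] = op["content"]
        elif op_type == "DELETE" and bullet_id in updated:
            del updated[bullet_id]
        else:
            # Unknown type or unknown bullet id: report it instead of
            # silently dropping it.
            skipped.append(op)
    return updated, next_id, skipped
```

Returning the skipped operations makes it visible when the LLM proposes something the implementation cannot execute, which is the failure mode described above.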

Suggested Improvements

We think these would meaningfully improve multi-epoch training stability:

  1. DELETE: Remove bullets where harmful > helpful (the 2 problematic bullets in our case). The helpful/harmful scoring data already exists — this is low-hanging fruit.

  2. MERGE: Combine bullets with similar content in the same section. The BulletpointAnalyzer already does similarity detection — extending this to the Curator would give the LLM control over consolidation.

  3. PRUNE unused: After N steps, remove bullets with helpful = 0, harmful = 0. If the Generator never cites a bullet across 50+ training samples, it's not contributing.
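The three suggestions above could be prototyped as candidate selectors that run before the Curator (or a post-processing step) decides what to remove or consolidate. This is only a sketch under stated assumptions: the `Bullet` fields, the `age_steps` counter, and the thresholds are hypothetical, and `difflib.SequenceMatcher` is a stand-in for whatever similarity measure `BulletpointAnalyzer` actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass
class Bullet:
    # Hypothetical fields; ACE's real bullet format may differ.
    bullet_id: str
    section: str
    content: str
    helpful: int = 0
    harmful: int = 0
    age_steps: int = 0  # training steps since the bullet was added


def delete_candidates(bullets):
    """DELETE: bullets that were harmful more often than helpful."""
    return [b.bullet_id for b in bullets if b.harmful > b.helpful]


def prune_candidates(bullets, min_age_steps=50):
    """PRUNE: bullets never cited after at least `min_age_steps` steps."""
    return [
        b.bullet_id
        for b in bullets
        if b.helpful == 0 and b.harmful == 0 and b.age_steps >= min_age_steps
    ]


def merge_candidates(bullets, threshold=0.8):
    """MERGE: pairs of similar bullets within the same section."""
    pairs = []
    for a, b in combinations(bullets, 2):
        if a.section != b.section:
            continue
        ratio = SequenceMatcher(None, a.content.lower(), b.content.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a.bullet_id, b.bullet_id))
    return pairs
```

The age check in `prune_candidates` keeps freshly added bullets from being removed before the Generator has had a chance to cite them. The merge selector only proposes pairs; writing the combined bullet would still be left to the LLM.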
