Context
We are evaluating ACE on a lending compliance review task (domain task) (50 train / 15 val / 25 test cases, Gemini 2.0 Flash, offline mode, 3 epochs). The framework works well — we achieved +13.3% validation improvement and +8% test improvement with an evolved playbook.
However, we observed significant validation accuracy oscillation during training that we traced to playbook bloat caused by the Curator only supporting ADD operations.
Observation: Validation Oscillation
Our validation trajectory over 3 epochs:
| Checkpoint | Val Accuracy | Playbook Size |
| --- | --- | --- |
| Epoch 1 Step 25 | 80.0% | ~2,000 tokens |
| Epoch 1 Step 50 | 60.0% | ~4,000 tokens |
| Epoch 2 Step 25 | 93.3% | ~10,000 tokens |
| Epoch 2 Step 50 | 53.3% | ~15,000 tokens |
| Epoch 3 Step 25 | 73.3% | ~17,000 tokens |
| Epoch 3 Step 50 | 66.7% | ~20,000 tokens |
Val accuracy peaks mid-epoch then drops sharply. The best playbook (93.3%) was correctly saved by the best-playbook mechanism, but subsequent training degrades performance rather than improving it.
Root Cause: Playbook Bloat
Our best playbook (epoch 2 step 25) has 102 bullets. Breakdown:
| Category | Count | Description |
| --- | --- | --- |
| High performing | 20 | helpful > 5, harmful < 2 — the valuable bullets |
| Unused | 45 | helpful = 0, harmful = 0 — never cited by Generator |
| Low signal | 35 | Cited a few times, unclear value |
| Problematic | 2 | harmful >= helpful — actively hurting performance |
44% of bullets are dead weight and 2% are actively harmful, but they remain in the playbook because the Curator can only ADD. The Generator must process all 102 bullets in its context window, which dilutes the signal from the 20 high-performing bullets.
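For illustration, this breakdown can be reproduced with a small audit pass over each bullet's citation counters. This is a sketch, not ACE's actual code; the counter names `helpful`/`harmful` and the category thresholds are taken from the table above:

```python
def categorize_bullet(helpful: int, harmful: int) -> str:
    """Classify a bullet by its citation counters (thresholds from our audit)."""
    if helpful == 0 and harmful == 0:
        return "unused"           # never cited by the Generator
    if harmful >= helpful:
        return "problematic"      # actively hurting performance
    if helpful > 5 and harmful < 2:
        return "high_performing"  # the valuable bullets
    return "low_signal"           # cited a few times, unclear value
```

Running this over our 102 bullets is what produced the 20 / 45 / 35 / 2 split in the table.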
Current State in Code
`playbook_utils.py:96-104` documents the gap:

```python
def apply_curator_operations(playbook_text, operations, next_id):
    """
    TODO: Future Operations (not implemented yet)
    - UPDATE: Rewrite existing bullets to be more accurate or comprehensive
    - MERGE: Combine related bullets into stronger ones
    - CREATE_META: Add high-level strategy sections
    - DELETE: Remove outdated or incorrect bullets (if needed)
    """
```
The Curator prompt only offers ADD as an available operation. Even if the LLM suggested DELETE/MERGE, `apply_curator_operations` would not execute it.
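To make the gap concrete, here is a minimal sketch of what DELETE dispatch could look like. It assumes bullets have been parsed into dicts with an `id` field and that operations arrive as `{"type": ..., ...}` dicts; the real `apply_curator_operations` works on playbook text, so an actual implementation would round-trip through the existing parser:

```python
def apply_operations(bullets, operations):
    """Apply Curator operations to a parsed bullet list (sketch, not ACE's API).

    Supports ADD and DELETE; each operation is a dict such as
    {"type": "ADD", "bullet": {...}} or {"type": "DELETE", "id": "b-17"}.
    """
    bullets = list(bullets)  # don't mutate the caller's list
    for op in operations:
        if op["type"] == "ADD":
            bullets.append(op["bullet"])
        elif op["type"] == "DELETE":
            bullets = [b for b in bullets if b["id"] != op["id"]]
        else:
            raise ValueError(f"unsupported operation: {op['type']}")
    return bullets
```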
Suggested Improvements
We think these would meaningfully improve multi-epoch training stability:
- **DELETE**: Remove bullets where `harmful > helpful` (the 2 problematic bullets in our case). The helpful/harmful scoring data already exists — this is low-hanging fruit.
- **MERGE**: Combine bullets with similar content in the same section. The BulletpointAnalyzer already does similarity detection — extending this to the Curator would give the LLM control over consolidation.
- **PRUNE unused**: After N steps, remove bullets with `helpful = 0, harmful = 0`. If the Generator never cites a bullet across 50+ training samples, it's not contributing.