Improve AIL test pass rates to meet targets #64
Description
Summary
The Agent-in-the-Loop (AIL) testing framework has been implemented in PR #63, but test pass rates don't yet meet the defined targets. This issue tracks the remaining work to improve agent performance.
Current vs Target Pass Rates
| Tier | Current | Target | Gap |
|---|---|---|---|
| Tier 1 | 67% (2/3) | >90% | -23 pp |
| Tier 2 | 67% (2/3) | >75% | -8 pp |
| Tier 3 | 100% (3/3) | >60% | ✅ Met |
Identified Issues
1. Search Ranking for Specific Content
Problem: The Art of War terrain/ground content exists in the database but hybrid search doesn't rank it highly enough for the agent to find it consistently.
Evidence: Direct database queries show 9+ chunks with terrain/ground content, but agent searches fail to surface them.
Potential Solutions:
- Investigate vector embedding quality for this content
- Review BM25 tokenization for chapter titles
- Consider boosting exact phrase matches
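One way to try the exact-phrase boost is a re-ranking pass over the fused results. The sketch below is illustrative only: the `SearchHit` shape, the `boostExactPhrase` name, and the 1.5 multiplier are assumptions, not the actual API of the hybrid search service, and it assumes scores are positive with higher meaning better.

```typescript
// Hypothetical result shape; the real service's types may differ.
interface SearchHit {
  id: string;
  text: string;
  score: number; // fused hybrid score, assumed positive, higher is better
}

// Multiply the score of any hit that contains the query as an exact,
// case-insensitive phrase, then re-sort by descending score.
function boostExactPhrase(
  hits: SearchHit[],
  query: string,
  boost = 1.5, // assumed starting value; would need tuning
): SearchHit[] {
  const phrase = query.trim().toLowerCase();
  if (phrase.length === 0) return [...hits];
  return hits
    .map((hit) =>
      hit.text.toLowerCase().includes(phrase)
        ? { ...hit, score: hit.score * boost }
        : hit,
    )
    .sort((a, b) => b.score - a.score);
}
```

A chunk that contains the phrase but was ranked slightly below an unrelated chunk would move to the top after the boost, which is the behavior the terrain/ground queries appear to need.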
2. LLM Non-Determinism in Aggregate Tests
Problem: Aggregate tests re-run all scenarios independently, and LLM variance causes different results between runs, making pass rates inconsistent.
Potential Solutions:
- Cache individual test results for aggregate calculations
- Lower the sampling temperature further (currently 0.1)
- Run multiple iterations and use majority vote
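The majority-vote option could look roughly like the following. This is a minimal sketch, not the framework's implementation: `ScenarioRunner`, `majorityPassed`, and `passesByMajority` are hypothetical names, and the default of three iterations is an assumption.

```typescript
// Hypothetical: one execution of a single AIL scenario, resolving to pass/fail.
type ScenarioRunner = () => Promise<boolean>;

// True when strictly more than half of the recorded runs passed.
function majorityPassed(results: boolean[]): boolean {
  const passes = results.filter(Boolean).length;
  return passes > results.length / 2;
}

// Run the scenario an odd number of times so a tie is impossible,
// then decide by majority.
async function passesByMajority(
  run: ScenarioRunner,
  iterations = 3, // assumed default
): Promise<boolean> {
  if (iterations < 1 || iterations % 2 === 0) {
    throw new Error("iterations must be a positive odd number");
  }
  const results: boolean[] = [];
  for (let i = 0; i < iterations; i++) {
    results.push(await run());
  }
  return majorityPassed(results);
}
```

Each extra iteration multiplies the LLM cost of the scenario, so this trades spend for stability; caching per-scenario results for the aggregate calculation avoids that cost but does not reduce variance in the underlying runs.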
3. Residual Search Narration
Problem: Despite explicit rules, the agent occasionally outputs search narration ("Let me try searching...") instead of synthesizing answers.
Evidence: Art of War formations test answer: 'Let me try searching for "six kinds" or "nine kinds" more directly:...'
Potential Solutions:
- Strengthen agent-quick-rules.md prohibitions
- Add post-processing to detect/reject narration
- Consider fine-tuned model or few-shot examples
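A post-processing check for narration might start as a small pattern list. The patterns below are assumptions based only on the example quoted in this issue; they are not taken from agent-quick-rules.md, and `looksLikeSearchNarration` is a hypothetical helper name.

```typescript
// Assumed narration openers, matched case-insensitively at the start of any line.
const NARRATION_PATTERNS: RegExp[] = [
  /^\s*let me (try|search|look|check)\b/im,
  /^\s*i(?:'ll| will) (try|search|look)\b/im,
  /^\s*searching for\b/im,
];

// Returns true when any line of the answer opens with a narration phrase.
function looksLikeSearchNarration(answer: string): boolean {
  return NARRATION_PATTERNS.some((pattern) => pattern.test(answer));
}
```

A test harness could fail or retry any answer for which this returns true. A regex list like this will miss paraphrased narration, so it is better treated as a cheap first filter than as proof that an answer is clean.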
4. Model Cost vs Performance Tradeoff
Observation: Claude Sonnet 4 performs better but costs more. Claude Haiku 4.5 is more cost-effective but less reliable at following complex instructions.
Potential Solutions:
- Use Sonnet for complex Tier 3 tasks, Haiku for simple Tier 1
- Investigate other models (GPT-4o, Gemini)
- Document cost/performance tradeoffs for users
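The tier-based split could be expressed as a small lookup in the test configuration. This is a sketch under assumptions: `Tier`, `ModelFamily`, and `modelForTier` are hypothetical names, the family labels stand in for whatever model identifiers the config actually uses, and the Tier 2 default is a guess because the issue only assigns Tier 1 and Tier 3.

```typescript
type Tier = 1 | 2 | 3;

// Placeholder family labels, not real API model identifiers.
type ModelFamily = "haiku" | "sonnet";

// Tier 1 -> Haiku and Tier 3 -> Sonnet follow the proposal above.
// Tier 2 is left configurable; defaulting it to Sonnet is an assumption.
function modelForTier(tier: Tier, tier2Model: ModelFamily = "sonnet"): ModelFamily {
  switch (tier) {
    case 1:
      return "haiku";
    case 2:
      return tier2Model;
    case 3:
      return "sonnet";
  }
}
```

Keeping Tier 2 as a parameter makes it easy to run the same scenarios against both families and record the cost/performance difference.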
Related
- PR #63 (feat: Add Agent-in-the-Loop E2E Testing) - Initial AIL implementation
- Issue #49 (Add Agent-in-the-loop E2E Testing) - Original AIL testing issue
- ADR 0055 - Architecture decision for AIL testing
Files to Investigate
- src/infrastructure/search/conceptual-hybrid-search-service.ts - Hybrid search ranking
- prompts/agent-quick-rules.md - Agent behavior rules
- src/__tests__/ail/config.ts - Model configuration
Success Criteria
- Tier 1 pass rate >90%
- Tier 2 pass rate >75%
- Tool selection correctness >85%
- No search narration in final answers
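The pass-rate criteria are strict thresholds, which matters at the current sample size. With three scenarios per tier, 2/3 is about 66.7%, so both the Tier 1 (>90%) and Tier 2 (>75%) targets can only be met at 3/3. A minimal check, with `TierResult` and `meetsTarget` as hypothetical names:

```typescript
interface TierResult {
  passed: number;
  total: number;
}

// Strictly greater than the target, matching the ">90%" style of the criteria.
function meetsTarget(result: TierResult, targetPercent: number): boolean {
  if (result.total <= 0) return false;
  return (result.passed / result.total) * 100 > targetPercent;
}
```

Adding scenarios per tier would make the targets reachable without a perfect run; for example, 10/11 is about 90.9% and would satisfy the Tier 1 threshold.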