Improve AIL test pass rates to meet targets #64

@m2ux

Description

Summary

The Agent-in-the-Loop (AIL) testing framework has been implemented in PR #63, but test pass rates don't yet meet the defined targets. This issue tracks the remaining work to improve agent performance.

Current vs Target Pass Rates

Tier    Current      Target   Gap
Tier 1  67% (2/3)    >90%     -23%
Tier 2  67% (2/3)    >75%     -8%
Tier 3  100% (3/3)   >60%     ✅ Met

Identified Issues

1. Search Ranking for Specific Content

Problem: The Art of War terrain/ground content exists in the database but hybrid search doesn't rank it highly enough for the agent to find it consistently.

Evidence: Direct database queries show 9+ chunks with terrain/ground content, but agent searches fail to surface them.

Potential Solutions:

  • Investigate vector embedding quality for this content
  • Review BM25 tokenization for chapter titles
  • Consider boosting exact phrase matches
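The exact-phrase boost could be layered on top of the existing hybrid score. A minimal sketch, assuming a combined vector+BM25 score per chunk; the `ScoredChunk` shape, `PHRASE_BOOST` constant, and `boostExactPhrase` helper are illustrative, not the actual API of `conceptual-hybrid-search-service.ts`:

```typescript
// Hypothetical re-ranking pass: multiply the hybrid score of any chunk
// that contains the query verbatim, then re-sort. All names are sketches.
interface ScoredChunk {
  text: string;
  score: number; // combined vector + BM25 score from the hybrid search
}

const PHRASE_BOOST = 1.5; // tunable multiplier for exact phrase matches

function boostExactPhrase(query: string, chunks: ScoredChunk[]): ScoredChunk[] {
  const phrase = query.toLowerCase();
  return chunks
    .map((c) =>
      c.text.toLowerCase().includes(phrase)
        ? { ...c, score: c.score * PHRASE_BOOST }
        : c,
    )
    .sort((a, b) => b.score - a.score);
}
```

A boost like this would let the terrain/ground chunks outrank generically similar chunks whenever the query phrase appears verbatim, without changing the embedding or BM25 pipelines.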

2. LLM Non-Determinism in Aggregate Tests

Problem: Aggregate tests re-run all scenarios independently, and LLM variance causes different results between runs, making pass rates inconsistent.

Potential Solutions:

  • Cache individual test results for aggregate calculations
  • Lower the sampling temperature further (currently 0.1)
  • Run multiple iterations and use majority vote
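The majority-vote option is cheap to sketch: run each scenario N times and count it as passing only if more than half the runs pass. This is a hypothetical helper, not existing code in the AIL framework:

```typescript
// Hypothetical majority-vote aggregation over repeated runs of one
// scenario: pass iff strictly more than half of the runs passed.
function majorityPass(runResults: boolean[]): boolean {
  const passes = runResults.filter(Boolean).length;
  return passes * 2 > runResults.length;
}
```

With an odd N (e.g. 3 or 5 runs) there is never a tie, which keeps the aggregate pass rate stable against single-run LLM variance at the cost of N× inference.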

3. Residual Search Narration

Problem: Despite explicit rules, the agent occasionally outputs search narration ("Let me try searching...") instead of synthesizing answers.

Evidence: Art of War formations test answer: 'Let me try searching for "six kinds" or "nine kinds" more directly:...'

Potential Solutions:

  • Strengthen agent-quick-rules.md prohibitions
  • Add post-processing to detect/reject narration
  • Consider fine-tuned model or few-shot examples
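The post-processing option could start as a simple pattern check over the final answer. The patterns below are illustrative guesses at narration phrasing (seeded from the evidence quote above), not a vetted list:

```typescript
// Hypothetical narration detector: flag answers that read like search
// narration ("Let me try searching...") instead of a synthesized answer.
const NARRATION_PATTERNS: RegExp[] = [
  /let me (try )?search/i,
  /i('ll| will) (try )?search/i,
  /trying another search/i,
];

function isSearchNarration(answer: string): boolean {
  return NARRATION_PATTERNS.some((p) => p.test(answer));
}
```

A detector like this could reject (or retry) a response before it reaches the test assertion, making the "no search narration" success criterion enforceable rather than prompt-dependent.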

4. Model Cost vs Performance Tradeoff

Observation: Claude Sonnet 4 performs better but costs more. Claude Haiku 4.5 is more cost-effective but less reliable at following complex instructions.

Potential Solutions:

  • Use Sonnet for complex Tier 3 tasks, Haiku for simple Tier 1
  • Investigate other models (GPT-4o, Gemini)
  • Document cost/performance tradeoffs for users
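Tier-based routing could be a small lookup in the test configuration. A sketch only: the model ID strings and the `modelForTier` helper are placeholders, not confirmed values from `src/__tests__/ail/config.ts`:

```typescript
// Hypothetical tier-based model routing: cheaper model for simple tiers,
// stronger (more expensive) model for complex Tier 3 tasks.
type Tier = 1 | 2 | 3;

function modelForTier(tier: Tier): string {
  // Assumption: only Tier 3 scenarios need the stronger model;
  // model IDs here are illustrative placeholders.
  return tier === 3 ? "claude-sonnet-4" : "claude-haiku-4.5";
}
```

This keeps per-run cost proportional to task difficulty and gives a single place to document the cost/performance tradeoff for users.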

Related

Files to Investigate

  • src/infrastructure/search/conceptual-hybrid-search-service.ts - Hybrid search ranking
  • prompts/agent-quick-rules.md - Agent behavior rules
  • src/__tests__/ail/config.ts - Model configuration

Success Criteria

  • Tier 1 pass rate >90%
  • Tier 2 pass rate >75%
  • Tool selection correctness >85%
  • No search narration in final answers
