Improve AIL test pass rates to meet targets #64

## Summary

The Agent-in-the-Loop (AIL) testing framework has been implemented in PR #63, but test pass rates don't yet meet the defined targets. This issue tracks the remaining work to improve agent performance.

## Current vs Target Pass Rates

| Tier | Current | Target | Gap |
|------|---------|--------|-----|
| Tier 1 | 67% (2/3) | >90% | -23% |
| Tier 2 | 67% (2/3) | >75% | -8% |
| Tier 3 | 100% (3/3) | >60% | ✅ Met |

## Identified Issues

### 1. Search Ranking for Specific Content
**Problem:** The Art of War terrain/ground content exists in the database but hybrid search doesn't rank it highly enough for the agent to find it consistently.

**Evidence:** Direct database queries show 9+ chunks with terrain/ground content, but agent searches fail to surface them.

**Potential Solutions:**
- Investigate vector embedding quality for this content
- Review BM25 tokenization for chapter titles
- Consider boosting exact phrase matches
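
The exact-phrase boost could be prototyped as a re-ranking step applied to the hybrid results. The following is a minimal sketch under stated assumptions: the `RankedChunk` shape, the `boostExactPhrase` name, and the `1.5` multiplier are all hypothetical and are not taken from `conceptual-hybrid-search-service.ts`.

```typescript
// Hypothetical shape of a ranked chunk; the real service's types may differ.
interface RankedChunk {
  id: string;
  text: string;
  score: number; // combined hybrid score, higher is better
}

// Multiply the score of any chunk that contains the query as an exact
// (case-insensitive) phrase, then re-sort by score, highest first.
function boostExactPhrase(
  chunks: RankedChunk[],
  query: string,
  boost: number = 1.5 // illustrative value, not tuned
): RankedChunk[] {
  const phrase = query.trim().toLowerCase();
  if (phrase.length === 0) return chunks.slice();
  return chunks
    .map((c) =>
      c.text.toLowerCase().indexOf(phrase) !== -1
        ? { ...c, score: c.score * boost }
        : c
    )
    .sort((a, b) => b.score - a.score);
}
```

A multiplicative boost keeps the original ordering among chunks that all match (or all fail to match) the phrase, so it only changes the ranking where an exact match competes with a non-match.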

### 2. LLM Non-Determinism in Aggregate Tests
**Problem:** Aggregate tests re-run every scenario independently instead of reusing the individual test results. LLM variance then produces different outcomes between runs, which makes the reported pass rates inconsistent.

**Potential Solutions:**
- Cache individual test results for aggregate calculations
- Lower the temperature further (currently 0.1)
- Run multiple iterations and use majority vote
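
Caching and majority voting can be combined: record each scenario run once, then compute the aggregate from the recorded outcomes rather than re-running. The sketch below assumes a simple in-memory cache keyed by scenario id; `ResultCache`, `recordResult`, `majorityPassed`, and `aggregatePassRate` are hypothetical names, not part of the existing AIL framework.

```typescript
// Hypothetical cache: scenario id -> pass/fail outcome of each recorded run.
type ResultCache = Record<string, boolean[]>;

// Append one run's outcome to the cache for the given scenario.
function recordResult(cache: ResultCache, scenarioId: string, passed: boolean): void {
  if (!cache[scenarioId]) cache[scenarioId] = [];
  cache[scenarioId].push(passed);
}

// A scenario passes if strictly more than half of its recorded runs passed.
function majorityPassed(runs: boolean[]): boolean {
  const passes = runs.filter(Boolean).length;
  return passes * 2 > runs.length;
}

// Aggregate pass rate computed from cached outcomes, with no re-runs.
function aggregatePassRate(cache: ResultCache): number {
  const ids = Object.keys(cache);
  if (ids.length === 0) return 0;
  const passed = ids.filter((id) => majorityPassed(cache[id])).length;
  return passed / ids.length;
}
```

Using an odd number of iterations per scenario avoids ties. With an even count, this sketch treats a tie as a failure.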

### 3. Residual Search Narration
**Problem:** Despite explicit rules against it, the agent occasionally returns search narration ("Let me try searching...") as its final answer instead of synthesizing one.

**Evidence:** Art of War formations test answer: `'Let me try searching for "six kinds" or "nine kinds" more directly:...'`

**Potential Solutions:**
- Strengthen the prohibitions in `prompts/agent-quick-rules.md`
- Add post-processing to detect/reject narration
- Consider fine-tuned model or few-shot examples
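
The post-processing option could start as a small pattern check on the final answer. This is a rough sketch only: the `NARRATION_PATTERNS` list and the `looksLikeSearchNarration` name are hypothetical, and the patterns are based on the single narration example quoted above, so a real list would need to be built from more observed failures.

```typescript
// Illustrative patterns only, derived from the one observed failure.
const NARRATION_PATTERNS: RegExp[] = [
  /^\s*let me (try|search|look)/i,
  /^\s*i('ll| will) (try|search|look)/i,
];

// Returns true when the answer opens with search narration rather than content.
function looksLikeSearchNarration(answer: string): boolean {
  return NARRATION_PATTERNS.some((p) => p.test(answer));
}
```

A flagged answer could be rejected and retried, or simply counted as a test failure. Anchoring the patterns to the start of the answer keeps false positives low for answers that only mention searching in passing.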

### 4. Model Cost vs Performance Tradeoff
**Observation:** Claude Sonnet 4 performs better but costs more. Claude Haiku 4.5 is more cost-effective but less reliable at following complex instructions.

**Potential Solutions:**
- Use Sonnet for complex Tier 3 tasks, Haiku for simple Tier 1
- Investigate other models (GPT-4o, Gemini)
- Document cost/performance tradeoffs for users
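
Routing by tier could be expressed as a small lookup in the test configuration. The sketch below is an assumption about how that might look: the model strings are placeholders rather than real API model identifiers, `MODEL_BY_TIER` and `modelForTier` are hypothetical names, and the Tier 2 assignment is a guess, since only Tier 1 and Tier 3 are specified above.

```typescript
type Tier = 1 | 2 | 3;

// Placeholder identifiers, not real API model IDs; substitute whatever
// the project's model configuration actually uses.
const MODEL_BY_TIER: Record<Tier, string> = {
  1: "haiku-placeholder", // simple tasks: cheaper model
  2: "haiku-placeholder", // assumption: not specified in this issue
  3: "sonnet-placeholder", // complex tasks: stronger model
};

// Look up the model to use for a given test tier.
function modelForTier(tier: Tier): string {
  return MODEL_BY_TIER[tier];
}
```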

## Related

- PR #63 - Initial AIL implementation
- Issue #49 - Original AIL testing issue
- ADR 0055 - Architecture decision for AIL testing

## Files to Investigate

- `src/infrastructure/search/conceptual-hybrid-search-service.ts` - Hybrid search ranking
- `prompts/agent-quick-rules.md` - Agent behavior rules
- `src/__tests__/ail/config.ts` - Model configuration

## Success Criteria

- [ ] Tier 1 pass rate >90%
- [ ] Tier 2 pass rate >75%
- [ ] Tool selection correctness >85%
- [ ] No search narration in final answers
