Issue 9: Verdict-generation and recommendation prompt layer
Description
Translate all forensics outputs into actionable pipeline recommendations. This is the layer that makes the project legible to a stakeholder who doesn't know what faithfulness scores mean.
The architecture is two-stage:
Stage 1 — Rule-based recommendation matching: A deterministic decision tree matches the combined forensics signal pattern to a root cause category and a named pipeline component. This is Python logic, not an LLM call.
Stage 2 — LLM rendering: Claude receives the matched rule, the raw metric values, and a rendering prompt, and writes one specific, concrete recommendation sentence. The LLM's job is fluency and specificity — not reasoning about what's wrong.
This separation matters. If Claude reasons from scratch each time, recommendations are inconsistent and sometimes wrong. If the decision tree matches the pattern and Claude only renders it, recommendations are reliable and the logic is inspectable.
Stage 1: Recommendation matrix
```python
# prompts/recommendation_rules.py
from dataclasses import dataclass


@dataclass
class RecommendationRule:
    rule_id: str
    root_cause: str
    pipeline_component: str  # what to fix
    action: str              # how to fix it
    render_hint: str         # what to tell Claude to emphasize


RECOMMENDATION_RULES = [
    RecommendationRule(
        rule_id="R01",
        root_cause="ambiguous_retrieval_weak_generation",
        pipeline_component="top-k and reranker",
        action="reduce top-k or add a reranker to force selectivity before generation",
        render_hint="emphasize that the retriever couldn't decide what was relevant",
    ),
    RecommendationRule(
        rule_id="R02",
        root_cause="decisive_retrieval_generation_overstep",
        pipeline_component="chunk size",
        action="increase chunk size or use overlapping windows — the model is filling gaps with parametric knowledge",
        render_hint="emphasize that retrieval worked but chunks were too short",
    ),
    RecommendationRule(
        rule_id="R03",
        root_cause="decisive_retrieval_wrong_content",
        pipeline_component="chunk boundaries and embedding model",
        action="review chunk boundaries — relevant information may be split across chunks, or the embedding model may not suit this domain",
        render_hint="emphasize that the retriever was confident but retrieved the wrong thing",
    ),
    RecommendationRule(
        rule_id="R04",
        root_cause="overconfident_generation_on_weak_evidence",
        pipeline_component="prompt template",
        action="update the generation prompt to instruct the model to hedge when retrieved context is ambiguous",
        render_hint="emphasize that this is a prompt engineering problem, not a retrieval problem",
    ),
    RecommendationRule(
        rule_id="R05",
        root_cause="noisy_context_reaching_generator",
        pipeline_component="similarity threshold and top-k",
        action="raise the similarity threshold or reduce top-k to cut low-relevance chunks before generation",
        render_hint="emphasize that too much borderline content is diluting the signal",
    ),
    RecommendationRule(
        rule_id="R06",
        root_cause="underconfident_generation_on_strong_evidence",
        pipeline_component="prompt template",
        action="update the generation prompt to allow confident assertion when retrieved context directly supports a claim",
        render_hint="emphasize that this erodes user trust unnecessarily",
    ),
    RecommendationRule(
        rule_id="R07",
        root_cause="pipeline_healthy",
        pipeline_component="none",
        action="no changes indicated",
        render_hint="emphasize what's working — decisive retrieval, faithful generation, calibrated confidence",
    ),
    RecommendationRule(
        rule_id="R08",
        root_cause="query_phrasing_mismatch",
        pipeline_component="user query",
        action="rephrase the question — the corpus contains relevant information but the query didn't match the embedding space",
        render_hint="emphasize that the system can answer something close to what they asked",
    ),
    RecommendationRule(
        rule_id="R09",
        root_cause="corpus_coverage_gap",
        pipeline_component="knowledge base",
        action="the corpus does not appear to cover this topic — review the suggested questions to understand what the system can answer",
        render_hint="emphasize that this is a data coverage problem, not a query problem",
    ),
]


def get_rule(rule_id: str) -> RecommendationRule:
    return next(r for r in RECOMMENDATION_RULES if r.rule_id == rule_id)
```
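As a usage sketch (rule table trimmed to two entries for brevity), the lookup can also be done through a dict built once at import time. This avoids rescanning the list on every call and raises a `KeyError` for unknown ids, which is clearer than the bare `StopIteration` the `next(...)` version produces:

```python
from dataclasses import dataclass

@dataclass
class RecommendationRule:
    rule_id: str
    root_cause: str
    pipeline_component: str
    action: str
    render_hint: str

# Trimmed to two entries for illustration; the real table has nine.
RECOMMENDATION_RULES = [
    RecommendationRule(
        rule_id="R01",
        root_cause="ambiguous_retrieval_weak_generation",
        pipeline_component="top-k and reranker",
        action="reduce top-k or add a reranker to force selectivity before generation",
        render_hint="emphasize that the retriever couldn't decide what was relevant",
    ),
    RecommendationRule(
        rule_id="R07",
        root_cause="pipeline_healthy",
        pipeline_component="none",
        action="no changes indicated",
        render_hint="emphasize what's working",
    ),
]

# Index once; unknown ids raise KeyError instead of StopIteration.
RULES_BY_ID = {r.rule_id: r for r in RECOMMENDATION_RULES}

def get_rule(rule_id: str) -> RecommendationRule:
    return RULES_BY_ID[rule_id]
```

Either form works; the dict variant just fails louder when a branch in `match_rule()` references a rule id that was never added to the table.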
Rule matching logic
```python
# services/verdict_generator.py
def match_rule(
    distribution: RetrievalDistributionMetrics,
    embedding: EmbeddingSpaceMetrics,
    faithfulness_score: float,
    attribution: ChunkAttributionMetrics,
    hedging_mismatch: HedgingMismatchMetrics,
    query_fit: QueryCorpusFitMetrics,
) -> RecommendationRule:
    overconfident = hedging_mismatch.overconfident_fraction > 0.2
    underconfident = hedging_mismatch.underconfident_fraction > 0.2

    # R08: query phrasing mismatch — corpus has it, query didn't land right
    if query_fit.triggered and query_fit.mismatch_type == "query_mismatch":
        return get_rule("R08")

    # R09: coverage gap — corpus genuinely doesn't cover the topic
    if query_fit.triggered and query_fit.mismatch_type == "coverage_gap":
        return get_rule("R09")

    # R01: ambiguous retrieval + weak generation,
    # reinforced by high chunk_spread from embedding analysis
    if distribution.score_entropy > 1.5 and faithfulness_score < 0.6:
        return get_rule("R01")
    if distribution.score_entropy > 1.5 and embedding.chunk_spread > 0.3:
        return get_rule("R01")

    # R02: decisive retrieval + generation overstepped chunks
    if distribution.score_gap > 0.2 and attribution.unattributed_fraction > 0.25:
        return get_rule("R02")

    # R03: decisive retrieval but wrong content,
    # reinforced by high query_isolation from embedding analysis
    if (
        distribution.score_gap > 0.2
        and faithfulness_score < 0.5
        and attribution.unattributed_fraction < 0.25
    ):
        return get_rule("R03")
    if (
        distribution.score_gap > 0.2
        and embedding.query_isolation > 1.2
        and faithfulness_score < 0.6
    ):
        return get_rule("R03")

    # R04: flat distribution + overconfident generation
    if distribution.decay_rate < 0.1 and overconfident:
        return get_rule("R04")

    # R05: noisy context reaching generator
    if distribution.tail_mass > 0.4 and attribution.weak_match_fraction > 0.5:
        return get_rule("R05")

    # R06: underconfident generation on strong retrieval
    if distribution.score_gap > 0.15 and faithfulness_score > 0.75 and underconfident:
        return get_rule("R06")

    # R07: healthy pipeline — always the fallback
    return get_rule("R07")
```
Important: the thresholds above (1.5, 0.3, 0.2, 0.25, 0.5, 1.2, 0.6, 0.1, 0.4, 0.15, 0.75) are starting points. Calibrate empirically against RAGBench examples as part of this issue. Document what you tried and why in a comment next to each threshold — that calibration story is valuable in the README and in interviews.
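The calibration pass can be a small sweep harness. Everything here is a hypothetical stand-in: the labeled pairs represent hand-labeled RAGBench cases, and `matched_rule` mirrors only the first R01 branch of `match_rule` so the sweep stays readable:

```python
# Hypothetical labeled examples: (metric values, expected rule id).
# In the real calibration these come from hand-labeled RAGBench cases.
labeled = [
    ({"score_entropy": 1.8, "faithfulness": 0.4}, "R01"),
    ({"score_entropy": 0.6, "faithfulness": 0.9}, "R07"),
    ({"score_entropy": 1.3, "faithfulness": 0.5}, "R01"),
]

def matched_rule(m, entropy_threshold):
    # Simplified stand-in for match_rule(): only the R01-vs-R07 decision.
    if m["score_entropy"] > entropy_threshold and m["faithfulness"] < 0.6:
        return "R01"
    return "R07"

def sweep(threshold_candidates):
    # Agreement rate with the labels at each candidate threshold.
    results = {}
    for t in threshold_candidates:
        hits = sum(matched_rule(m, t) == expected for m, expected in labeled)
        results[t] = hits / len(labeled)
    return results

accuracies = sweep([1.0, 1.5, 2.0])
best = max(accuracies, key=accuracies.get)
# Record the winning value and the agreement rate in the threshold comment.
```

The same loop generalizes to any threshold in the tree: freeze the others, sweep one, and keep the value with the best agreement against the labels.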
Stage 2: LLM rendering prompts
```python
# prompts/verdict_prompts.py
RECOMMENDATION_RENDER_PROMPT = """
You are writing one sentence for a RAG diagnostic report.
The RAG system has been analyzed. Here are the key signals:
- Retrieval score entropy: {score_entropy:.2f} (higher = more ambiguous retrieval)
- Score gap (top vs second chunk): {score_gap:.2f} (higher = more decisive retrieval)
- Decay rate: {decay_rate:.2f} (higher = steeper relevance drop-off)
- Tail mass: {tail_mass:.2f} (higher = more low-relevance content reaching generator)
- Centroid distance: {centroid_distance:.2f} (higher = query geometrically far from retrieved content)
- Chunk spread: {chunk_spread:.2f} (higher = retrieved chunks from different semantic regions)
- Query isolation: {query_isolation:.2f} (> 1.0 = query more isolated than chunks are from each other)
- Answer faithfulness: {faithfulness_score:.2f}
- Unattributed fraction: {unattributed_fraction:.2f} (fraction of answer not traceable to any chunk)
- Weak match fraction: {weak_match_fraction:.2f} (fraction of answer loosely but not strongly grounded)
- Overconfident claim fraction: {overconfident_fraction:.2f}
- Underconfident claim fraction: {underconfident_fraction:.2f}
Root cause identified: {root_cause}
Pipeline component to address: {pipeline_component}
Recommended action: {action}
Emphasis: {render_hint}
Write exactly one sentence that:
1. Names what the signals show is happening in the pipeline
2. Names the specific component to fix
3. States the specific action to take
Do not use hedging language. Do not say "may" or "might". Be direct and specific.
Maximum 50 words.
Example of the right register: "Your retrieval is decisive but generation is going beyond retrieved content — increase chunk size or use overlapping windows to give the model more grounding material."
"""

DIMENSION_EXPLANATION_PROMPT = """
Write one plain-English sentence explaining this RAG evaluation signal to a non-technical stakeholder.
Signal: {dimension_name}
Value: {metric_value}
What it measures: {what_it_measures}
Do not use ML jargon. Do not mention scores or numbers unless essential.
Maximum 30 words.
"""
```
Implementation details
- Create services/verdict_generator.py with match_rule() and render_recommendation(rule, all_metrics) -> str
- Create prompts/verdict_prompts.py with prompt constants
- Create prompts/recommendation_rules.py with RECOMMENDATION_RULES and get_rule()
- Wire into /analyze: after all forensics run, call match_rule() → render_recommendation() → add to AnalyzeResponse
- Add recommendation: str and rule_id: str to AnalyzeResponse so the frontend can surface it prominently at the top of the diagnostic card
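The response-model change might look like the following sketch. A plain dataclass stands in for the real AnalyzeResponse, whose existing forensics fields are elided here:

```python
from dataclasses import dataclass

@dataclass
class AnalyzeResponse:
    # ...existing forensics metric fields elided...
    recommendation: str  # rendered one-sentence recommendation, shown first
    rule_id: str         # e.g. "R03", lets the frontend map back to the rule
```

Carrying rule_id alongside the rendered sentence means the frontend can attach stable styling or docs links per rule without parsing the prose.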
Acceptance criteria
- match_rule() accepts all forensics metric objects (including QueryCorpusFitMetrics) and returns a RecommendationRule
- match_rule() always returns R07 when all signals are healthy — never crashes or returns None
- Every rule in RECOMMENDATION_RULES is reachable by some combination of inputs
- R08 fires when query_fit.triggered=True and mismatch_type="query_mismatch"
- R09 fires when query_fit.triggered=True and mismatch_type="coverage_gap"
- render_recommendation() returns a string under 50 words
- All prompts live in prompts/verdict_prompts.py — no inline strings
- All rules live in prompts/recommendation_rules.py — no inline strings
- recommendation and rule_id fields added to AnalyzeResponse
- Claude failure falls back to the action string from the matched rule — never crashes
- All thresholds in match_rule() have a comment documenting the empirical basis
TDD approach
tests/test_verdict_generator.py
Construct synthetic metric objects directly — no real forensics calls needed.
Tests to write before implementation:
- High entropy + low faithfulness → R01
- High entropy + high chunk_spread → R01
- High score_gap + high unattributed_fraction → R02
- High score_gap + low faithfulness + low unattributed_fraction → R03
- High score_gap + high query_isolation + low faithfulness → R03
- Low decay_rate + high overconfident_fraction → R04
- High tail_mass + high weak_match_fraction → R05
- High score_gap + high faithfulness + high underconfident_fraction → R06
- All signals healthy → R07
- query_fit triggered + mismatch_type="query_mismatch" → R08
- query_fit triggered + mismatch_type="coverage_gap" → R09
- R08/R09 checked before R01 (triggered query_fit with entropy > 1.5 → R08/R09, not R01)
- render_recommendation() returns string under 50 words
- Claude failure → falls back to rule.action, no exception
- No input combination causes match_rule() to crash or return None
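A first test from the list above could be sketched like this, with SimpleNamespace standing in for the real metric dataclasses and a two-branch stand-in for match_rule (the real tests would import the full function and assert on rule_id):

```python
from types import SimpleNamespace

def match_rule_sketch(distribution, faithfulness_score):
    # Stand-in covering only the R01 and R07 branches of match_rule().
    if distribution.score_entropy > 1.5 and faithfulness_score < 0.6:
        return "R01"
    return "R07"

def test_high_entropy_low_faithfulness_matches_r01():
    distribution = SimpleNamespace(score_entropy=2.1, score_gap=0.05,
                                   decay_rate=0.3, tail_mass=0.2)
    assert match_rule_sketch(distribution, faithfulness_score=0.4) == "R01"

def test_healthy_signals_fall_through_to_r07():
    distribution = SimpleNamespace(score_entropy=0.7, score_gap=0.3,
                                   decay_rate=0.5, tail_mass=0.1)
    assert match_rule_sketch(distribution, faithfulness_score=0.9) == "R07"
```

Because the metric objects are plain attribute bags in the tests, each case can set exactly the signals its branch reads and leave the rest at healthy defaults.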