Issue 9: Verdict-generation and recommendation prompt layer
Description
Translate all forensics outputs into actionable pipeline recommendations. This is the layer that makes the project legible to a stakeholder who doesn't know what faithfulness scores mean.
The architecture is two-stage:
Stage 1 — Rule-based recommendation matching: A deterministic decision tree matches the combined forensics signal pattern to a root cause category and a named pipeline component. This is Python logic, not an LLM call.
Stage 2 — LLM rendering: Claude receives the matched rule, the raw metric values, and a rendering prompt, and writes one specific, concrete recommendation sentence. The LLM's job is fluency and specificity — not reasoning about what's wrong.
This separation matters. If Claude reasons from scratch each time, recommendations are inconsistent and sometimes wrong. If the decision tree matches the pattern and Claude only renders it, recommendations are reliable and the logic is inspectable.
Stage 1: Recommendation matrix
```python
# prompts/recommendation_rules.py
from dataclasses import dataclass


@dataclass
class RecommendationRule:
    rule_id: str
    root_cause: str
    pipeline_component: str  # what to fix
    action: str              # how to fix it
    render_hint: str         # what to tell Claude to emphasize


RECOMMENDATION_RULES = [
    RecommendationRule(
        rule_id="R01",
        root_cause="ambiguous_retrieval_weak_generation",
        pipeline_component="top-k and reranker",
        action="reduce top-k or add a reranker to force selectivity before generation",
        render_hint="emphasize that the retriever couldn't decide what was relevant",
    ),
    RecommendationRule(
        rule_id="R02",
        root_cause="decisive_retrieval_generation_overstep",
        pipeline_component="chunk size",
        action="increase chunk size or use overlapping windows — the model is filling gaps with parametric knowledge",
        render_hint="emphasize that retrieval worked but chunks were too short",
    ),
    RecommendationRule(
        rule_id="R03",
        root_cause="decisive_retrieval_wrong_content",
        pipeline_component="chunk boundaries and embedding model",
        action="review chunk boundaries — relevant information may be split across chunks, or the embedding model may not suit this domain",
        render_hint="emphasize that the retriever was confident but retrieved the wrong thing",
    ),
    RecommendationRule(
        rule_id="R04",
        root_cause="overconfident_generation_on_weak_evidence",
        pipeline_component="prompt template",
        action="update the generation prompt to instruct the model to hedge when retrieved context is ambiguous",
        render_hint="emphasize that this is a prompt engineering problem, not a retrieval problem",
    ),
    RecommendationRule(
        rule_id="R05",
        root_cause="noisy_context_reaching_generator",
        pipeline_component="similarity threshold and top-k",
        action="raise the similarity threshold or reduce top-k to cut low-relevance chunks before generation",
        render_hint="emphasize that too much borderline content is diluting the signal",
    ),
    RecommendationRule(
        rule_id="R06",
        root_cause="underconfident_generation_on_strong_evidence",
        pipeline_component="prompt template",
        action="update the generation prompt to allow confident assertion when retrieved context directly supports a claim",
        render_hint="emphasize that this erodes user trust unnecessarily",
    ),
    RecommendationRule(
        rule_id="R07",
        root_cause="pipeline_healthy",
        pipeline_component="none",
        action="no changes indicated",
        render_hint="emphasize what's working — decisive retrieval, faithful generation, calibrated confidence",
    ),
    RecommendationRule(
        rule_id="R08",
        root_cause="query_phrasing_mismatch",
        pipeline_component="user query",
        action="rephrase the question — the corpus contains relevant information but the query didn't match the embedding space",
        render_hint="emphasize that the system can answer something close to what they asked",
    ),
    RecommendationRule(
        rule_id="R09",
        root_cause="corpus_coverage_gap",
        pipeline_component="knowledge base",
        action="the corpus does not appear to cover this topic — review the suggested questions to understand what the system can answer",
        render_hint="emphasize that this is a data coverage problem, not a query problem",
    ),
]


def get_rule(rule_id: str) -> RecommendationRule:
    return next(r for r in RECOMMENDATION_RULES if r.rule_id == rule_id)
```
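As a usage sketch (rule table trimmed to two entries for brevity), the lookup can also be done through a dict built once at import time. This avoids rescanning the list on every call and raises a `KeyError` for unknown ids, which is clearer than the bare `StopIteration` the `next(...)` version produces:

```python
from dataclasses import dataclass

@dataclass
class RecommendationRule:
    rule_id: str
    root_cause: str
    pipeline_component: str
    action: str
    render_hint: str

# Trimmed to two entries for illustration; the real table has nine.
RECOMMENDATION_RULES = [
    RecommendationRule(
        rule_id="R01",
        root_cause="ambiguous_retrieval_weak_generation",
        pipeline_component="top-k and reranker",
        action="reduce top-k or add a reranker to force selectivity before generation",
        render_hint="emphasize that the retriever couldn't decide what was relevant",
    ),
    RecommendationRule(
        rule_id="R07",
        root_cause="pipeline_healthy",
        pipeline_component="none",
        action="no changes indicated",
        render_hint="emphasize what's working",
    ),
]

# Index once; unknown ids raise KeyError instead of StopIteration.
RULES_BY_ID = {r.rule_id: r for r in RECOMMENDATION_RULES}

def get_rule(rule_id: str) -> RecommendationRule:
    return RULES_BY_ID[rule_id]
```

Either form works; the dict variant just fails louder when a branch in `match_rule()` references a rule id that was never added to the table.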
Rule matching logic
```python
# services/verdict_generator.py
def match_rule(
    distribution: RetrievalDistributionMetrics,
    embedding: EmbeddingSpaceMetrics,
    faithfulness_score: float,
    attribution: ChunkAttributionMetrics,
    hedging_mismatch: HedgingMismatchMetrics,
    query_fit: QueryCorpusFitMetrics,
) -> RecommendationRule:
    overconfident = hedging_mismatch.overconfident_fraction > 0.2
    underconfident = hedging_mismatch.underconfident_fraction > 0.2

    # R08: query phrasing mismatch — corpus has it, query didn't land right
    if query_fit.triggered and query_fit.mismatch_type == "query_mismatch":
        return get_rule("R08")

    # R09: coverage gap — corpus genuinely doesn't cover the topic
    if query_fit.triggered and query_fit.mismatch_type == "coverage_gap":
        return get_rule("R09")

    # R01: ambiguous retrieval + weak generation,
    # reinforced by high chunk_spread from embedding analysis
    if distribution.score_entropy > 1.5 and faithfulness_score < 0.6:
        return get_rule("R01")
    if distribution.score_entropy > 1.5 and embedding.chunk_spread > 0.3:
        return get_rule("R01")

    # R02: decisive retrieval + generation overstepped chunks
    if distribution.score_gap > 0.2 and attribution.unattributed_fraction > 0.25:
        return get_rule("R02")

    # R03: decisive retrieval but wrong content,
    # reinforced by high query_isolation from embedding analysis
    if (
        distribution.score_gap > 0.2
        and faithfulness_score < 0.5
        and attribution.unattributed_fraction < 0.25
    ):
        return get_rule("R03")
    if (
        distribution.score_gap > 0.2
        and embedding.query_isolation > 1.2
        and faithfulness_score < 0.6
    ):
        return get_rule("R03")

    # R04: flat distribution + overconfident generation
    if distribution.decay_rate < 0.1 and overconfident:
        return get_rule("R04")

    # R05: noisy context reaching generator
    if distribution.tail_mass > 0.4 and attribution.weak_match_fraction > 0.5:
        return get_rule("R05")

    # R06: underconfident generation on strong retrieval
    if distribution.score_gap > 0.15 and faithfulness_score > 0.75 and underconfident:
        return get_rule("R06")

    # R07: healthy pipeline — always the fallback
    return get_rule("R07")
```
Important: the thresholds above (1.5, 0.3, 0.2, 0.25, 0.5, 1.2, 0.6, 0.1, 0.4, 0.15, 0.75) are starting points. Calibrate empirically against RAGBench examples as part of this issue. Document what you tried and why in a comment next to each threshold — that calibration story is valuable in the README and in interviews.
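The calibration pass can be a small sweep harness. Everything here is a hypothetical stand-in: the labeled pairs represent hand-labeled RAGBench cases, and `matched_rule` mirrors only the first R01 branch of `match_rule` so the sweep stays readable:

```python
# Hypothetical labeled examples: (metric values, expected rule id).
# In the real calibration these come from hand-labeled RAGBench cases.
labeled = [
    ({"score_entropy": 1.8, "faithfulness": 0.4}, "R01"),
    ({"score_entropy": 0.6, "faithfulness": 0.9}, "R07"),
    ({"score_entropy": 1.3, "faithfulness": 0.5}, "R01"),
]

def matched_rule(m, entropy_threshold):
    # Simplified stand-in for match_rule(): only the R01-vs-R07 decision.
    if m["score_entropy"] > entropy_threshold and m["faithfulness"] < 0.6:
        return "R01"
    return "R07"

def sweep(threshold_candidates):
    # Agreement rate with the labels at each candidate threshold.
    results = {}
    for t in threshold_candidates:
        hits = sum(matched_rule(m, t) == expected for m, expected in labeled)
        results[t] = hits / len(labeled)
    return results

accuracies = sweep([1.0, 1.5, 2.0])
best = max(accuracies, key=accuracies.get)
# Record the winning value and the agreement rate in the threshold comment.
```

The same loop generalizes to any threshold in the tree: freeze the others, sweep one, and keep the value with the best agreement against the labels.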
Stage 2: LLM rendering prompts
```python
# prompts/verdict_prompts.py
RECOMMENDATION_RENDER_PROMPT = """
You are writing one sentence for a RAG diagnostic report.
The RAG system has been analyzed. Here are the key signals:
- Retrieval score entropy: {score_entropy:.2f} (higher = more ambiguous retrieval)
- Score gap (top vs second chunk): {score_gap:.2f} (higher = more decisive retrieval)
- Decay rate: {decay_rate:.2f} (higher = steeper relevance drop-off)
- Tail mass: {tail_mass:.2f} (higher = more low-relevance content reaching generator)
- Centroid distance: {centroid_distance:.2f} (higher = query geometrically far from retrieved content)
- Chunk spread: {chunk_spread:.2f} (higher = retrieved chunks from different semantic regions)
- Query isolation: {query_isolation:.2f} (> 1.0 = query more isolated than chunks are from each other)
- Answer faithfulness: {faithfulness_score:.2f}
- Unattributed fraction: {unattributed_fraction:.2f} (fraction of answer not traceable to any chunk)
- Weak match fraction: {weak_match_fraction:.2f} (fraction of answer loosely but not strongly grounded)
- Overconfident claim fraction: {overconfident_fraction:.2f}
- Underconfident claim fraction: {underconfident_fraction:.2f}
Root cause identified: {root_cause}
Pipeline component to address: {pipeline_component}
Recommended action: {action}
Emphasis: {render_hint}
Write exactly one sentence that:
1. Names what the signals show is happening in the pipeline
2. Names the specific component to fix
3. States the specific action to take
Do not use hedging language. Do not say "may" or "might". Be direct and specific.
Maximum 50 words.
Example of the right register: "Your retrieval is decisive but generation is going beyond retrieved content — increase chunk size or use overlapping windows to give the model more grounding material."
"""

DIMENSION_EXPLANATION_PROMPT = """
Write one plain-English sentence explaining this RAG evaluation signal to a non-technical stakeholder.
Signal: {dimension_name}
Value: {metric_value}
What it measures: {what_it_measures}
Do not use ML jargon. Do not mention scores or numbers unless essential.
Maximum 30 words.
"""
```
Implementation details
- Create services/verdict_generator.py with match_rule() and render_recommendation(rule, all_metrics) -> str
- Create prompts/verdict_prompts.py with prompt constants
- Create prompts/recommendation_rules.py with RECOMMENDATION_RULES and get_rule()
- Wire into /analyze: after all forensics run, call match_rule() → render_recommendation() → add to AnalyzeResponse
- Add recommendation: str and rule_id: str to AnalyzeResponse so the frontend can surface it prominently at the top of the diagnostic card
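The response-model change might look like the following sketch. A plain dataclass stands in for the real AnalyzeResponse, whose existing forensics fields are elided here:

```python
from dataclasses import dataclass

@dataclass
class AnalyzeResponse:
    # ...existing forensics metric fields elided...
    recommendation: str  # rendered one-sentence recommendation, shown first
    rule_id: str         # e.g. "R03", lets the frontend map back to the rule
```

Carrying rule_id alongside the rendered sentence means the frontend can attach stable styling or docs links per rule without parsing the prose.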
Acceptance criteria
- match_rule() accepts all forensics metric objects (including QueryCorpusFitMetrics) and returns a RecommendationRule
- match_rule() always returns R07 when all signals are healthy — never crashes or returns None
- Every rule in RECOMMENDATION_RULES is reachable by some combination of inputs
- R08 fires when query_fit.triggered=True and mismatch_type="query_mismatch"
- R09 fires when query_fit.triggered=True and mismatch_type="coverage_gap"
- render_recommendation() returns a string under 50 words
- All prompts live in prompts/verdict_prompts.py — no inline strings
- All rules live in prompts/recommendation_rules.py — no inline strings
- recommendation and rule_id fields added to AnalyzeResponse
- Claude failure falls back to the action string from the matched rule — never crashes
- All thresholds in match_rule() have a comment documenting the empirical basis
TDD approach
tests/test_verdict_generator.py
Construct synthetic metric objects directly — no real forensics calls needed.
Tests to write before implementation:
- High entropy + low faithfulness → R01
- High entropy + high chunk_spread → R01
- High score_gap + high unattributed_fraction → R02
- High score_gap + low faithfulness + low unattributed_fraction → R03
- High score_gap + high query_isolation + low faithfulness → R03
- Low decay_rate + high overconfident_fraction → R04
- High tail_mass + high weak_match_fraction → R05
- High score_gap + high faithfulness + high underconfident_fraction → R06
- All signals healthy → R07
- query_fit triggered + mismatch_type="query_mismatch" → R08
- query_fit triggered + mismatch_type="coverage_gap" → R09
- R08/R09 checked before R01 (triggered query_fit with entropy > 1.5 → R08/R09, not R01)
- render_recommendation() returns string under 50 words
- Claude failure → falls back to rule.action, no exception
- No input combination causes match_rule() to crash or return None
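A first test from the list above could be sketched like this, with SimpleNamespace standing in for the real metric dataclasses and a two-branch stand-in for match_rule (the real tests would import the full function and assert on rule_id):

```python
from types import SimpleNamespace

def match_rule_sketch(distribution, faithfulness_score):
    # Stand-in covering only the R01 and R07 branches of match_rule().
    if distribution.score_entropy > 1.5 and faithfulness_score < 0.6:
        return "R01"
    return "R07"

def test_high_entropy_low_faithfulness_matches_r01():
    distribution = SimpleNamespace(score_entropy=2.1, score_gap=0.05,
                                   decay_rate=0.3, tail_mass=0.2)
    assert match_rule_sketch(distribution, faithfulness_score=0.4) == "R01"

def test_healthy_signals_fall_through_to_r07():
    distribution = SimpleNamespace(score_entropy=0.7, score_gap=0.3,
                                   decay_rate=0.5, tail_mass=0.1)
    assert match_rule_sketch(distribution, faithfulness_score=0.9) == "R07"
```

Because the metric objects are plain attribute bags in the tests, each case can set exactly the signals its branch reads and leave the rest at healthy defaults.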