Audit all LLM prompts — evaluate quality, constraints, and deterministic offload opportunities #16

Description

Question to answer

Which prompts in backend/prompts/ have reliability or quality risks, and where can LLM work be replaced by deterministic logic?

Background

As of issue #9, the system has five active LLM-backed prompts across backend/prompts/. Quality of generated output (especially suggested questions in query_corpus_fit) is currently unconstrained at the code level — if the model returns generic or unhelpful questions, the mismatch classification still proceeds normally. No systematic review of prompt design has been done. This investigation catalogs every prompt, assesses reliability risks, and identifies where LLM work can be replaced by deterministic logic.

Motivating example: query_corpus_fit.py:64–80 — question generation quality is entirely LLM-dependent; poor questions still allow mismatch classification to proceed without any code-level signal.
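
For concreteness, a minimal sketch of what such a code-level signal could look like; the function name and thresholds are hypothetical, not existing repo code:

```python
# Hypothetical sketch only: a cheap sanity gate on generated questions before
# mismatch classification proceeds. Name and thresholds are assumptions.
def questions_look_usable(questions: list[str], min_len: int = 15) -> bool:
    if len(questions) < 3:
        return False  # too few questions to support a mismatch classification
    if any(len(q.strip()) < min_len for q in questions):
        return False  # very short questions tend to be generic
    if len({q.strip().lower() for q in questions}) < len(questions):
        return False  # exact duplicates signal degenerate generation
    return True
```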

This issue is intentionally scheduled after #9 so that verdict_prompts.py and calibration_prompts.py are implemented before the audit runs.

Scope

In scope:

  • Auditing the five active prompt files listed under "What to investigate" below

NOT in scope:

  • Implementing any prompt changes (each finding → separate follow-on issue)
  • Retry/parse-robustness fixes to calling code (separate bug)
  • Automated eval harness or benchmark; manual sampling is sufficient for now

What to investigate

  • generation_prompts.py — GENERATION_SYSTEM_PROMPT + build_generation_prompt(), used in services/generator.py; output is free-form prose
  • hedging_prompts.py — CLAIM_EXTRACTION_PROMPT (JSON array) + ENTAILMENT_PROMPT (binary keyword), used in services/forensics/hedging_mismatch.py
  • query_fit_prompts.py — build_question_generation_prompt() (JSON array of strings), used in services/forensics/query_corpus_fit.py
  • verdict_prompts.py — stub filled by #9; review design before hardening
  • calibration_prompts.py — stub filled by #9; review design before hardening
  • (recommendation_rules.py is data-only, no LLM calls — skip)

Prime offload candidates to assess:

  • ENTAILMENT_PROMPT — chunk/claim cosine similarity (already computed by chunk_attribution) may substitute for binary LLM entailment
  • CLAIM_EXTRACTION_PROMPT — spaCy sentence splitting could replace or augment; hedging classification is already deterministic (a sketch of both substitutions follows this list)
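
A minimal sketch of both substitutions, assuming the embeddings are reused from chunk_attribution; the function names, the spaCy model, and the 0.75 threshold are placeholders to validate during the audit, not existing repo code:

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def split_claims(answer_text: str) -> list[str]:
    # Candidate replacement for CLAIM_EXTRACTION_PROMPT: one claim per sentence.
    return [s.text.strip() for s in nlp(answer_text).sents if s.text.strip()]

def entails(claim_vec: np.ndarray, chunk_vec: np.ndarray,
            threshold: float = 0.75) -> bool:
    # Candidate replacement for ENTAILMENT_PROMPT: reuse the embeddings that
    # chunk_attribution already computes and threshold their cosine similarity.
    cos = float(np.dot(claim_vec, chunk_vec)
                / (np.linalg.norm(claim_vec) * np.linalg.norm(chunk_vec)))
    return cos >= threshold
```

The trade-off to record in the audit: cosine similarity captures topical overlap, not logical entailment, so a contradicting claim that reuses the chunk's vocabulary could still pass the threshold.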

How to evaluate

  • Format reliability: Is the output format constraint tight enough to parse without a bare except? What failure modes exist? (See the parser sketch after this list.)
  • Instruction quality: Do the instructions leave room for generic, ambiguous, or unhelpful output? Collect 3–5 sample outputs per prompt to ground the assessment.
  • Deterministic offload: Could this step be done more cheaply/reliably without an LLM? What is lost, what is gained?
  • Stub readiness: For verdict and calibration prompts, does the design align with the intent in CLAUDE.md and issue #9?
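
As a reference point for the format-reliability check, a minimal sketch of what a "tight enough to parse" contract could look like; parse_json_array is a hypothetical helper, not code in this repo:

```python
import json

def parse_json_array(raw: str) -> list[str] | None:
    # Hypothetical strict parser: accept only a JSON array of non-empty
    # strings, and return None (an explicit failure signal) on anything else
    # rather than swallowing errors with a bare except.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON: a failure mode worth counting in the audit
    if not isinstance(data, list) or not all(
        isinstance(x, str) and x.strip() for x in data
    ):
        return None  # wrong shape: a second, distinct failure mode
    return [x.strip() for x in data]
```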

Report format

docs/prompt-audit.md with one section per prompt file, each containing:

  • Purpose: what the prompt asks the LLM to do
  • Call site: file and line range
  • Output contract: expected format and current parse handling
  • Sample outputs: 3–5 representative examples (manual runs)
  • Risks: format reliability issues or quality gaps observed
  • Recommendation: proposed rewrite snippet OR deterministic alternative OR "no change needed", with rationale

Followed by a summary table and a list of follow-on issues filed.

Definition of done

  • All items in "What to investigate" have been addressed
  • Findings are written up in docs/prompt-audit.md in the format specified above
  • A clear recommendation or conclusion is stated for each prompt
  • A follow-up issue is filed for every actionable finding
