Audit all LLM prompts — evaluate quality, constraints, and deterministic offload opportunities #16

Description

Question to answer

Which prompts in backend/prompts/ have reliability or quality risks, and where can LLM work be replaced by deterministic logic?

Background

As of issue #9, the system has five active LLM-backed prompts across backend/prompts/. Quality of generated output (especially suggested questions in query_corpus_fit) is currently unconstrained at the code level — if the model returns generic or unhelpful questions, the mismatch classification still proceeds normally. No systematic review of prompt design has been done. This investigation catalogs every prompt, assesses reliability risks, and identifies where LLM work can be replaced by deterministic logic.

Motivating example: query_corpus_fit.py:64–80 — question generation quality is entirely LLM-dependent; poor questions still allow mismatch classification to proceed without any code-level signal.
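
For concreteness, a minimal sketch of what such a code-level signal could look like; the function name and thresholds are hypothetical, not existing repo code:

```python
# Hypothetical sketch only: a cheap sanity gate on generated questions before
# mismatch classification proceeds. Name and thresholds are assumptions.
def questions_look_usable(questions: list[str], min_len: int = 15) -> bool:
    if len(questions) < 3:
        return False  # too few questions to support a mismatch classification
    if any(len(q.strip()) < min_len for q in questions):
        return False  # very short questions tend to be generic
    if len({q.strip().lower() for q in questions}) < len(questions):
        return False  # exact duplicates signal degenerate generation
    return True
```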

This issue is intentionally scheduled after #9 so that verdict_prompts.py and calibration_prompts.py are implemented before the audit runs.

Scope

In scope:

  • Auditing the five active prompt files listed under "What to investigate" below

NOT in scope:

  • Implementing any prompt changes (each finding → separate follow-on issue)
  • Retry/parse-robustness fixes to calling code (separate bug)
  • Automated eval harness or benchmark; manual sampling is sufficient for now

What to investigate

  • generation_prompts.py — GENERATION_SYSTEM_PROMPT + build_generation_prompt(), used in services/generator.py; output is free-form prose
  • hedging_prompts.py — CLAIM_EXTRACTION_PROMPT (JSON array) + ENTAILMENT_PROMPT (binary keyword), used in services/forensics/hedging_mismatch.py
  • query_fit_prompts.py — build_question_generation_prompt() (JSON array of strings), used in services/forensics/query_corpus_fit.py
  • verdict_prompts.py — stub filled by #9; review design before hardening
  • calibration_prompts.py — stub filled by #9; review design before hardening
  • (recommendation_rules.py is data-only, no LLM calls — skip)

Prime offload candidates to assess:

  • ENTAILMENT_PROMPT — chunk/claim cosine similarity (already computed by chunk_attribution) may substitute for binary LLM entailment
  • CLAIM_EXTRACTION_PROMPT — spaCy sentence splitting could replace or augment; hedging classification is already deterministic (a sketch of both substitutions follows this list)
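
A minimal sketch of both substitutions, assuming the embeddings are reused from chunk_attribution; the function names, the spaCy model, and the 0.75 threshold are placeholders to validate during the audit, not existing repo code:

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def split_claims(answer_text: str) -> list[str]:
    # Candidate replacement for CLAIM_EXTRACTION_PROMPT: one claim per sentence.
    return [s.text.strip() for s in nlp(answer_text).sents if s.text.strip()]

def entails(claim_vec: np.ndarray, chunk_vec: np.ndarray,
            threshold: float = 0.75) -> bool:
    # Candidate replacement for ENTAILMENT_PROMPT: reuse the embeddings that
    # chunk_attribution already computes and threshold their cosine similarity.
    cos = float(np.dot(claim_vec, chunk_vec)
                / (np.linalg.norm(claim_vec) * np.linalg.norm(chunk_vec)))
    return cos >= threshold
```

The trade-off to record in the audit: cosine similarity captures topical overlap, not logical entailment, so a contradicting claim that reuses the chunk's vocabulary could still pass the threshold.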

How to evaluate

  • Format reliability: Is the output format constraint tight enough to parse without a bare except? What failure modes exist? (See the parser sketch after this list.)
  • Instruction quality: Do the instructions leave room for generic, ambiguous, or unhelpful output? Collect 3–5 sample outputs per prompt to ground the assessment.
  • Deterministic offload: Could this step be done more cheaply/reliably without an LLM? What is lost, what is gained?
  • Stub readiness: For verdict and calibration prompts, does the design align with the intent in CLAUDE.md and issue #9?
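
As a reference point for the format-reliability check, a minimal sketch of what a "tight enough to parse" contract could look like; parse_json_array is a hypothetical helper, not code in this repo:

```python
import json

def parse_json_array(raw: str) -> list[str] | None:
    # Hypothetical strict parser: accept only a JSON array of non-empty
    # strings, and return None (an explicit failure signal) on anything else
    # rather than swallowing errors with a bare except.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON: a failure mode worth counting in the audit
    if not isinstance(data, list) or not all(
        isinstance(x, str) and x.strip() for x in data
    ):
        return None  # wrong shape: a second, distinct failure mode
    return [x.strip() for x in data]
```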

Report format

docs/prompt-audit.md with one section per prompt file, each containing:

  • Purpose: what the prompt asks the LLM to do
  • Call site: file and line range
  • Output contract: expected format and current parse handling
  • Sample outputs: 3–5 representative examples (manual runs)
  • Risks: format reliability issues or quality gaps observed
  • Recommendation: proposed rewrite snippet OR deterministic alternative OR "no change needed", with rationale

Followed by a summary table and a list of follow-on issues filed.

Definition of done

  • All items in "What to investigate" have been addressed
  • Findings are written up in docs/prompt-audit.md in the format specified above
  • A clear recommendation or conclusion is stated for each prompt
  • A follow-up issue is filed for every actionable finding
