Which prompts in `backend/prompts/` have reliability or quality risks, and where can LLM work be replaced by deterministic logic?
## Background
As of issue #9, the system has five active LLM-backed prompts across `backend/prompts/`. Quality of generated output (especially the suggested questions in `query_corpus_fit`) is currently unconstrained at the code level — if the model returns generic or unhelpful questions, the mismatch classification still proceeds normally. No systematic review of prompt design has been done. This investigation catalogs every prompt, assesses reliability risks, and identifies where LLM work can be replaced by deterministic logic.
Motivating example: `query_corpus_fit.py:64–80` — question-generation quality is entirely LLM-dependent; poor questions still allow mismatch classification to proceed without any code-level signal.
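For illustration, a code-level signal here would not need to be elaborate. The sketch below is hypothetical (the helper `question_quality_flags` and its thresholds do not exist in the codebase); it shows one way a deterministic gate over the generated questions could flag generic or degenerate output before mismatch classification proceeds.

```python
# Hypothetical sketch — names and thresholds are illustrative, not existing code.
MIN_QUESTIONS = 3
GENERIC_OPENERS = ("what is", "tell me about", "can you explain")

def question_quality_flags(questions: list[str]) -> list[str]:
    """Return human-readable warnings about a batch of generated questions."""
    flags: list[str] = []
    normalized = [q.strip().lower() for q in questions]
    if len(questions) < MIN_QUESTIONS:
        flags.append(f"only {len(questions)} question(s) generated")
    if len(set(normalized)) < len(normalized):
        flags.append("duplicate questions detected")
    for original, lowered in zip(questions, normalized):
        if lowered.startswith(GENERIC_OPENERS):
            flags.append(f"generic opener: {original!r}")
    return flags
```

Even if the questions still come from the LLM, a check along these lines would give the pipeline the code-level signal that is currently missing.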
This issue is intentionally scheduled after #9 so that `verdict_prompts.py` and `calibration_prompts.py` are implemented before the audit runs.
## Scope

In scope: `backend/prompts/`, including the stubs filled by #9 (Verdict-generation and recommendation prompt layer).

NOT in scope:

## What to investigate

- `generation_prompts.py` — `GENERATION_SYSTEM_PROMPT` + `build_generation_prompt()`, used in `services/generator.py`; output is free-form prose
- `hedging_prompts.py` — `CLAIM_EXTRACTION_PROMPT` (JSON array) + `ENTAILMENT_PROMPT` (binary keyword), used in `services/forensics/hedging_mismatch.py`
- `query_fit_prompts.py` — `build_question_generation_prompt()` (JSON array of strings), used in `services/forensics/query_corpus_fit.py`
- `verdict_prompts.py` — stub filled by #9 (Verdict-generation and recommendation prompt layer); review design before hardening
- `calibration_prompts.py` — stub filled by #9 (Verdict-generation and recommendation prompt layer); review design before hardening
- (`recommendation_rules.py` is data-only, no LLM calls — skip)
Prime offload candidates to assess (a rough sketch of both follows this list):
- `ENTAILMENT_PROMPT` — chunk/claim cosine similarity (already computed by `chunk_attribution`) may substitute for binary LLM entailment
- `CLAIM_EXTRACTION_PROMPT` — spaCy sentence splitting could replace or augment it; hedging classification is already deterministic
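Rough sketches of both offload candidates, under two assumptions the audit would need to confirm: that claim and chunk embedding vectors are already available (the issue notes `chunk_attribution` computes these similarities), and that a spaCy English model is installed. Function names and the threshold are illustrative, not existing code.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: small English model is installed

def split_claims(answer_text: str) -> list[str]:
    """Candidate replacement for CLAIM_EXTRACTION_PROMPT: one claim per sentence."""
    return [sent.text.strip() for sent in nlp(answer_text).sents if sent.text.strip()]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def claim_is_supported(claim_vec: np.ndarray,
                       chunk_vecs: list[np.ndarray],
                       threshold: float = 0.75) -> bool:
    """Candidate replacement for ENTAILMENT_PROMPT: similarity above a tuned cutoff."""
    return any(cosine(claim_vec, v) >= threshold for v in chunk_vecs)
```

The trade-off to weigh in the audit: cosine similarity measures topical closeness, not entailment, so a claim can score high against a chunk that merely mentions the same entities; what is gained is determinism, speed, and zero extra LLM calls.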
## How to evaluate
- Format reliability: Is the output format constraint tight enough to parse without a bare `except`? What failure modes exist? (See the parsing sketch after this list.)
- Instruction quality: Do the instructions leave room for generic, ambiguous, or unhelpful output? Collect 3–5 sample outputs per prompt to ground the assessment.
- Deterministic offload: Could this step be done more cheaply/reliably without an LLM? What is lost, and what is gained?
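As a concrete reference point for the format-reliability question, a parser for a prompt that promises a JSON array of strings can fail closed on specific error types instead of relying on a bare `except`. The helper below is a hypothetical sketch, not existing code.

```python
import json

def parse_string_array(raw: str) -> list[str] | None:
    """Return the parsed list, or None if the model broke the format contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model returned prose, markdown fences, truncated JSON, ...
    if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
        return None  # valid JSON but wrong shape (dict, nested arrays, numbers, ...)
    return data
```

Returning `None` forces the caller to choose a fallback explicitly, and the two failure branches are the kinds of failure modes the audit should enumerate per prompt.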
## Report format
`docs/prompt-audit.md` with one section per prompt file, each containing:

Followed by a summary table and a list of follow-on issues filed.
## Definition of done
- `docs/prompt-audit.md` in the format specified above