git diff for your prompt's behavior
```bash
pip install diffprompt
diffprompt diff v1.txt v2.txt --auto-generate
```

You changed one sentence in your prompt. Now you're wondering: did that actually help?
LangSmith tells you what happened after you shipped. LangFuse tells you what's happening right now. Neither tells you what will happen before you change anything.
diffprompt does.
You give it two prompts. It generates test cases, runs both prompts on all of them, measures the semantic difference between every output pair, and tells you exactly where v2 works better, where it regresses, and why.
```
$ diffprompt diff v1.txt v2.txt --auto-generate --n 20

diffprompt v0.1.0   model: groq/llama-3.3-70b-versatile   judge: local/qwen2.5:7b   tests: 20

━━ SUMMARY
  18.2/100  ███░░░░░░░░░░░░░░░░░   4 improved   16 regressed   0 neutral
  mix: 9 typical · 7 adversarial · 2 boundary · 2 format

━━ BEHAVIORAL PROFILE
  v2 performs well when...
    ✓ user_intent:informational        score 0.79   4 tests
  v2 struggles when...
    ✗ emotional_state:frustrated       score 0.43   5 tests
    ✗ request_type:specific_solutions  score 0.51   11 tests

━━ KEY EXAMPLES
  MOST IMPORTANT   emotional_state:frustrated   divergence 0.90   conf 0.91
  input  Can you help me with a math problem I'm stuck on
  v1     I'd be happy to help you with your math problem. What kind of problem are you working on?
  v2     What's the problem?
  why    v2's brevity instruction strips the empathetic framing that makes
         frustrated users feel heard before the question lands.

━━ VERDICT
  ✗ DO NOT SHIP
  Keep v1 for emotional_state:frustrated, request_type:specific_solutions.
  Primary failure mode: CONTEXT_LOSS (6 cases).
```
```bash
pip install diffprompt
```

Requires Python 3.10+. Works fully offline with Ollama. No OpenAI key needed.

```bash
# Option A — use Groq (free at console.groq.com)
export GROQ_API_KEY=your_key_here
diffprompt diff v1.txt v2.txt --auto-generate

# Option B — run fully offline with Ollama
ollama pull qwen2.5:7b
diffprompt diff v1.txt v2.txt --auto-generate --local-only

# Option C — bring your own test inputs
diffprompt diff v1.txt v2.txt --test-file inputs.jsonl
```

diffprompt reads your prompt and infers what input dimensions matter for testing it — tone, complexity, intent, emotional state, whatever's relevant. No hardcoded dimensions. Every prompt gets its own.
Test cases are generated across four buckets: typical (real usage), adversarial (designed to find failures), boundary (edge cases), and format (unusual input styles). Each case is automatically tagged with its inferred dimensions.
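To make the tagging concrete, a generated case might carry something like this single-line record (field names are illustrative, not diffprompt's actual schema; check the repo for the exact format `--test-file` expects):

```json
{"input": "I've asked three times already, just tell me the answer", "bucket": "adversarial", "dimensions": {"emotional_state": "frustrated", "request_type": "specific_solutions"}}
```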
Both prompts run on all test cases concurrently. Outputs are compared using local embeddings (all-MiniLM-L6-v2) to produce a similarity score per pair. High similarity means the change didn't matter. Low similarity means something changed.
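Under the hood this is ordinary sentence-embedding cosine similarity. A minimal sketch of the idea with the same model (an illustration, not diffprompt's internal code):

```python
from sentence_transformers import SentenceTransformer, util

# Local embedding model, the same one diffprompt lists as its default.
model = SentenceTransformer("all-MiniLM-L6-v2")

def output_similarity(out_v1: str, out_v2: str) -> float:
    """Cosine similarity between the two outputs for one test input."""
    emb = model.encode([out_v1, out_v2], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# High score: the prompt change barely affected this input.
# Low score: behavior diverged and the pair is worth judging.
print(output_similarity(
    "I'd be happy to help you with your math problem.",
    "What's the problem?",
))
```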
For every meaningfully different pair, a judge LLM evaluates direction: improvement, regression, or neutral. Confident verdicts stay local. Uncertain ones escalate to a larger model automatically.
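The escalation logic is essentially a confidence gate. A rough sketch, where the judges are stand-ins for LLM calls and the 0.7 cutoff is an assumed value rather than diffprompt's real default:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    direction: str      # "improvement" | "regression" | "neutral"
    reason: str         # free-text explanation from the judge
    confidence: float   # judge's self-reported confidence in [0, 1]

# A Judge is any callable that asks an LLM to compare two outputs for one
# input and parses its answer into a Verdict (prompting details omitted).
Judge = Callable[[str, str, str], Verdict]

def judge_with_escalation(inp: str, out_v1: str, out_v2: str,
                          local_judge: Judge, bigger_judge: Judge,
                          min_confidence: float = 0.7) -> Verdict:
    verdict = local_judge(inp, out_v1, out_v2)
    if verdict.confidence >= min_confidence:
        return verdict                          # confident verdicts stay local
    return bigger_judge(inp, out_v1, out_v2)    # uncertain ones escalate
```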
Results are grouped by dimension. Instead of one aggregate score, you get a score per behavioral slice — not "47/100 overall" but "works for factual, breaks for emotional."
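The per-slice scores are plain aggregation over those dimension tags; something in the spirit of:

```python
from collections import defaultdict
from statistics import mean

def score_by_slice(results):
    """results: iterable of (dimension_tags, score) pairs, e.g.
    ({"emotional_state": "frustrated"}, 0.43). Returns mean score per slice."""
    slices = defaultdict(list)
    for tags, score in results:
        for dim, value in tags.items():
            slices[f"{dim}:{value}"].append(score)
    return {name: round(mean(scores), 2) for name, scores in slices.items()}
```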
HDBSCAN clusters the judge's reasons automatically. Instead of 20 individual explanations, you get named failure modes: CONTEXT_LOSS, TONE_SHIFT, REFUSAL_SHIFT.
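The clustering step is the usual embed-then-cluster pattern over the judge's free-text reasons. A sketch with the same libraries the project depends on (the real pipeline also lists UMAP, presumably for reducing dimensionality first, and derives the cluster names from the reasons themselves):

```python
import hdbscan
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_failure_reasons(reasons: list[str]) -> dict[int, list[str]]:
    """Group free-text judge explanations into candidate failure modes.
    Label -1 is HDBSCAN's noise bucket (reasons that fit no cluster)."""
    embeddings = embedder.encode(reasons)
    labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(embeddings)
    clusters: dict[int, list[str]] = {}
    for label, reason in zip(labels, reasons):
        clusters.setdefault(int(label), []).append(reason)
    return clusters
```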
Tools like LangSmith and LangFuse monitor production. They tell you what happened.
diffprompt is a pre-flight check. It tells you what will happen before you touch production.
Different job. Different tool.
| Layer | Task | Default | Cost |
|---|---|---|---|
| Test generation | Generate inputs | qwen2.5:7b via Ollama | Free local |
| Embedding | Similarity | all-MiniLM-L6-v2 | Free local |
| Runner | Execute prompts | llama-3.3-70b via Groq | Free tier |
| Judge | Verdict + reason | qwen2.5:7b via Ollama | Free local |
| Escalation | Low confidence | llama-3.3-70b via Groq | Free tier |
Override any layer with --model and --judge.
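For example, to pin the runner to Groq and keep judging local (the provider/model strings here mirror the ones shown in the sample output; use whatever identifiers your setup accepts):

```bash
diffprompt diff v1.txt v2.txt --auto-generate \
  --model groq/llama-3.3-70b-versatile \
  --judge local/qwen2.5:7b
```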
```
diffprompt diff <v1> <v2> [options]

  --auto-generate     Generate test cases automatically
  --n INT             Number of test cases (default: 40)
  --test-file PATH    Use existing test inputs from .jsonl file
  --model STRING      Override runner model
  --judge STRING      Override judge model
  --local-only        Never call external APIs
  --no-judge          Skip judge, similarity scores only
  --output FORMAT     terminal (default) | json | html
  --save PATH         Save report to file
  --top-n INT         Show top N key examples (default: 3)
  --verbose           Show all diffs ranked by divergence
  --quiet             Score + verdict only
  --ci                CI mode: exit 1 on regression
  --threshold INT     CI failure threshold 0-100 (default: 75)
```
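A few combinations that fall out of these flags, as examples rather than recommendations:

```bash
# Your own inputs, similarity only, machine-readable report
diffprompt diff v1.txt v2.txt --test-file inputs.jsonl --no-judge \
  --output json --save report.json

# Small, quiet sanity check
diffprompt diff v1.txt v2.txt --auto-generate --n 10 --quiet
```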
- Terminal — color-coded, fits in one screen.
- JSON — full structured report for downstream processing.
- HTML — self-contained file, open in browser.
```bash
diffprompt diff v1.txt v2.txt --auto-generate --output html --save report.html
```

```yaml
- name: Prompt regression check
  run: |
    diffprompt diff prompts/v1.txt prompts/v2.txt \
      --auto-generate \
      --ci \
      --threshold 75
```

Exits with code 1 if regression score drops below threshold. Merge blocked.
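Wired into a full workflow, that step might look roughly like this (action versions, the Python version, and the secret name are illustrative; swap the install and env for an Ollama setup if you run `--local-only`):

```yaml
name: prompt-checks
on: pull_request

jobs:
  prompt-regression:
    runs-on: ubuntu-latest
    env:
      GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install diffprompt
      - name: Prompt regression check
        run: |
          diffprompt diff prompts/v1.txt prompts/v2.txt \
            --auto-generate \
            --ci \
            --threshold 75
```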
Prompts have behavior, not just text.
When you change a prompt, you're not editing a document. You're changing how a system responds to thousands of possible inputs. Most of those inputs you've never seen. Some of them are edge cases you didn't think to test.
diffprompt makes the invisible visible. It tells you which inputs your change helped, which it hurt, and why — before any of it reaches a user.
Python 3.10+ · sentence-transformers · HDBSCAN · UMAP · httpx · Click · Rich · Pydantic · Groq API · Ollama
Issues and PRs welcome.
```bash
git clone https://github.com/RudraDudhat2509/diffprompt
cd diffprompt
pip install -e ".[dev]"
pytest tests/ -v
```

MIT