git diff for your prompt's behavior
```bash
pip install diffprompt
diffprompt diff v1.txt v2.txt --auto-generate
```

You changed one sentence in your prompt. Now you're wondering: did that actually help?
LangSmith tells you what happened after you shipped. LangFuse tells you what's happening right now. Neither tells you what will happen before you change anything.
diffprompt does.
You give it two prompts. It generates test cases, runs both prompts on all of them, measures the semantic difference between every output pair, and tells you exactly where v2 works better, where it regresses, and why.
```
$ diffprompt diff v1.txt v2.txt --auto-generate --n 20

diffprompt v0.1.0   model: groq/llama-3.3-70b-versatile   judge: local/qwen2.5:7b   tests: 20

━━ SUMMARY
  18.2/100  ███░░░░░░░░░░░░░░░░░   4 improved   16 regressed   0 neutral
  mix: 9 typical · 7 adversarial · 2 boundary · 2 format

━━ BEHAVIORAL PROFILE
  v2 performs well when...
    ✓ user_intent:informational        score 0.79   4 tests
  v2 struggles when...
    ✗ emotional_state:frustrated       score 0.43   5 tests
    ✗ request_type:specific_solutions  score 0.51   11 tests

━━ KEY EXAMPLES
  MOST IMPORTANT   emotional_state:frustrated   divergence 0.90   conf 0.91
  input  Can you help me with a math problem I'm stuck on
  v1     I'd be happy to help you with your math problem. What kind of problem are you working on?
  v2     What's the problem?
  why    v2's brevity instruction strips the empathetic framing that makes
         frustrated users feel heard before the question lands.

━━ VERDICT
  ✗ DO NOT SHIP
  Keep v1 for emotional_state:frustrated, request_type:specific_solutions.
  Primary failure mode: CONTEXT_LOSS (6 cases).
```
```bash
pip install diffprompt
```

Requires Python 3.10+. Works fully offline with Ollama. No OpenAI key needed.

```bash
# Option A — use Groq (free at console.groq.com)
export GROQ_API_KEY=your_key_here
diffprompt diff v1.txt v2.txt --auto-generate

# Option B — run fully offline with Ollama
ollama pull qwen2.5:7b
diffprompt diff v1.txt v2.txt --auto-generate --local-only

# Option C — bring your own test inputs
diffprompt diff v1.txt v2.txt --test-file inputs.jsonl
```

diffprompt reads your prompt and infers what input dimensions matter for testing it — tone, complexity, intent, emotional state, whatever's relevant. No hardcoded dimensions. Every prompt gets its own.
Test cases are generated across four buckets: typical (real usage), adversarial (designed to find failures), boundary (edge cases), and format (unusual input styles). Each case is automatically tagged with its inferred dimensions.
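To make the tagging concrete, a generated case might carry something like this single-line record (field names are illustrative, not diffprompt's actual schema; check the repo for the exact format `--test-file` expects):

```json
{"input": "I've asked three times already, just tell me the answer", "bucket": "adversarial", "dimensions": {"emotional_state": "frustrated", "request_type": "specific_solutions"}}
```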
Both prompts run on all test cases concurrently. Outputs are compared using local embeddings (all-MiniLM-L6-v2) to produce a similarity score per pair. High similarity means the change didn't matter. Low similarity means something changed.
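Under the hood this is ordinary sentence-embedding cosine similarity. A minimal sketch of the idea with the same model (an illustration, not diffprompt's internal code):

```python
from sentence_transformers import SentenceTransformer, util

# Local embedding model, the same one diffprompt lists as its default.
model = SentenceTransformer("all-MiniLM-L6-v2")

def output_similarity(out_v1: str, out_v2: str) -> float:
    """Cosine similarity between the two outputs for one test input."""
    emb = model.encode([out_v1, out_v2], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# High score: the prompt change barely affected this input.
# Low score: behavior diverged and the pair is worth judging.
print(output_similarity(
    "I'd be happy to help you with your math problem.",
    "What's the problem?",
))
```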
For every meaningfully different pair, a judge LLM evaluates direction: improvement, regression, or neutral. Confident verdicts stay local. Uncertain ones escalate to a larger model automatically.
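The escalation logic is essentially a confidence gate. A rough sketch, where the judges are stand-ins for LLM calls and the 0.7 cutoff is an assumed value rather than diffprompt's real default:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    direction: str      # "improvement" | "regression" | "neutral"
    reason: str         # free-text explanation from the judge
    confidence: float   # judge's self-reported confidence in [0, 1]

# A Judge is any callable that asks an LLM to compare two outputs for one
# input and parses its answer into a Verdict (prompting details omitted).
Judge = Callable[[str, str, str], Verdict]

def judge_with_escalation(inp: str, out_v1: str, out_v2: str,
                          local_judge: Judge, bigger_judge: Judge,
                          min_confidence: float = 0.7) -> Verdict:
    verdict = local_judge(inp, out_v1, out_v2)
    if verdict.confidence >= min_confidence:
        return verdict                          # confident verdicts stay local
    return bigger_judge(inp, out_v1, out_v2)    # uncertain ones escalate
```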
Results are grouped by dimension. Instead of one aggregate score, you get a score per behavioral slice — not "47/100 overall" but "works for factual, breaks for emotional."
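The per-slice scores are plain aggregation over those dimension tags; something in the spirit of:

```python
from collections import defaultdict
from statistics import mean

def score_by_slice(results):
    """results: iterable of (dimension_tags, score) pairs, e.g.
    ({"emotional_state": "frustrated"}, 0.43). Returns mean score per slice."""
    slices = defaultdict(list)
    for tags, score in results:
        for dim, value in tags.items():
            slices[f"{dim}:{value}"].append(score)
    return {name: round(mean(scores), 2) for name, scores in slices.items()}
```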
HDBSCAN clusters the judge's reasons automatically. Instead of 20 individual explanations, you get named failure modes: CONTEXT_LOSS, TONE_SHIFT, REFUSAL_SHIFT.
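The clustering step is the usual embed-then-cluster pattern over the judge's free-text reasons. A sketch with the same libraries the project depends on (the real pipeline also lists UMAP, presumably for reducing dimensionality first, and derives the cluster names from the reasons themselves):

```python
import hdbscan
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_failure_reasons(reasons: list[str]) -> dict[int, list[str]]:
    """Group free-text judge explanations into candidate failure modes.
    Label -1 is HDBSCAN's noise bucket (reasons that fit no cluster)."""
    embeddings = embedder.encode(reasons)
    labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(embeddings)
    clusters: dict[int, list[str]] = {}
    for label, reason in zip(labels, reasons):
        clusters.setdefault(int(label), []).append(reason)
    return clusters
```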
Tools like LangSmith and LangFuse monitor production. They tell you what happened.
diffprompt is a pre-flight check. It tells you what will happen before you touch production.
Different job. Different tool.
| Layer | Task | Default | Cost |
|---|---|---|---|
| Test generation | Generate inputs | qwen2.5:7b via Ollama | Free local |
| Embedding | Similarity | all-MiniLM-L6-v2 | Free local |
| Runner | Execute prompts | llama-3.3-70b via Groq | Free tier |
| Judge | Verdict + reason | qwen2.5:7b via Ollama | Free local |
| Escalation | Low confidence | llama-3.3-70b via Groq | Free tier |
Override any layer with --model and --judge.
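For example, to pin the runner to Groq and keep judging local (the provider/model strings here mirror the ones shown in the sample output; use whatever identifiers your setup accepts):

```bash
diffprompt diff v1.txt v2.txt --auto-generate \
  --model groq/llama-3.3-70b-versatile \
  --judge local/qwen2.5:7b
```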
```
diffprompt diff <v1> <v2> [options]

  --auto-generate     Generate test cases automatically
  --n INT             Number of test cases (default: 40)
  --test-file PATH    Use existing test inputs from .jsonl file
  --model STRING      Override runner model
  --judge STRING      Override judge model
  --local-only        Never call external APIs
  --no-judge          Skip judge, similarity scores only
  --output FORMAT     terminal (default) | json | html
  --save PATH         Save report to file
  --top-n INT         Show top N key examples (default: 3)
  --verbose           Show all diffs ranked by divergence
  --quiet             Score + verdict only
  --ci                CI mode: exit 1 on regression
  --threshold INT     CI failure threshold 0-100 (default: 75)
```
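A few combinations that fall out of these flags, as examples rather than recommendations:

```bash
# Your own inputs, similarity only, machine-readable report
diffprompt diff v1.txt v2.txt --test-file inputs.jsonl --no-judge \
  --output json --save report.json

# Small, quiet sanity check
diffprompt diff v1.txt v2.txt --auto-generate --n 10 --quiet
```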
- Terminal — color-coded, fits in one screen.
- JSON — full structured report for downstream processing.
- HTML — self-contained file, open in browser.
```bash
diffprompt diff v1.txt v2.txt --auto-generate --output html --save report.html
```

```yaml
- name: Prompt regression check
  run: |
    diffprompt diff prompts/v1.txt prompts/v2.txt \
      --auto-generate \
      --ci \
      --threshold 75
```

Exits with code 1 if regression score drops below threshold. Merge blocked.
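Wired into a full workflow, that step might look roughly like this (action versions, the Python version, and the secret name are illustrative; swap the install and env for an Ollama setup if you run `--local-only`):

```yaml
name: prompt-checks
on: pull_request

jobs:
  prompt-regression:
    runs-on: ubuntu-latest
    env:
      GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install diffprompt
      - name: Prompt regression check
        run: |
          diffprompt diff prompts/v1.txt prompts/v2.txt \
            --auto-generate \
            --ci \
            --threshold 75
```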
Prompts have behavior, not just text.
When you change a prompt, you're not editing a document. You're changing how a system responds to thousands of possible inputs. Most of those inputs you've never seen. Some of them are edge cases you didn't think to test.
diffprompt makes the invisible visible. It tells you which inputs your change helped, which it hurt, and why — before any of it reaches a user.
Python 3.10+ · sentence-transformers · HDBSCAN · UMAP · httpx · Click · Rich · Pydantic · Groq API · Ollama
Issues and PRs welcome.
```bash
git clone https://github.com/RudraDudhat2509/diffprompt
cd diffprompt
pip install -e ".[dev]"
pytest tests/ -v
```

MIT