# promptdiff

Prompt regression testing CLI — detect drift when your prompts change. Like apidiff for APIs, but for LLM prompts.
Prompts are code. But unlike code, there's no test suite that catches when your system prompt changes silently and your model starts giving worse answers.
promptdiff gives you:
- Version tracking for your prompts (SHA-256 content hashing)
- Golden snapshot captures of expected outputs
- Regression detection using 5 similarity algorithms
- CI-ready exit codes and machine-readable reports
When your prompt changes from "Be concise" to "Be thorough and detailed", promptdiff catches the drift before your users do.
## Installation

```bash
npm install -g promptdiff
# or
npx promptdiff init
```

## Quick start

```bash
# 1. Initialize in your project
promptdiff init

# 2. Save your prompt + golden output as a snapshot
echo "You are a helpful assistant." > prompt.txt
echo "How can I help you today?" > golden-output.txt
promptdiff snapshot assistant \
  --prompt-file prompt.txt \
  --output-file golden-output.txt

# 3. Later, test a new model output against the snapshot
echo "What can I do for you?" | promptdiff test --prompt assistant --mode cosine --threshold 0.85
# exit 0 = pass, exit 1 = regression detected

# 4. See what changed between prompt versions
promptdiff diff abc123 def456 --prompt assistant

# 5. Generate a full report
promptdiff report --format markdown --output report.md
```

## Commands

### `promptdiff init`

Initialize promptdiff in the current directory.

```bash
promptdiff init [--dir <dir>]
```

Creates:

- `.promptdiff/` — storage directory for versions and snapshots
- `.promptdiff.yaml` — configuration file
### `promptdiff snapshot`

Capture a golden output snapshot for a prompt.

```bash
# From files
promptdiff snapshot my-system-prompt \
  --prompt-file prompt.txt \
  --output-file expected-output.txt

# With model metadata
promptdiff snapshot my-prompt \
  --prompt-file prompt.txt \
  --output-file expected.txt \
  --model gpt-4o \
  --provider openai \
  --tag "production,v2"

# Overwrite existing snapshot
promptdiff snapshot my-prompt \
  --prompt-file prompt.txt \
  --output-file new-expected.txt \
  --force
```

Options:

| Flag | Description |
|---|---|
| `--prompt-file <file>` | Read prompt content from file |
| `--output-file <file>` | Golden output file |
| `--model <model>` | Model used to generate (informational) |
| `--provider <provider>` | Provider name (informational) |
| `--tag <tags>` | Comma-separated tags |
| `--force` | Overwrite existing snapshot |
### `promptdiff test`

Run a regression test against a stored snapshot.

```bash
# Pipe actual output
echo "actual model output" | promptdiff test --prompt my-prompt --mode cosine

# From file
promptdiff test \
  --prompt my-prompt \
  --file actual-output.txt \
  --mode cosine \
  --threshold 0.90

# Exact match mode
promptdiff test --prompt my-prompt --file output.txt --mode exact

# Structural (JSON schema) mode
promptdiff test \
  --prompt my-json-prompt \
  --file response.json \
  --mode structural \
  --schema schema.json

# Output as JSON (for CI parsing)
promptdiff test --prompt my-prompt --file out.txt --json
```

Options:

| Flag | Description |
|---|---|
| `--prompt <name>` | Prompt name to test (required) |
| `--file <file>` | Actual output file to test |
| `--stdin` | Read actual output from stdin |
| `--snapshot <id>` | Specific snapshot ID (default: latest) |
| `--mode <mode>` | Similarity mode (default: cosine) |
| `--threshold <value>` | Similarity threshold 0-1 (default: 0.85) |
| `--schema <file>` | JSON schema file for structural mode |
| `--json` | Output result as JSON |
| `--quiet` | Only exit code, no output |

Exit codes:

- `0` — Test passed
- `1` — Regression detected (score below threshold)
- `2` — Error (missing files, bad config, etc.)
### `promptdiff diff`

Show the diff between two prompt versions.

```bash
# Diff by hash prefix
promptdiff diff abc123 def456 --prompt my-prompt

# With more context
promptdiff diff abc123 def456 --prompt my-prompt --context 5

# JSON output
promptdiff diff abc123 def456 --prompt my-prompt --json
```

Options:

| Flag | Description |
|---|---|
| `--prompt <name>` | Prompt name (required) |
| `--context <lines>` | Context lines in diff (default: 3) |
| `--json` | Output as JSON |
### `promptdiff report`

Generate a full regression report for all tracked prompts.

```bash
# Text report (default)
promptdiff report

# Markdown report to file
promptdiff report --format markdown --output report.md

# JSON report for CI
promptdiff report --format json --output report.json

# Override similarity settings
promptdiff report --mode levenshtein --threshold 0.90
```

Options:

| Flag | Description |
|---|---|
| `--format <fmt>` | `json`, `text`, or `markdown` (default: `text`) |
| `--output <file>` | Write report to file |
| `--suite <name>` | Run only prompts in a named suite from config |
| `--mode <mode>` | Similarity mode for all tests |
| `--threshold <value>` | Override threshold |
### `promptdiff list`

List all tracked prompts and their versions.

```bash
# List all prompts
promptdiff list

# List versions for a specific prompt
promptdiff list --prompt my-prompt

# JSON output
promptdiff list --json
```

## Similarity modes

promptdiff ships with 5 similarity algorithms, all implemented from scratch — no ML dependencies.

### `exact`

Byte-for-byte equality. Score is either 1.0 (match) or 0.0 (no match).

```bash
promptdiff test --prompt my-prompt --mode exact
```

Best for: deterministic outputs, templated responses.
### `levenshtein`

Normalized Levenshtein edit distance. Score = 1 - (edits / maxLen).

```bash
promptdiff test --prompt my-prompt --mode levenshtein --threshold 0.90
```

Best for: short outputs where character-level changes matter (e.g., structured codes, IDs).

Algorithm: Standard DP with two-row optimization for O(n*m) time and O(n) space.
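For illustration, the normalized score with the two-row optimization could be sketched like this (a sketch of the technique, not promptdiff's actual source):

```typescript
// Normalized Levenshtein similarity: 1 - (edit distance / max length).
// Only two rows of the DP table are kept instead of the full matrix,
// giving O(n*m) time but O(n) space.
function levenshteinSimilarity(a: string, b: string): number {
  if (a === b) return 1.0;
  const maxLen = Math.max(a.length, b.length);
  if (maxLen === 0) return 1.0;

  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  let curr: number[] = new Array(b.length + 1).fill(0);

  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution
      );
    }
    [prev, curr] = [curr, prev]; // reuse rows instead of allocating
  }
  return 1 - prev[b.length] / maxLen;
}
```

For example, "kitten" vs. "sitting" has edit distance 3 over max length 7, scoring 1 - 3/7 ≈ 0.57 — well below the suggested 0.90 threshold.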
### `jaccard`

Token-level Jaccard index: |A ∩ B| / |A ∪ B|.

```bash
promptdiff test --prompt my-prompt --mode jaccard --threshold 0.70
```

Best for: detecting vocabulary drift. Not sensitive to word order.

Algorithm: Tokenizes on whitespace + punctuation, lowercases, then computes set intersection over union.
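A minimal sketch of this mode (the `\W+` split is a simplification of the whitespace-plus-punctuation tokenizer described above):

```typescript
// Lowercase and split on non-word characters to get a token set.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard index: |A ∩ B| / |A ∪ B| over the two token sets.
function jaccardSimilarity(a: string, b: string): number {
  const setA = tokenSet(a);
  const setB = tokenSet(b);
  if (setA.size === 0 && setB.size === 0) return 1.0;
  let intersection = 0;
  for (const token of setA) {
    if (setB.has(token)) intersection++;
  }
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}
```

Because only set membership matters, reordering the same words scores 1.0 — which is exactly why this mode suits vocabulary drift rather than structure.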
### `cosine`

TF-vector cosine similarity: cos(A, B) = (A · B) / (|A| × |B|).

```bash
promptdiff test --prompt my-prompt --mode cosine --threshold 0.85
```

Best for: general-purpose prompt regression testing. Sensitive to vocabulary but not minor reorderings.

Algorithm: Builds term-frequency sparse vectors, then computes dot product and magnitudes. No external dependencies.
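A sketch of the sparse-vector approach, assuming the same simple tokenizer as the Jaccard sketch (illustrative, not the actual implementation):

```typescript
// Build a sparse term-frequency vector: token -> count.
function termFrequencies(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    tf.set(token, (tf.get(token) ?? 0) + 1);
  }
  return tf;
}

// cos(A, B) = (A · B) / (|A| * |B|) over the sparse vectors.
function cosineSimilarity(a: string, b: string): number {
  const va = termFrequencies(a);
  const vb = termFrequencies(b);
  // Iterate only the smaller vector's terms for the dot product.
  const [small, large] = va.size <= vb.size ? [va, vb] : [vb, va];
  let dot = 0;
  for (const [term, count] of small) {
    dot += count * (large.get(term) ?? 0);
  }
  const magnitude = (v: Map<string, number>): number =>
    Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  const denom = magnitude(va) * magnitude(vb);
  return denom === 0 ? 0 : dot / denom;
}
```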
### `structural`

JSON schema validation. Returns 1.0 if the output matches the schema, or a partial score based on the violation count.

```bash
promptdiff test \
  --prompt json-extractor \
  --file response.json \
  --mode structural \
  --schema schema.json
```

Best for: prompts that must produce structured JSON output (extractors, classifiers, etc.).

Schema file example (`schema.json`):

```json
{
  "type": "object",
  "required": ["entities", "sentiment"],
  "properties": {
    "entities": {
      "type": "array",
      "items": { "type": "string" }
    },
    "sentiment": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"]
    }
  }
}
```

### Choosing a mode

| Mode | Recommended Threshold | When to Use |
|---|---|---|
| exact | 1.0 (implicit) | Deterministic outputs |
| levenshtein | 0.90 | Short outputs, typo tolerance |
| jaccard | 0.70 | Bag-of-words vocabulary match |
| cosine | 0.85 | General regression testing |
| structural | 1.0 (implicit) | JSON schema compliance |
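Returning to `structural` mode, its partial-score rule could be sketched as below. The supported schema keywords and the 0.25 per-violation penalty are illustrative assumptions, not the tool's exact behavior:

```typescript
// Hypothetical sketch: validate against a small subset of JSON Schema
// (object/array/string types, required keys, string enums) and reduce
// the score by an assumed 0.25 per violation.
type MiniSchema = {
  type?: string;
  required?: string[];
  properties?: Record<string, MiniSchema>;
  enum?: unknown[];
};

function structuralScore(value: unknown, schema: MiniSchema): number {
  const violations: string[] = [];
  const check = (v: unknown, s: MiniSchema, path: string): void => {
    if (s.type === "object") {
      if (typeof v !== "object" || v === null || Array.isArray(v)) {
        violations.push(`${path}: expected object`);
        return;
      }
      const obj = v as Record<string, unknown>;
      for (const key of s.required ?? []) {
        if (!(key in obj)) violations.push(`${path}.${key}: missing`);
      }
      for (const [key, sub] of Object.entries(s.properties ?? {})) {
        if (key in obj) check(obj[key], sub, `${path}.${key}`);
      }
    } else if (s.type === "array") {
      if (!Array.isArray(v)) violations.push(`${path}: expected array`);
    } else if (s.type === "string") {
      if (typeof v !== "string") violations.push(`${path}: expected string`);
      else if (s.enum && !s.enum.includes(v)) violations.push(`${path}: not in enum`);
    }
  };
  check(value, schema, "$");
  return violations.length === 0 ? 1.0 : Math.max(0, 1 - 0.25 * violations.length);
}
```

A production validator covers far more keywords (nested items, numbers, formats); this only shows how a violation count can become a graded score rather than a binary pass/fail.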
## Configuration

`.promptdiff.yaml` (auto-generated by `init`):

```yaml
# Directory to store promptdiff data
storageDir: .promptdiff

# Default similarity mode: exact | levenshtein | jaccard | cosine | structural
defaultMode: cosine

# Default similarity threshold (0-1)
defaultThreshold: 0.85

# Report format: json | text | markdown
reportFormat: text

# Provider information (optional, stored in snapshots for reference)
provider:
  name: openai
  model: gpt-4o

# Test suites (optional, used by `report --suite`)
suites:
  - name: core-prompts
    prompts:
      - system-prompt
      - user-greeting
    mode: cosine
    threshold: 0.90
```

## CI integration

GitHub Actions example:

```yaml
name: Prompt Regression Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install -g promptdiff
      - name: Run prompt regression tests
        run: |
          # Test each prompt against stored golden outputs
          cat outputs/system-prompt.txt | promptdiff test \
            --prompt system-prompt \
            --mode cosine \
            --threshold 0.90
          cat outputs/user-greeting.txt | promptdiff test \
            --prompt user-greeting \
            --mode exact
      - name: Generate markdown report
        if: always()
        run: promptdiff report --format markdown --output regression-report.md
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: regression-report
          path: regression-report.md
```

The exit code alone is enough to gate a pipeline:

```bash
# Will fail CI if regression detected
promptdiff test --prompt my-prompt --file output.txt --threshold 0.85
echo "Exit: $?"  # 0=pass, 1=regression, 2=error
```

## Storage layout

All data lives in `.promptdiff/` (or your configured `storageDir`):
```
.promptdiff/
├── prompts/
│   └── my-prompt/
│       ├── latest                # Contains latest version hash
│       └── versions/
│           ├── abc123...json     # Prompt version 1
│           └── def456...json     # Prompt version 2
└── snapshots/
    └── my-prompt/
        ├── latest                # Contains latest snapshot ID
        ├── abc123def456.json     # Snapshot 1
        └── 789abc012def.json     # Snapshot 2
```
Prompt version record:

```json
{
  "id": "abc123def456",
  "name": "my-prompt",
  "content": "You are a helpful assistant...",
  "hash": "sha256-hex-64-chars",
  "createdAt": "2024-01-15T12:00:00.000Z",
  "tags": ["production", "v2"],
  "metadata": {
    "author": "alice"
  }
}
```

Snapshot record:

```json
{
  "id": "abc123def456789a",
  "promptName": "my-prompt",
  "promptHash": "sha256-of-prompt-content",
  "content": "How can I help you today?",
  "model": "gpt-4o",
  "provider": "openai",
  "capturedAt": "2024-01-15T12:00:00.000Z",
  "metadata": {}
}
```

## Architecture

```
src/
├── cli.ts             # CLI entry point (commander.js)
├── config.ts          # Config loading (.promptdiff.yaml)
├── types.ts           # TypeScript types
├── core/
│   ├── versioner.ts   # Prompt version tracking (SHA-256)
│   ├── snapshot.ts    # Golden snapshot management
│   ├── differ.ts      # Text diff engine (LCS-based)
│   ├── tester.ts      # Regression test runner
│   └── reporter.ts    # Report generation (JSON/text/markdown)
└── similarity/
    ├── index.ts       # Unified similarity interface
    ├── levenshtein.ts # Edit distance (two-row DP)
    ├── jaccard.ts     # Token-set Jaccard index
    ├── cosine.ts      # TF-vector cosine similarity
    └── structural.ts  # JSON schema validator
```

## Design notes
**Content-addressed versioning:** Prompts are stored by SHA-256 hash of their content. Saving the same prompt twice returns the same version — no duplicates. The `latest` file is a pointer to the most recent hash.
**Deterministic snapshot IDs:** Snapshot IDs are derived from hash(promptHash + content), so capturing the same output for the same prompt version is idempotent.
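A sketch of that derivation (the helper name and the 16-character truncation are illustrative assumptions):

```typescript
import { createHash } from "node:crypto";

// Deterministic snapshot ID: hash the prompt's content hash together with
// the captured output, so identical captures yield identical IDs.
// The exact derivation in promptdiff may differ in detail.
function snapshotId(promptHash: string, content: string): string {
  return createHash("sha256")
    .update(promptHash + content)
    .digest("hex")
    .slice(0, 16); // truncated for readability (assumed length)
}
```

Re-running a capture is then a no-op: the same (prompt version, output) pair always maps to the same file on disk.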
**Zero ML dependencies:** All similarity algorithms are implemented from scratch using classical NLP techniques. No scikit-learn, no sentence-transformers, no API calls. This keeps the tool fast, offline-capable, and dependency-free.

**Sparse vector representation:** Cosine similarity uses `Map<string, number>` for TF vectors instead of dense arrays, making it efficient for long prompts with large vocabularies.
**LCS-based diffing:** The differ uses a true LCS (Longest Common Subsequence) algorithm for small inputs and a fast greedy LCS for large inputs (>100K cells), preventing memory exhaustion on large prompts.
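The small-input path can be sketched as the classic dynamic-programming LCS over lines (the greedy fallback for large inputs is omitted here):

```typescript
// Longest Common Subsequence length over line arrays — the basis for
// deciding which lines a diff reports as added or removed. The full table
// costs O(n*m) time and space, which is why a cheaper greedy variant
// takes over once n*m grows past ~100K cells.
function lcsLength(a: string[], b: string[]): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0)
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] =
        a[i - 1] === b[j - 1]
          ? dp[i - 1][j - 1] + 1            // lines match: extend the LCS
          : Math.max(dp[i - 1][j], dp[i][j - 1]); // otherwise carry the best so far
    }
  }
  return dp[a.length][b.length];
}
```

Lines outside the LCS are exactly the insertions and deletions the diff prints.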
## Development

```bash
# Install dependencies
npm install

# Run tests
npm test

# Watch mode
npm run test:watch

# Build
npm run build

# Type check
npm run typecheck
```

## License

MIT — see LICENSE.
Built as part of the Agent Company pipeline — Round 73.