# promptdiff

Prompt regression testing CLI — detect drift when your prompts change. Like apidiff for APIs, but for LLM prompts.
Prompts are code. But unlike code, there's no test suite that catches when your system prompt changes silently and your model starts giving worse answers.
promptdiff gives you:
- Version tracking for your prompts (SHA-256 content hashing)
- Golden snapshot captures of expected outputs
- Regression detection using 5 similarity algorithms
- CI-ready exit codes and machine-readable reports
When your prompt changes from "Be concise" to "Be thorough and detailed", promptdiff catches the drift before your users do.
## Installation

```bash
npm install -g promptdiff
# or
npx promptdiff init
```

## Quick start

```bash
# 1. Initialize in your project
promptdiff init

# 2. Save your prompt + golden output as a snapshot
echo "You are a helpful assistant." > prompt.txt
echo "How can I help you today?" > golden-output.txt
promptdiff snapshot assistant \
  --prompt-file prompt.txt \
  --output-file golden-output.txt

# 3. Later, test a new model output against the snapshot
echo "What can I do for you?" | promptdiff test --prompt assistant --mode cosine --threshold 0.85
# exit 0 = pass, exit 1 = regression detected

# 4. See what changed between prompt versions
promptdiff diff abc123 def456 --prompt assistant

# 5. Generate a full report
promptdiff report --format markdown --output report.md
```

## Commands

### `promptdiff init`

Initialize promptdiff in the current directory.

```bash
promptdiff init [--dir <dir>]
```

Creates:

- `.promptdiff/` — storage directory for versions and snapshots
- `.promptdiff.yaml` — configuration file
### `promptdiff snapshot`

Capture a golden output snapshot for a prompt.

```bash
# From files
promptdiff snapshot my-system-prompt \
  --prompt-file prompt.txt \
  --output-file expected-output.txt

# With model metadata
promptdiff snapshot my-prompt \
  --prompt-file prompt.txt \
  --output-file expected.txt \
  --model gpt-4o \
  --provider openai \
  --tag "production,v2"

# Overwrite existing snapshot
promptdiff snapshot my-prompt \
  --prompt-file prompt.txt \
  --output-file new-expected.txt \
  --force
```

Options:

| Flag | Description |
|---|---|
| `--prompt-file <file>` | Read prompt content from file |
| `--output-file <file>` | Golden output file |
| `--model <model>` | Model used to generate (informational) |
| `--provider <provider>` | Provider name (informational) |
| `--tag <tags>` | Comma-separated tags |
| `--force` | Overwrite existing snapshot |
### `promptdiff test`

Run a regression test against a stored snapshot.

```bash
# Pipe actual output
echo "actual model output" | promptdiff test --prompt my-prompt --mode cosine

# From file
promptdiff test \
  --prompt my-prompt \
  --file actual-output.txt \
  --mode cosine \
  --threshold 0.90

# Exact match mode
promptdiff test --prompt my-prompt --file output.txt --mode exact

# Structural (JSON schema) mode
promptdiff test \
  --prompt my-json-prompt \
  --file response.json \
  --mode structural \
  --schema schema.json

# Output as JSON (for CI parsing)
promptdiff test --prompt my-prompt --file out.txt --json
```

Options:

| Flag | Description |
|---|---|
| `--prompt <name>` | Prompt name to test (required) |
| `--file <file>` | Actual output file to test |
| `--stdin` | Read actual output from stdin |
| `--snapshot <id>` | Specific snapshot ID (default: latest) |
| `--mode <mode>` | Similarity mode (default: cosine) |
| `--threshold <value>` | Similarity threshold 0-1 (default: 0.85) |
| `--schema <file>` | JSON schema file for structural mode |
| `--json` | Output result as JSON |
| `--quiet` | Only exit code, no output |

Exit codes:

- `0` — Test passed
- `1` — Regression detected (score below threshold)
- `2` — Error (missing files, bad config, etc.)
### `promptdiff diff`

Show the diff between two prompt versions.

```bash
# Diff by hash prefix
promptdiff diff abc123 def456 --prompt my-prompt

# With more context
promptdiff diff abc123 def456 --prompt my-prompt --context 5

# JSON output
promptdiff diff abc123 def456 --prompt my-prompt --json
```

Options:

| Flag | Description |
|---|---|
| `--prompt <name>` | Prompt name (required) |
| `--context <lines>` | Context lines in diff (default: 3) |
| `--json` | Output as JSON |
### `promptdiff report`

Generate a full regression report for all tracked prompts.

```bash
# Text report (default)
promptdiff report

# Markdown report to file
promptdiff report --format markdown --output report.md

# JSON report for CI
promptdiff report --format json --output report.json

# Override similarity settings
promptdiff report --mode levenshtein --threshold 0.90
```

Options:

| Flag | Description |
|---|---|
| `--format <fmt>` | `json`, `text`, or `markdown` (default: `text`) |
| `--output <file>` | Write report to file |
| `--suite <name>` | Run only prompts in a named suite from config |
| `--mode <mode>` | Similarity mode for all tests |
| `--threshold <value>` | Override threshold |
### `promptdiff list`

List all tracked prompts and their versions.

```bash
# List all prompts
promptdiff list

# List versions for a specific prompt
promptdiff list --prompt my-prompt

# JSON output
promptdiff list --json
```

## Similarity modes

promptdiff ships with 5 similarity algorithms, all implemented from scratch — no ML dependencies.

### `exact`

Byte-for-byte equality. Score is either 1.0 (match) or 0.0 (no match).

```bash
promptdiff test --prompt my-prompt --mode exact
```

Best for: deterministic outputs, templated responses.
### `levenshtein`

Normalized Levenshtein edit distance. Score = 1 - (edits / maxLen).

```bash
promptdiff test --prompt my-prompt --mode levenshtein --threshold 0.90
```

Best for: short outputs where character-level changes matter (e.g., structured codes, IDs).

Algorithm: Standard DP with two-row optimization for O(n*m) time and O(n) space.
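For illustration, the normalized score with the two-row optimization could be sketched like this (a sketch of the technique, not promptdiff's actual source):

```typescript
// Normalized Levenshtein similarity: 1 - (edit distance / max length).
// Only two rows of the DP table are kept instead of the full matrix,
// giving O(n*m) time but O(n) space.
function levenshteinSimilarity(a: string, b: string): number {
  if (a === b) return 1.0;
  const maxLen = Math.max(a.length, b.length);
  if (maxLen === 0) return 1.0;

  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  let curr: number[] = new Array(b.length + 1).fill(0);

  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution
      );
    }
    [prev, curr] = [curr, prev]; // reuse rows instead of allocating
  }
  return 1 - prev[b.length] / maxLen;
}
```

For example, "kitten" vs. "sitting" has edit distance 3 over max length 7, scoring 1 - 3/7 ≈ 0.57 — well below the suggested 0.90 threshold.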
### `jaccard`

Token-level Jaccard index: |A ∩ B| / |A ∪ B|.

```bash
promptdiff test --prompt my-prompt --mode jaccard --threshold 0.70
```

Best for: detecting vocabulary drift. Not sensitive to word order.

Algorithm: Tokenizes on whitespace + punctuation, lowercases, then computes set intersection over union.
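A minimal sketch of this mode (the `\W+` split is a simplification of the whitespace-plus-punctuation tokenizer described above):

```typescript
// Lowercase and split on non-word characters to get a token set.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard index: |A ∩ B| / |A ∪ B| over the two token sets.
function jaccardSimilarity(a: string, b: string): number {
  const setA = tokenSet(a);
  const setB = tokenSet(b);
  if (setA.size === 0 && setB.size === 0) return 1.0;
  let intersection = 0;
  for (const token of setA) {
    if (setB.has(token)) intersection++;
  }
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}
```

Because only set membership matters, reordering the same words scores 1.0 — which is exactly why this mode suits vocabulary drift rather than structure.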
### `cosine`

TF-vector cosine similarity: cos(A, B) = (A · B) / (|A| × |B|).

```bash
promptdiff test --prompt my-prompt --mode cosine --threshold 0.85
```

Best for: general-purpose prompt regression testing. Sensitive to vocabulary but not minor reorderings.

Algorithm: Builds term-frequency sparse vectors, then computes dot product and magnitudes. No external dependencies.
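A sketch of the sparse-vector approach, assuming the same simple tokenizer as the Jaccard sketch (illustrative, not the actual implementation):

```typescript
// Build a sparse term-frequency vector: token -> count.
function termFrequencies(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    tf.set(token, (tf.get(token) ?? 0) + 1);
  }
  return tf;
}

// cos(A, B) = (A · B) / (|A| * |B|) over the sparse vectors.
function cosineSimilarity(a: string, b: string): number {
  const va = termFrequencies(a);
  const vb = termFrequencies(b);
  // Iterate only the smaller vector's terms for the dot product.
  const [small, large] = va.size <= vb.size ? [va, vb] : [vb, va];
  let dot = 0;
  for (const [term, count] of small) {
    dot += count * (large.get(term) ?? 0);
  }
  const magnitude = (v: Map<string, number>): number =>
    Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  const denom = magnitude(va) * magnitude(vb);
  return denom === 0 ? 0 : dot / denom;
}
```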
### `structural`

JSON schema validation. Returns 1.0 if the output matches the schema, or a partial score based on the violation count.

```bash
promptdiff test \
  --prompt json-extractor \
  --file response.json \
  --mode structural \
  --schema schema.json
```

Best for: prompts that must produce structured JSON output (extractors, classifiers, etc.).

Schema file example (`schema.json`):

```json
{
  "type": "object",
  "required": ["entities", "sentiment"],
  "properties": {
    "entities": {
      "type": "array",
      "items": { "type": "string" }
    },
    "sentiment": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"]
    }
  }
}
```

### Choosing a mode

| Mode | Recommended Threshold | When to Use |
|---|---|---|
| exact | 1.0 (implicit) | Deterministic outputs |
| levenshtein | 0.90 | Short outputs, typo tolerance |
| jaccard | 0.70 | Bag-of-words vocabulary match |
| cosine | 0.85 | General regression testing |
| structural | 1.0 (implicit) | JSON schema compliance |
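Returning to `structural` mode, its partial-score rule could be sketched as below. The supported schema keywords and the 0.25 per-violation penalty are illustrative assumptions, not the tool's exact behavior:

```typescript
// Hypothetical sketch: validate against a small subset of JSON Schema
// (object/array/string types, required keys, string enums) and reduce
// the score by an assumed 0.25 per violation.
type MiniSchema = {
  type?: string;
  required?: string[];
  properties?: Record<string, MiniSchema>;
  enum?: unknown[];
};

function structuralScore(value: unknown, schema: MiniSchema): number {
  const violations: string[] = [];
  const check = (v: unknown, s: MiniSchema, path: string): void => {
    if (s.type === "object") {
      if (typeof v !== "object" || v === null || Array.isArray(v)) {
        violations.push(`${path}: expected object`);
        return;
      }
      const obj = v as Record<string, unknown>;
      for (const key of s.required ?? []) {
        if (!(key in obj)) violations.push(`${path}.${key}: missing`);
      }
      for (const [key, sub] of Object.entries(s.properties ?? {})) {
        if (key in obj) check(obj[key], sub, `${path}.${key}`);
      }
    } else if (s.type === "array") {
      if (!Array.isArray(v)) violations.push(`${path}: expected array`);
    } else if (s.type === "string") {
      if (typeof v !== "string") violations.push(`${path}: expected string`);
      else if (s.enum && !s.enum.includes(v)) violations.push(`${path}: not in enum`);
    }
  };
  check(value, schema, "$");
  return violations.length === 0 ? 1.0 : Math.max(0, 1 - 0.25 * violations.length);
}
```

A production validator covers far more keywords (nested items, numbers, formats); this only shows how a violation count can become a graded score rather than a binary pass/fail.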
## Configuration

`.promptdiff.yaml` (auto-generated by `init`):

```yaml
# Directory to store promptdiff data
storageDir: .promptdiff

# Default similarity mode: exact | levenshtein | jaccard | cosine | structural
defaultMode: cosine

# Default similarity threshold (0-1)
defaultThreshold: 0.85

# Report format: json | text | markdown
reportFormat: text

# Provider information (optional, stored in snapshots for reference)
provider:
  name: openai
  model: gpt-4o

# Test suites (optional, used by `report --suite`)
suites:
  - name: core-prompts
    prompts:
      - system-prompt
      - user-greeting
    mode: cosine
    threshold: 0.90
```

## CI integration

GitHub Actions example:

```yaml
name: Prompt Regression Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install -g promptdiff
      - name: Run prompt regression tests
        run: |
          # Test each prompt against stored golden outputs
          cat outputs/system-prompt.txt | promptdiff test \
            --prompt system-prompt \
            --mode cosine \
            --threshold 0.90
          cat outputs/user-greeting.txt | promptdiff test \
            --prompt user-greeting \
            --mode exact
      - name: Generate markdown report
        if: always()
        run: promptdiff report --format markdown --output regression-report.md
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: regression-report
          path: regression-report.md
```

The exit code alone is enough to gate a pipeline:

```bash
# Will fail CI if regression detected
promptdiff test --prompt my-prompt --file output.txt --threshold 0.85
echo "Exit: $?"  # 0=pass, 1=regression, 2=error
```

## Storage layout

All data lives in `.promptdiff/` (or your configured `storageDir`):
```
.promptdiff/
├── prompts/
│   └── my-prompt/
│       ├── latest                # Contains latest version hash
│       └── versions/
│           ├── abc123...json     # Prompt version 1
│           └── def456...json     # Prompt version 2
└── snapshots/
    └── my-prompt/
        ├── latest                # Contains latest snapshot ID
        ├── abc123def456.json     # Snapshot 1
        └── 789abc012def.json     # Snapshot 2
```
Prompt version record:

```json
{
  "id": "abc123def456",
  "name": "my-prompt",
  "content": "You are a helpful assistant...",
  "hash": "sha256-hex-64-chars",
  "createdAt": "2024-01-15T12:00:00.000Z",
  "tags": ["production", "v2"],
  "metadata": {
    "author": "alice"
  }
}
```

Snapshot record:

```json
{
  "id": "abc123def456789a",
  "promptName": "my-prompt",
  "promptHash": "sha256-of-prompt-content",
  "content": "How can I help you today?",
  "model": "gpt-4o",
  "provider": "openai",
  "capturedAt": "2024-01-15T12:00:00.000Z",
  "metadata": {}
}
```

## Architecture

```
src/
├── cli.ts             # CLI entry point (commander.js)
├── config.ts          # Config loading (.promptdiff.yaml)
├── types.ts           # TypeScript types
├── core/
│   ├── versioner.ts   # Prompt version tracking (SHA-256)
│   ├── snapshot.ts    # Golden snapshot management
│   ├── differ.ts      # Text diff engine (LCS-based)
│   ├── tester.ts      # Regression test runner
│   └── reporter.ts    # Report generation (JSON/text/markdown)
└── similarity/
    ├── index.ts       # Unified similarity interface
    ├── levenshtein.ts # Edit distance (two-row DP)
    ├── jaccard.ts     # Token-set Jaccard index
    ├── cosine.ts      # TF-vector cosine similarity
    └── structural.ts  # JSON schema validator
```

## Design notes
**Content-addressed versioning:** Prompts are stored by SHA-256 hash of their content. Saving the same prompt twice returns the same version — no duplicates. The `latest` file is a pointer to the most recent hash.
**Deterministic snapshot IDs:** Snapshot IDs are derived from hash(promptHash + content), so capturing the same output for the same prompt version is idempotent.
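A sketch of that derivation (the helper name and the 16-character truncation are illustrative assumptions):

```typescript
import { createHash } from "node:crypto";

// Deterministic snapshot ID: hash the prompt's content hash together with
// the captured output, so identical captures yield identical IDs.
// The exact derivation in promptdiff may differ in detail.
function snapshotId(promptHash: string, content: string): string {
  return createHash("sha256")
    .update(promptHash + content)
    .digest("hex")
    .slice(0, 16); // truncated for readability (assumed length)
}
```

Re-running a capture is then a no-op: the same (prompt version, output) pair always maps to the same file on disk.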
**Zero ML dependencies:** All similarity algorithms are implemented from scratch using classical NLP techniques. No scikit-learn, no sentence-transformers, no API calls. This keeps the tool fast, offline-capable, and dependency-free.

**Sparse vector representation:** Cosine similarity uses `Map<string, number>` for TF vectors instead of dense arrays, making it efficient for long prompts with large vocabularies.
**LCS-based diffing:** The differ uses a true LCS (Longest Common Subsequence) algorithm for small inputs and a fast greedy LCS for large inputs (>100K cells), preventing memory exhaustion on large prompts.
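The small-input path can be sketched as the classic dynamic-programming LCS over lines (the greedy fallback for large inputs is omitted here):

```typescript
// Longest Common Subsequence length over line arrays — the basis for
// deciding which lines a diff reports as added or removed. The full table
// costs O(n*m) time and space, which is why a cheaper greedy variant
// takes over once n*m grows past ~100K cells.
function lcsLength(a: string[], b: string[]): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0)
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] =
        a[i - 1] === b[j - 1]
          ? dp[i - 1][j - 1] + 1            // lines match: extend the LCS
          : Math.max(dp[i - 1][j], dp[i][j - 1]); // otherwise carry the best so far
    }
  }
  return dp[a.length][b.length];
}
```

Lines outside the LCS are exactly the insertions and deletions the diff prints.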
## Development

```bash
# Install dependencies
npm install

# Run tests
npm test

# Watch mode
npm run test:watch

# Build
npm run build

# Type check
npm run typecheck
```

## License

MIT — see LICENSE.
Built as part of the Agent Company pipeline — Round 73.