promptdiff


Prompt regression testing CLI — detect drift when your prompts change.

Like apidiff for APIs, but for LLM prompts.


Why This Exists

Prompts are code. But unlike code, there's no test suite that catches when your system prompt changes silently and your model starts giving worse answers.

promptdiff gives you:

  • Version tracking for your prompts (SHA-256 content hashing)
  • Golden snapshot captures of expected outputs
  • Regression detection using 5 similarity algorithms
  • CI-ready exit codes and machine-readable reports

When your prompt changes from "Be concise" to "Be thorough and detailed", promptdiff catches the drift before your users do.


Installation

npm install -g promptdiff
# or
npx promptdiff init

Quick Start

# 1. Initialize in your project
promptdiff init

# 2. Save your prompt + golden output as a snapshot
echo "You are a helpful assistant." > prompt.txt
echo "How can I help you today?" > golden-output.txt
promptdiff snapshot assistant \
  --prompt-file prompt.txt \
  --output-file golden-output.txt

# 3. Later, test a new model output against the snapshot
echo "What can I do for you?" | promptdiff test --prompt assistant --mode cosine --threshold 0.85
# exit 0 = pass, exit 1 = regression detected

# 4. See what changed between prompt versions
promptdiff diff abc123 def456 --prompt assistant

# 5. Generate a full report
promptdiff report --format markdown --output report.md

CLI Commands

promptdiff init

Initialize promptdiff in the current directory.

promptdiff init [--dir <dir>]

Creates:

  • .promptdiff/ — storage directory for versions and snapshots
  • .promptdiff.yaml — configuration file

promptdiff snapshot <name>

Capture a golden output snapshot for a prompt.

# From files
promptdiff snapshot my-system-prompt \
  --prompt-file prompt.txt \
  --output-file expected-output.txt

# With model metadata
promptdiff snapshot my-prompt \
  --prompt-file prompt.txt \
  --output-file expected.txt \
  --model gpt-4o \
  --provider openai \
  --tag "production,v2"

# Overwrite existing snapshot
promptdiff snapshot my-prompt \
  --prompt-file prompt.txt \
  --output-file new-expected.txt \
  --force

Options:

Flag Description
--prompt-file <file> Read prompt content from file
--output-file <file> Golden output file
--model <model> Model used to generate (informational)
--provider <provider> Provider name (informational)
--tag <tags> Comma-separated tags
--force Overwrite existing snapshot

promptdiff test

Run a regression test against a stored snapshot.

# Pipe actual output
echo "actual model output" | promptdiff test --prompt my-prompt --mode cosine

# From file
promptdiff test \
  --prompt my-prompt \
  --file actual-output.txt \
  --mode cosine \
  --threshold 0.90

# Exact match mode
promptdiff test --prompt my-prompt --file output.txt --mode exact

# Structural (JSON schema) mode
promptdiff test \
  --prompt my-json-prompt \
  --file response.json \
  --mode structural \
  --schema schema.json

# Output as JSON (for CI parsing)
promptdiff test --prompt my-prompt --file out.txt --json

Options:

Flag Description
--prompt <name> Prompt name to test (required)
--file <file> Actual output file to test
--stdin Read actual output from stdin
--snapshot <id> Specific snapshot ID (default: latest)
--mode <mode> Similarity mode (default: cosine)
--threshold <value> Similarity threshold 0-1 (default: 0.85)
--schema <file> JSON schema file for structural mode
--json Output result as JSON
--quiet Only exit code, no output

Exit codes:

  • 0 — Test passed
  • 1 — Regression detected (score below threshold)
  • 2 — Error (missing files, bad config, etc.)

promptdiff diff <v1> <v2>

Show the diff between two prompt versions.

# Diff by hash prefix
promptdiff diff abc123 def456 --prompt my-prompt

# With more context
promptdiff diff abc123 def456 --prompt my-prompt --context 5

# JSON output
promptdiff diff abc123 def456 --prompt my-prompt --json

Options:

Flag Description
--prompt <name> Prompt name (required)
--context <lines> Context lines in diff (default: 3)
--json Output as JSON

promptdiff report

Generate a full regression report for all tracked prompts.

# Text report (default)
promptdiff report

# Markdown report to file
promptdiff report --format markdown --output report.md

# JSON report for CI
promptdiff report --format json --output report.json

# Override similarity settings
promptdiff report --mode levenshtein --threshold 0.90

Options:

Flag Description
--format <fmt> json, text, or markdown (default: text)
--output <file> Write report to file
--suite <name> Run only prompts in a named suite from config
--mode <mode> Similarity mode for all tests
--threshold <value> Override threshold

promptdiff list

List all tracked prompts and their versions.

# List all prompts
promptdiff list

# List versions for a specific prompt
promptdiff list --prompt my-prompt

# JSON output
promptdiff list --json

Similarity Modes

promptdiff ships with 5 similarity algorithms, all implemented from scratch — no ML dependencies.

exact

Byte-for-byte equality. Score is either 1.0 (match) or 0.0 (no match).

promptdiff test --prompt my-prompt --mode exact

Best for: deterministic outputs, templated responses.

levenshtein

Normalized Levenshtein edit distance. Score = 1 - (edits / maxLen).

promptdiff test --prompt my-prompt --mode levenshtein --threshold 0.90

Best for: short outputs where character-level changes matter (e.g., structured codes, IDs).

Algorithm: Standard DP with two-row optimization for O(n*m) time and O(n) space.
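The two-row DP can be sketched as follows (an illustrative implementation matching the description above, not necessarily the actual promptdiff source):

```typescript
// Normalized Levenshtein similarity: 1 - distance / max(a.length, b.length).
// Uses two rows instead of the full matrix, for O(n*m) time and O(n) space.
export function levenshteinScore(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  if (maxLen === 0) return 1; // both strings empty
  // prev[j] = edit distance between a[0..i-1] and b[0..j-1]
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  let curr = new Array<number>(b.length + 1);
  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution
      );
    }
    [prev, curr] = [curr, prev];
  }
  return 1 - prev[b.length] / maxLen;
}
```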

jaccard

Token-level Jaccard index: |A ∩ B| / |A ∪ B|.

promptdiff test --prompt my-prompt --mode jaccard --threshold 0.70

Best for: detecting vocabulary drift. Not sensitive to word order.

Algorithm: Tokenizes on whitespace + punctuation, lowercases, then computes set intersection over union.
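A minimal sketch of that algorithm (illustrative; the exact tokenizer in promptdiff may differ):

```typescript
// Lowercase and split on whitespace and punctuation, keeping unique tokens.
function tokenize(text: string): Set<string> {
  return new Set(
    text.toLowerCase().split(/[\s\p{P}]+/u).filter((t) => t.length > 0),
  );
}

// Jaccard index over token sets: |A ∩ B| / |A ∪ B|.
export function jaccardScore(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.size === 0 && tb.size === 0) return 1; // both empty: identical
  let intersection = 0;
  for (const t of ta) if (tb.has(t)) intersection++;
  const union = ta.size + tb.size - intersection;
  return intersection / union;
}
```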

cosine (default)

TF-vector cosine similarity. cos(A, B) = (A · B) / (|A| × |B|).

promptdiff test --prompt my-prompt --mode cosine --threshold 0.85

Best for: general-purpose prompt regression testing. Sensitive to vocabulary but not minor reorderings.

Algorithm: Builds term-frequency sparse vectors, then computes dot product and magnitudes. No external dependencies.
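A sketch of the sparse TF-vector approach, using `Map<string, number>` as described (assumed shape, not the actual source):

```typescript
// Build a sparse term-frequency vector for a text.
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const tok of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    tf.set(tok, (tf.get(tok) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity: dot product over the product of magnitudes.
export function cosineScore(a: string, b: string): number {
  const va = termFreq(a);
  const vb = termFreq(b);
  let dot = 0;
  for (const [term, fa] of va) dot += fa * (vb.get(term) ?? 0);
  const mag = (v: Map<string, number>) =>
    Math.sqrt([...v.values()].reduce((sum, f) => sum + f * f, 0));
  const denom = mag(va) * mag(vb);
  return denom === 0 ? 0 : dot / denom;
}
```

Because TF vectors ignore order, "hello world" and "world hello" score 1.0, which is why cosine tolerates minor reorderings.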

structural

JSON schema validation. Returns 1.0 if the output matches the schema, or a partial score based on the number of violations.

promptdiff test \
  --prompt json-extractor \
  --file response.json \
  --mode structural \
  --schema schema.json

Best for: prompts that must produce structured JSON output (extractors, classifiers, etc.).

Schema file example (schema.json):

{
  "type": "object",
  "required": ["entities", "sentiment"],
  "properties": {
    "entities": {
      "type": "array",
      "items": { "type": "string" }
    },
    "sentiment": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"]
    }
  }
}
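One plausible shape of the partial scoring is sketched below. This is hypothetical: it assumes a score of `1 - violations / checks` over a small schema subset (`type`, `required`, `properties`, `items`, `enum`); promptdiff's actual validator and formula may differ.

```typescript
type Schema = {
  type?: string;
  required?: string[];
  properties?: Record<string, Schema>;
  items?: Schema;
  enum?: unknown[];
};

// JSON-style type name for a value ("null", "array", "object", "string", ...).
function jsonType(v: unknown): string {
  if (v === null) return "null";
  if (Array.isArray(v)) return "array";
  return typeof v;
}

// Walk the schema, counting checks and violations; score = 1 - violations/checks.
export function structuralScore(value: unknown, schema: Schema): number {
  let checks = 0;
  let violations = 0;
  const walk = (v: unknown, s: Schema): void => {
    if (s.type) {
      checks++;
      if (jsonType(v) !== s.type) { violations++; return; }
    }
    if (s.enum) {
      checks++;
      if (!s.enum.includes(v)) violations++;
    }
    if (s.type === "object" && typeof v === "object" && v !== null) {
      const obj = v as Record<string, unknown>;
      for (const key of s.required ?? []) {
        checks++;
        if (!(key in obj)) violations++;
      }
      for (const [key, sub] of Object.entries(s.properties ?? {})) {
        if (key in obj) walk(obj[key], sub);
      }
    }
    if (s.type === "array" && Array.isArray(v)) {
      for (const item of v) walk(item, s.items ?? {});
    }
  };
  walk(value, schema);
  return checks === 0 ? 1 : 1 - violations / checks;
}
```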

Threshold Guide

Mode Recommended Threshold When to Use
exact 1.0 (implicit) Deterministic outputs
levenshtein 0.90 Short outputs, typo tolerance
jaccard 0.70 Bag-of-words vocabulary match
cosine 0.85 General regression testing
structural 1.0 (implicit) JSON schema compliance

Configuration

.promptdiff.yaml (auto-generated by init):

# Directory to store promptdiff data
storageDir: .promptdiff

# Default similarity mode: exact | levenshtein | jaccard | cosine | structural
defaultMode: cosine

# Default similarity threshold (0-1)
defaultThreshold: 0.85

# Report format: json | text | markdown
reportFormat: text

# Provider information (optional, stored in snapshots for reference)
provider:
  name: openai
  model: gpt-4o

# Test suites (optional, used by `report --suite`)
suites:
  - name: core-prompts
    prompts:
      - system-prompt
      - user-greeting
    mode: cosine
    threshold: 0.90

CI Integration

GitHub Actions

name: Prompt Regression Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install -g promptdiff

      - name: Run prompt regression tests
        run: |
          # Test each prompt against stored golden outputs
          cat outputs/system-prompt.txt | promptdiff test \
            --prompt system-prompt \
            --mode cosine \
            --threshold 0.90
          
          cat outputs/user-greeting.txt | promptdiff test \
            --prompt user-greeting \
            --mode exact

      - name: Generate markdown report
        if: always()
        run: promptdiff report --format markdown --output regression-report.md

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: regression-report
          path: regression-report.md

Exit Codes in CI

# Will fail CI if regression detected
promptdiff test --prompt my-prompt --file output.txt --threshold 0.85
echo "Exit: $?"  # 0=pass, 1=regression, 2=error

Storage Format

All data lives in .promptdiff/ (or your configured storageDir):

.promptdiff/
├── prompts/
│   └── my-prompt/
│       ├── latest           # Contains latest version hash
│       └── versions/
│           ├── abc123...json   # Prompt version 1
│           └── def456...json   # Prompt version 2
└── snapshots/
    └── my-prompt/
        ├── latest           # Contains latest snapshot ID
        ├── abc123def456.json   # Snapshot 1
        └── 789abc012def.json   # Snapshot 2

Version File Format

{
  "id": "abc123def456",
  "name": "my-prompt",
  "content": "You are a helpful assistant...",
  "hash": "sha256-hex-64-chars",
  "createdAt": "2024-01-15T12:00:00.000Z",
  "tags": ["production", "v2"],
  "metadata": {
    "author": "alice"
  }
}

Snapshot File Format

{
  "id": "abc123def456789a",
  "promptName": "my-prompt",
  "promptHash": "sha256-of-prompt-content",
  "content": "How can I help you today?",
  "model": "gpt-4o",
  "provider": "openai",
  "capturedAt": "2024-01-15T12:00:00.000Z",
  "metadata": {}
}

Architecture

src/
├── cli.ts                    # CLI entry point (commander.js)
├── config.ts                 # Config loading (.promptdiff.yaml)
├── types.ts                  # TypeScript types
├── core/
│   ├── versioner.ts          # Prompt version tracking (SHA-256)
│   ├── snapshot.ts           # Golden snapshot management
│   ├── differ.ts             # Text diff engine (LCS-based)
│   ├── tester.ts             # Regression test runner
│   └── reporter.ts           # Report generation (JSON/text/markdown)
└── similarity/
    ├── index.ts              # Unified similarity interface
    ├── levenshtein.ts        # Edit distance (two-row DP)
    ├── jaccard.ts            # Token-set Jaccard index
    ├── cosine.ts             # TF-vector cosine similarity
    └── structural.ts         # JSON schema validator

Core Design Decisions

Content-addressed versioning: Prompts are stored by SHA-256 hash of their content. Saving the same prompt twice returns the same version — no duplicates. The latest file is a pointer to the most recent hash.

Deterministic snapshot IDs: Snapshot IDs are derived from hash(promptHash + content), so capturing the same output for the same prompt version is idempotent.
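Both derivations can be sketched with Node's built-in crypto module. The 16-character truncation for snapshot IDs is an assumption inferred from the ID format shown above, not confirmed by the source:

```typescript
import { createHash } from "node:crypto";

const sha256 = (data: string): string =>
  createHash("sha256").update(data, "utf8").digest("hex");

// Prompt versions are keyed by the hash of their content, so saving the
// same text twice yields the same version ID (no duplicates).
export const promptVersionId = (content: string): string => sha256(content);

// Snapshot IDs derive from hash(promptHash + content), so re-capturing the
// same output for the same prompt version is idempotent.
export const snapshotId = (promptHash: string, output: string): string =>
  sha256(promptHash + output).slice(0, 16);
```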

Zero ML dependencies: All similarity algorithms are implemented from scratch using classical NLP techniques. No scikit-learn, no sentence-transformers, no API calls. This keeps the tool fast, offline-capable, and dependency-free.

Sparse vector representation: Cosine similarity uses Map<string, number> for TF vectors instead of dense arrays, making it efficient for long prompts with large vocabularies.

LCS-based diffing: The differ uses a true LCS (Longest Common Subsequence) algorithm for small inputs and a fast greedy LCS for large inputs (>100K cells), preventing memory exhaustion on large prompts.
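The true-LCS path can be sketched as a line-level diff (illustrative only; the greedy fallback for large inputs is omitted):

```typescript
// Line-level diff via an LCS table: dp[i][j] = LCS length of a[i..] and b[j..].
export function lcsDiff(a: string[], b: string[]): string[] {
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      dp[i][j] =
        a[i] === b[j]
          ? dp[i + 1][j + 1] + 1
          : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  // Walk the table, emitting unchanged ("  "), removed ("- "), added ("+ ") lines.
  const out: string[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(`  ${a[i]}`);
      i++;
      j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      out.push(`- ${a[i++]}`);
    } else {
      out.push(`+ ${b[j++]}`);
    }
  }
  while (i < a.length) out.push(`- ${a[i++]}`);
  while (j < b.length) out.push(`+ ${b[j++]}`);
  return out;
}
```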


Development

# Install dependencies
npm install

# Run tests
npm test

# Watch mode
npm run test:watch

# Build
npm run build

# Type check
npm run typecheck

License

MIT — see LICENSE


Built as part of the Agent Company pipeline — Round 73.
