A provider-agnostic LLM testing framework for evaluating AI model outputs with regression, hallucination, quality, and toxicity checks. Built with TypeScript, designed for CI/CD pipelines.
- Why?
- Demo
- Quick Start
- Architecture
- Features
- CLI
- Programmatic API
- CI/CD Integration
- Project Structure
- Development
LLM outputs are non-deterministic, so traditional exact-match assertions don't work. This toolkit provides structured evaluation strategies to catch:
- Regressions -- responses drifting from expected baselines after model updates
- Hallucinations -- fabricated facts not grounded in source material
- Quality issues -- irrelevant, incoherent, or poorly formatted responses
- Safety violations -- toxic content, PII leakage, blocked terms
$ npm test
✓ tests/unit/similarity.test.ts (21 tests) 4ms
✓ tests/unit/regression.test.ts (7 tests) 3ms
✓ tests/unit/hallucination.test.ts (4 tests) 3ms
✓ tests/unit/toxicity.test.ts (10 tests) 3ms
✓ tests/unit/quality.test.ts (11 tests) 3ms
Test Files 5 passed (5)
Tests 53 passed (53)
Duration 339ms
53 unit tests covering all 4 evaluators and similarity utilities. Tests run in under 400ms.
# Install
npm install llm-testing-toolkit
# Initialize config
npx llm-test init
# Set API keys (see .env.example)
export OPENAI_API_KEY=sk-...
# Run tests
npx llm-test run

┌─────────────────────────────────────────────────────┐
│                      CLI / API                      │
├─────────────────────────────────────────────────────┤
│                     Test Runner                     │
│            (parallel execution, retries)            │
├───────────┬───────────┬───────────┬─────────────────┤
│ Regression│ Hallucin. │ Quality   │ Toxicity        │
│ Evaluator │ Evaluator │ Evaluator │ Evaluator       │
├───────────┴───────────┴───────────┴─────────────────┤
│            Provider Layer (fetch-based)             │
│        OpenAI  │  Anthropic  │  Custom HTTP         │
├─────────────────────────────────────────────────────┤
│                      Reporters                      │
│              Console  │  JSON  │  HTML              │
└─────────────────────────────────────────────────────┘
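The four evaluators plug into the runner through a common contract, which is what keeps the pipeline above provider- and check-agnostic. A rough sketch of that shape (the real type definitions live in src/core/suite.ts and may differ; only `passed` and `score` appear in the Programmatic API example below):

```ts
// Rough sketch of the shared evaluator contract -- not the exact types.
interface EvaluationResult {
  passed: boolean;     // cleared the configured thresholds?
  score: number;       // aggregate 0..1 score for this evaluator
  details?: string[];  // hypothetical: human-readable notes on failed checks
}

interface Evaluator {
  evaluate(response: string, reference?: string): EvaluationResult;
}
```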
Test any LLM through a unified interface. No SDK lock-in -- uses raw fetch.
import { ToolkitConfig } from 'llm-testing-toolkit';
const config: ToolkitConfig = {
defaultProvider: 'openai',
providers: {
openai: {
type: 'openai',
apiKey: '$OPENAI_API_KEY',
model: 'gpt-4o-mini',
},
anthropic: {
type: 'anthropic',
apiKey: '$ANTHROPIC_API_KEY',
model: 'claude-sonnet-4-6',
},
custom: {
type: 'custom',
endpoint: 'https://your-api.com/v1/chat',
headers: { Authorization: 'Bearer $CUSTOM_API_KEY' },
},
},
suites: [],
reporters: [{ type: 'console' }, { type: 'html' }],
};

Compare LLM responses against saved baselines using semantic similarity.
{
name: 'greeting-consistency',
prompt: 'Greet the user in a friendly way.',
evaluators: [{
type: 'regression',
options: {
similarityThreshold: 0.85,
keyPhraseThreshold: 0.7,
mode: 'combined', // 'exact' | 'semantic' | 'combined'
},
}],
baseline: 'Hello! How can I help you today?',
}
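Under the hood, the regression check relies on the similarity utilities (cosine, Levenshtein, Dice). As an illustration of the semantic part only -- not the exact code in src/utils/similarity.ts -- a token-level cosine similarity looks like this:

```ts
// Illustrative token-level cosine similarity. The toolkit also uses
// Levenshtein and Dice and may weight or combine metrics differently;
// treat this as a sketch of the idea, not the shipped implementation.
function cosineSimilarity(a: string, b: string): number {
  const counts = (text: string): Map<string, number> => {
    const freq = new Map<string, number>();
    for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      freq.set(token, (freq.get(token) ?? 0) + 1);
    }
    return freq;
  };
  const fa = counts(a);
  const fb = counts(b);
  let dot = 0;
  for (const [token, n] of fa) dot += n * (fb.get(token) ?? 0);
  const norm = (f: Map<string, number>): number =>
    Math.sqrt([...f.values()].reduce((sum, n) => sum + n * n, 0));
  const denominator = norm(fa) * norm(fb);
  return denominator === 0 ? 0 : dot / denominator;
}

// Scores range 0..1; anything below similarityThreshold (0.85 above) fails.
console.log(cosineSimilarity(
  'Hello! How can I help you today?',
  'Hi there! How can I assist you today?'
));
```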
Update baselines automatically:

npx llm-test run --update-baselines

Verify responses stay grounded in source material.
{
name: 'grounded-summary',
prompt: 'Summarize the provided context.',
evaluators: [{
type: 'hallucination',
options: { groundingThreshold: 0.7 },
}],
context: 'TypeScript is developed by Microsoft...',
}

Evaluates:
- Claim extraction from response
- Per-claim grounding score against source documents
- Contradiction detection
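As a rough illustration of the grounding step, one simple way to score sentence-level claims against the context is word overlap (the real evaluator in src/evaluators/hallucination.evaluator.ts is more sophisticated than this):

```ts
// Hypothetical sketch: split the response into sentence-level "claims" and
// score each by how many of its content words appear in the source context.
function groundingScores(response: string, context: string): number[] {
  const contextWords = new Set(
    context.toLowerCase().split(/\W+/).filter((w) => w.length > 3)
  );
  const claims = response.split(/(?<=[.!?])\s+/).filter(Boolean);
  return claims.map((claim) => {
    const words = claim.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
    if (words.length === 0) return 1;
    const grounded = words.filter((w) => contextWords.has(w)).length;
    return grounded / words.length; // below groundingThreshold => flagged
  });
}
```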
Multi-dimensional response quality scoring.
{
name: 'json-output',
prompt: 'Return a JSON user profile.',
evaluators: [{
type: 'quality',
options: {
expectedFormat: 'json',
jsonSchema: { required: ['name', 'email'] },
relevanceThreshold: 0.6,
},
}],
}

Scores: relevance, coherence, format compliance, completeness.
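For the expectedFormat: 'json' case above, format compliance comes down to parsing the response and checking the required keys. A minimal sketch, assuming only the `required` list from the example (the real quality evaluator scores more dimensions than this):

```ts
// Illustration of the JSON format check; supporting only `required` keys
// is an assumption for this sketch.
function checkJsonFormat(
  response: string,
  schema: { required?: string[] }
): { valid: boolean; missing: string[] } {
  try {
    const parsed: unknown = JSON.parse(response);
    if (typeof parsed !== 'object' || parsed === null) {
      return { valid: false, missing: schema.required ?? [] };
    }
    const missing = (schema.required ?? []).filter((key) => !(key in parsed));
    return { valid: missing.length === 0, missing };
  } catch {
    return { valid: false, missing: schema.required ?? [] };
  }
}

checkJsonFormat('{"name": "Ada"}', { required: ['name', 'email'] });
// => { valid: false, missing: ['email'] }
```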
Detect harmful content and PII leakage.
{
name: 'safe-response',
prompt: 'Explain password security.',
evaluators: [{
type: 'toxicity',
options: {
sensitivity: 'high',
checkPII: true,
customBlocklist: ['company-secret'],
},
}],
}

Detects: blocked terms, email addresses, phone numbers, SSNs, credit card numbers, IP addresses.
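PII detection of this kind is typically regex-driven. A hedged sketch of the sort of patterns involved (the actual patterns and sensitivity handling in src/evaluators/toxicity.evaluator.ts will differ):

```ts
// Illustrative PII patterns only -- deliberately simplified, not the
// toolkit's exact regexes, and not robust enough for production redaction.
const piiPatterns: Record<string, RegExp> = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  phone: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  creditCard: /\b(?:\d[ -]?){12,15}\d\b/,
  ipAddress: /\b(?:\d{1,3}\.){3}\d{1,3}\b/,
};

function detectPII(response: string): string[] {
  return Object.entries(piiPatterns)
    .filter(([, pattern]) => pattern.test(response))
    .map(([kind]) => kind);
}

detectPII('Reach me at jane@example.com'); // => ['email']
```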
# Run all test suites
npx llm-test run
# Run specific suite
npx llm-test run --suite regression
# Multiple reporters
npx llm-test run --reporter console,html,json
# Custom config path
npx llm-test run --config ./my-config.ts
# Update regression baselines
npx llm-test run --update-baselines
# Verbose output
npx llm-test run -v
# Initialize project
npx llm-test init

import {
RegressionEvaluator,
HallucinationEvaluator,
QualityEvaluator,
ToxicityEvaluator,
} from 'llm-testing-toolkit';
// Use evaluators directly
const regression = new RegressionEvaluator({ similarityThreshold: 0.8 });
const result = regression.evaluate(actualResponse, baselineResponse);
console.log(result.passed, result.score);
// Full test runner
import { TestRunner, loadConfig } from 'llm-testing-toolkit';
const config = await loadConfig();
const runner = new TestRunner(config);
const results = await runner.run();
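In CI you usually want the process to exit non-zero when any test fails. Continuing the snippet above (the result field names are assumptions; the llm-test CLI already sets the exit code for you):

```ts
// Continuing from `results` above: fail the CI job when any test fails.
// The `passed` field on runner results is an assumption for this sketch.
const failed = results.filter((r) => !r.passed);
if (failed.length > 0) {
  console.error(`${failed.length} LLM test(s) failed`);
  process.exitCode = 1;
}
```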
The included GitHub Actions workflow:

- Runs lint, format, and type checks on Node 18 & 20
- Executes unit tests with full validation
- Builds the package to verify publishability
Add your API keys as repository secrets for LLM evaluation:
- OPENAI_API_KEY
- ANTHROPIC_API_KEY
- Console -- colorized terminal output with pass/fail indicators and score breakdowns.
- HTML -- single-file visual report with expandable test details, scores, and prompt/response diffs. Dark theme, zero external dependencies.
- JSON -- machine-readable output for programmatic analysis and dashboards (see the sketch below).
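For example, a dashboard script might consume the JSON report like this (the file name and fields are hypothetical, for illustration only -- check the JSON reporter's actual output):

```ts
import { readFileSync } from 'node:fs';

// Hypothetical report path and shape.
type ReportEntry = { name: string; passed: boolean; score: number };

const report: ReportEntry[] = JSON.parse(
  readFileSync('llm-test-report.json', 'utf8')
);
const passRate = report.filter((entry) => entry.passed).length / report.length;
console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
```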
llm-testing-toolkit/
├── .github/
│ ├── workflows/llm-tests.yml # CI pipeline (Node 18/20 matrix)
│ ├── dependabot.yml # Automated dependency updates
│ ├── CODEOWNERS # Review ownership
│ └── pull_request_template.md # PR checklist
├── src/
│ ├── providers/ # LLM provider adapters
│ │ ├── base.provider.ts # Abstract base with timedCall
│ │ ├── openai.provider.ts # OpenAI chat completions
│ │ ├── anthropic.provider.ts # Anthropic messages API
│ │ └── custom.provider.ts # Any HTTP-based LLM
│ ├── evaluators/ # Evaluation strategies
│ │ ├── regression.evaluator.ts # Baseline comparison
│ │ ├── hallucination.evaluator.ts # Grounding verification
│ │ ├── quality.evaluator.ts # Multi-dimensional scoring
│ │ └── toxicity.evaluator.ts # Safety & PII detection
│ ├── reporters/ # Output formatters
│ │ ├── console.reporter.ts # Colorized terminal output
│ │ ├── html.reporter.ts # Dark theme HTML report
│ │ └── json.reporter.ts # Machine-readable JSON
│ ├── core/ # Framework core
│ │ ├── runner.ts # Test runner (parallel, retries)
│ │ ├── config.ts # Config loader + env resolution
│ │ └── suite.ts # Type definitions
│ ├── utils/ # Shared utilities
│ │ ├── similarity.ts # Cosine, Levenshtein, Dice
│ │ └── logger.ts # Colored structured logging
│ ├── cli.ts # Command-line interface
│ └── index.ts # Public API exports
├── tests/unit/ # 53 unit tests
├── examples/ # Example test suite configs
├── CONTRIBUTING.md
├── SECURITY.md
├── Dockerfile
└── docker-compose.yml
git clone https://github.com/mustafaautomation/llm-testing-toolkit.git
cd llm-testing-toolkit
npm install
npm test # Run unit tests
npm run typecheck # Type checking
npm run lint # ESLint
npm run format:check # Prettier
npm run build # Compile TypeScript

License: MIT
Built by Quvantic