feat(testing): Add Golden Response System for Snapshot Testing #1607

Open
Mustafa11300 wants to merge 3 commits into mofa-org:main from Mustafa11300:feat/issue4-golden-response-system

Mustafa11300 (Contributor) commented Apr 9, 2026

🔍 Description

Adds golden response (snapshot) testing to the mofa-testing crate. Agent outputs are recorded as baselines, and later runs are compared against them so that regressions are detected automatically.

This is Issue 4 from the testing platform roadmap (Test Definition Layer, Phase 2).

Closes #1604
Depends on #1599

📌 Changes

New: tests/src/golden.rs

Core golden response module with the following public components:

| Component | Purpose |
| --- | --- |
| GoldenSnapshot | Serializable record of turn outputs (JSON/YAML) |
| GoldenTurnSnapshot | Per-turn golden record: user_input, response, tool_calls |
| GoldenStore | Filesystem-backed read/write/list for .golden.json files |
| GoldenTestConfig | Configuration: strict (validate) vs. update (record) modes |
| GoldenCompareMode | Enum: Strict or Update |
| GoldenCompareResult | Structured comparison outcome with pass/fail and diffs |
| GoldenDiff | Per-field diff variants (response, tool count, tool name, tool args, turn count) |
| GoldenError | Structured error type for all golden operations |
| Normalizer trait | Pluggable text normalization for non-deterministic content |
| WhitespaceNormalizer | Collapses whitespace before comparison |
| RegexNormalizer | Replaces regex matches with placeholders (UUIDs, timestamps) |
| NormalizerChain | Chains normalizers; default_chain() handles UUIDs + timestamps + whitespace |
| compare_golden() | Standalone comparison engine with optional normalizer |
| run_golden_test() | End-to-end golden test runner integrated with TestReport |

Comparison fields:

  • Response text (with optional normalization)
  • Tool call count per turn
  • Tool call names (positional)
  • Tool call arguments (deep JSON equality)
  • Turn count
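The field-by-field comparison listed above can be pictured with a small standalone sketch. It is not the crate's compare_golden(): the `Turn`, `ToolCall`, and `Diff` types are hypothetical stand-ins, and tool arguments are compared as plain strings here instead of by deep JSON equality, so that the example needs only the standard library.

```rust
// Hypothetical types for illustration only; the real GoldenTurnSnapshot and
// GoldenDiff in mofa-testing may use different fields and variants.
#[derive(Debug, Clone, PartialEq)]
struct ToolCall {
    name: String,
    args: String, // stand-in for a JSON value
}

#[derive(Debug, Clone, PartialEq)]
struct Turn {
    response: String,
    tool_calls: Vec<ToolCall>,
}

#[derive(Debug, PartialEq)]
enum Diff {
    TurnCount { expected: usize, actual: usize },
    Response { turn: usize },
    ToolCount { turn: usize, expected: usize, actual: usize },
    ToolName { turn: usize, index: usize },
    ToolArgs { turn: usize, index: usize },
}

/// Collects every difference instead of stopping at the first one.
fn compare(expected: &[Turn], actual: &[Turn]) -> Vec<Diff> {
    let mut diffs = Vec::new();
    if expected.len() != actual.len() {
        diffs.push(Diff::TurnCount { expected: expected.len(), actual: actual.len() });
    }
    // Turns present on both sides are compared pairwise.
    for (turn, (e, a)) in expected.iter().zip(actual).enumerate() {
        if e.response != a.response {
            diffs.push(Diff::Response { turn });
        }
        if e.tool_calls.len() != a.tool_calls.len() {
            diffs.push(Diff::ToolCount {
                turn,
                expected: e.tool_calls.len(),
                actual: a.tool_calls.len(),
            });
        }
        // Tool calls are matched by position, not by name.
        for (index, (ec, ac)) in e.tool_calls.iter().zip(&a.tool_calls).enumerate() {
            if ec.name != ac.name {
                diffs.push(Diff::ToolName { turn, index });
            } else if ec.args != ac.args {
                diffs.push(Diff::ToolArgs { turn, index });
            }
        }
    }
    diffs
}
```

Returning a list of diffs, as opposed to a single boolean, is what lets a report show several mismatches from one comparison.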

Modes:

  • Update: Executes scenario, saves outputs as new golden baseline
  • Strict: Executes scenario, compares against stored golden, reports diffs
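The difference between the two modes can be sketched with the standard library alone. The `Mode`, `Outcome`, and `check_golden` names below are hypothetical and do not mirror the crate's GoldenTestConfig or run_golden_test; the sketch also stores a single string, where the PR stores structured snapshots.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical names for illustration; not the mofa-testing API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Update,
    Strict,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Recorded,
    Matched,
    Mismatch { expected: String, actual: String },
}

fn check_golden(mode: Mode, path: &Path, actual: &str) -> io::Result<Outcome> {
    match mode {
        // Update: overwrite the baseline with whatever the run produced.
        Mode::Update => {
            if let Some(parent) = path.parent() {
                fs::create_dir_all(parent)?;
            }
            fs::write(path, actual)?;
            Ok(Outcome::Recorded)
        }
        // Strict: never write; a missing baseline surfaces as an I/O error.
        Mode::Strict => {
            let expected = fs::read_to_string(path)?;
            if expected == actual {
                Ok(Outcome::Matched)
            } else {
                Ok(Outcome::Mismatch { expected, actual: actual.to_string() })
            }
        }
    }
}
```

In this sketch strict mode treats a missing baseline as an error instead of silently recording one, which matches the "missing snapshot error" case listed among the integration tests.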

Modified: tests/src/lib.rs

  • Registered the new module with pub mod golden
  • Added public re-exports for all golden types and functions

Modified: tests/Cargo.toml

  • Added tempfile = { workspace = true } as a dev-dependency for test isolation

New: tests/tests/golden_tests.rs

30+ comprehensive tests covering:

  • Serialization: JSON/YAML roundtrip, multi-turn snapshots, metadata
  • GoldenStore: save/load, exists check, list snapshots, special characters in names, nonexistent file error
  • Comparison: identical outputs (no diffs), response mismatch, turn count mismatch, tool call count mismatch, tool name mismatch, tool args mismatch, multiple diffs in single comparison
  • Normalizers: whitespace collapse, UUID replacement, timestamp replacement, normalizer chain, compare-with-normalizer for whitespace and UUIDs
  • Integration: update mode saves snapshot, strict mode passes/fails, missing snapshot error, full update→strict roundtrip, strict with normalizer, multi-turn golden, tool call golden
  • Display: diff formatting, diff serialization
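The store tests above mention special characters in snapshot names. One way a filesystem-backed store might map names to .golden.json paths is sketched below. The `snapshot_path` and `snapshot_name` helpers and the replace-with-underscore rule are assumptions made for illustration; the PR description does not say how GoldenStore actually handles such names.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: maps a snapshot name to `<dir>/<name>.golden.json`,
/// replacing anything outside [A-Za-z0-9_-] with '_' so the name cannot
/// introduce path separators.
fn snapshot_path(dir: &Path, name: &str) -> PathBuf {
    let safe: String = name
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' || c == '_' { c } else { '_' })
        .collect();
    dir.join(format!("{safe}.golden.json"))
}

/// Hypothetical helper: recovers the snapshot name when listing a directory,
/// returning None for files that are not golden snapshots.
fn snapshot_name(path: &Path) -> Option<&str> {
    path.file_name()?.to_str()?.strip_suffix(".golden.json")
}
```

A lossy mapping like this one can make two different names collide on the same file, so a real store would need either a reversible encoding or a collision check.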

New: examples/golden_response_test/

| File | Description |
| --- | --- |
| README.md | Usage guide with update/strict/normalizer code samples and CI workflow |
| goldens/weather_agent.golden.json | Example golden: multi-turn weather agent with tool calls |
| goldens/support_agent.golden.json | Example golden: support agent with ticket lookup |

🧪 Testing

All new functionality is covered by tests/tests/golden_tests.rs with 30+ test cases.

Key test categories:

  1. Unit tests — Snapshot serialization, normalizers, diff formatting
  2. Store tests — Filesystem save/load/list/exists operations
  3. Comparison tests — Field-level diff detection for all mismatch types
  4. Integration tests — Full update→strict roundtrip, multi-turn, tool calls
  5. Normalizer tests — Whitespace, UUID, timestamp, chained normalizers
  6. Error path tests — Missing snapshot, nonexistent file, parse failures

💡 Usage Example

Record a golden baseline

use mofa_testing::{AgentTest, GoldenStore, GoldenTestConfig, run_golden_test};

let scenario = AgentTest::new("my_agent")
    .when_user_says("Hello")
    .then_agent_should()
    .respond_containing("Hi")
    .build()?;

// Update mode: save actual outputs as the golden baseline.
// `agent` is the agent under test, constructed elsewhere (not shown here).
let config = GoldenTestConfig::update(GoldenStore::new("./goldens"));
let report = run_golden_test(&config, &scenario, &mut agent).await;

Validate against golden

// Strict mode: compare against stored golden
let config = GoldenTestConfig::strict(GoldenStore::new("./goldens"));
let report = run_golden_test(&config, &scenario, &mut agent).await;

assert_eq!(report.failed(), 0, "golden regression detected");

With normalizers (ignore UUIDs/timestamps)

use mofa_testing::NormalizerChain;

let config = GoldenTestConfig::strict(GoldenStore::new("./goldens"))
    .with_normalizer(NormalizerChain::default_chain()?);
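To illustrate why normalization makes non-deterministic output comparable, here is a minimal standalone sketch of a normalizer trait with a chain. It is not the crate's implementation: the PR uses a RegexNormalizer, whereas this version swaps in a hand-written UUID scanner to stay within the standard library, and `UuidNormalizer` and `with_step` are hypothetical names.

```rust
// Standalone sketch; names and signatures are assumptions, not the
// mofa-testing API.
trait Normalizer {
    fn normalize(&self, input: &str) -> String;
}

/// Collapses runs of whitespace into single spaces and trims the ends.
struct WhitespaceNormalizer;

impl Normalizer for WhitespaceNormalizer {
    fn normalize(&self, input: &str) -> String {
        input.split_whitespace().collect::<Vec<_>>().join(" ")
    }
}

/// Replaces anything shaped like 8-4-4-4-12 hex digits with "<UUID>".
struct UuidNormalizer;

fn is_uuid(bytes: &[u8]) -> bool {
    bytes.len() == 36
        && bytes.iter().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => *b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

impl Normalizer for UuidNormalizer {
    fn normalize(&self, input: &str) -> String {
        let bytes = input.as_bytes();
        let mut out = String::with_capacity(input.len());
        let mut i = 0;
        while i < bytes.len() {
            if i + 36 <= bytes.len() && is_uuid(&bytes[i..i + 36]) {
                out.push_str("<UUID>");
                i += 36; // a UUID is pure ASCII, so this stays on a char boundary
            } else {
                // Copy one whole character, which may span several bytes.
                let ch = input[i..].chars().next().unwrap();
                out.push(ch);
                i += ch.len_utf8();
            }
        }
        out
    }
}

/// Applies each step in order, feeding the output of one into the next.
struct NormalizerChain {
    steps: Vec<Box<dyn Normalizer>>,
}

impl NormalizerChain {
    fn new() -> Self {
        Self { steps: Vec::new() }
    }

    fn with_step(mut self, step: impl Normalizer + 'static) -> Self {
        self.steps.push(Box::new(step));
        self
    }
}

impl Normalizer for NormalizerChain {
    fn normalize(&self, input: &str) -> String {
        self.steps.iter().fold(input.to_string(), |acc, step| step.normalize(&acc))
    }
}
```

Applying the same chain to both the stored golden text and the fresh output before comparing means two responses that differ only in a generated ID or in spacing are treated as equal.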

CI Workflow

# CI: strict mode catches regressions
cargo test --test golden_tests

# Local: update baselines when behavior intentionally changes
GOLDEN_MODE=update cargo test --test golden_tests
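The workflow above switches modes through the GOLDEN_MODE environment variable. How the test suite reads that variable is not shown in this description, so the following is only a sketch of one plausible approach; `Mode`, `parse_mode`, and `mode_from_env` are hypothetical names.

```rust
// Hypothetical helper; not part of the mofa-testing API as described.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Strict,
    Update,
}

/// Only an explicit "update" (any case, surrounding spaces ignored) enables
/// recording; an unset or unrecognized value falls back to strict mode.
fn parse_mode(value: Option<&str>) -> Mode {
    match value.map(|v| v.trim().to_ascii_lowercase()) {
        Some(v) if v == "update" => Mode::Update,
        _ => Mode::Strict,
    }
}

/// Reads GOLDEN_MODE from the process environment.
fn mode_from_env() -> Mode {
    parse_mode(std::env::var("GOLDEN_MODE").ok().as_deref())
}
```

Defaulting to strict keeps CI safe: a typo in the variable can cause a failing comparison but cannot silently overwrite baselines.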

✅ Checklist

  • Code follows project conventions and style
  • New module registered in lib.rs with public re-exports
  • Comprehensive tests added (golden_tests.rs)
  • Examples added (examples/golden_response_test/)
  • README with usage documentation included
  • Structured error types with GoldenError
  • Normalizer trait for handling non-deterministic content
  • No breaking changes to existing APIs
  • Builds on the DSL foundation from feat(testing): Add Agent Test DSL with Declarative Scenario Builder #1599
  • tempfile dev-dependency uses workspace version

Implements parameterized scenario expansion for the mofa-testing crate.
One scenario template can now expand into many concrete test cases by
substituting {{variable}} placeholders with values from parameter sets.

New components:
- ParameterSet: named variable bindings for one test variant
- ParameterMatrix: Cartesian product expansion with safety limits
- ParameterizedScenario: template + parameter sets -> expanded scenarios
- ParameterizedScenarioFile: YAML/TOML/JSON file-backed loading

Includes:
- 30+ comprehensive tests covering expansion, substitution, file
  loading, execution, edge cases, and error handling
- Example scenarios in examples/parameterized_test/ with YAML, TOML,
  and matrix expansion demonstrations
- README with usage guide and code samples

Closes #<ISSUE_NUMBER>

Implements golden response (snapshot) testing for the mofa-testing crate.
Agent outputs are recorded as baselines, then future runs are compared
against them to detect regressions automatically.

New components:
- GoldenSnapshot: serializable record of turn outputs (JSON/YAML)
- GoldenStore: filesystem-backed snapshot persistence
- GoldenTestConfig: strict (validate) vs. update (record) modes
- GoldenDiff: structured per-field diff reporting
- Normalizer trait + WhitespaceNormalizer, RegexNormalizer, NormalizerChain
- run_golden_test: end-to-end golden test runner integrated with TestReport
- compare_golden: standalone comparison engine

Includes:
- 30+ comprehensive tests covering serialization, store operations,
  diff detection, normalizers, update/strict mode, multi-turn, and
  tool call verification
- Example golden snapshots and README in examples/golden_response_test/
- tempfile added as dev-dependency for test isolation

Closes #<ISSUE_NUMBER>