
eval: integrate Multipass Structural Memory Eval (SME) for nightly diagnostic #292

@norrietaylor

Description

Goal

Integrate M0nkeyFl0wer/multipass-structural-memory-eval (SME) as an automated diagnostic against Distillery's DuckDB+VSS store. SME tests what a memory system knows about its own structure — not just whether it can retrieve. It complements the existing Ragas-based eval (.github/workflows/eval-nightly.yml), which measures answer quality.

Why SME

  • Nine-category test menu (Cat 1–8 = graph/retrieval structure, Cat 9 = harness integration)
  • Multi-corpus, A/B/C ablation methodology — surfaces brittle defaults that hide on single-pass tests
  • Beta-level instrumentation; framing is diagnostic deltas, not absolute leaderboard scores
  • Directly comparable to a flat-baseline ChromaDB adapter shipped in-repo, which gives Distillery a defensible before/after readout against vanilla RAG on the same corpus
  • Cat 9 (The Handshake) tests whether the model actually reaches the memory in production — relevant to Distillery's MCP transport (stdio vs HTTP) and skill orchestration

Integration approach

SME requires implementing sme.adapters.base.SMEAdapter (a Python ABC). Three required methods, two optional — a small surface area that maps cleanly onto Distillery's DistilleryStore protocol:

| SMEAdapter method | Distillery mapping |
| --- | --- |
| `ingest_corpus(corpus)` | Loop `store.create()` over corpus dicts; embed via the configured `EmbeddingProvider` |
| `query(question)` | `store.search()` → assemble `context_string` (the exact text the LLM would see) |
| `get_graph_snapshot()` | Distillery has no edges by default — return `(entries_as_entities, [])`; tags become entity properties |
| `get_flat_retrieval()` | Pure VSS top-K — already what `store.search()` does without rerank |
| `get_ontology_source()` | Optional; could expose the tag tree as an ontology surface |
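To make the mapping concrete, here is a minimal sketch of the three required methods. It is hypothetical throughout: `FakeStore` stands in for the real `DistilleryStore`, and the record fields (`id`, `text`, `tags`) are illustrative assumptions, not verified against either codebase.

```python
# Hypothetical sketch — FakeStore, method signatures, and record fields are
# assumptions, not the actual Distillery or SME APIs.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeStore:
    """Stand-in for DistilleryStore: entries + tags, no first-class edges."""
    entries: list[dict[str, Any]] = field(default_factory=list)

    def create(self, entry: dict[str, Any]) -> None:
        self.entries.append(entry)

    def search(self, query: str, top_k: int = 5) -> list[dict[str, Any]]:
        # The real store does VSS top-K; substring match keeps this hermetic.
        hits = [e for e in self.entries if query.lower() in e["text"].lower()]
        return hits[:top_k]


class DistilleryAdapter:
    """Maps the three required SMEAdapter methods onto the store protocol."""

    def __init__(self, store: FakeStore):
        self.store = store

    def ingest_corpus(self, corpus: list[dict[str, Any]]) -> None:
        # Loop store.create() over corpus dicts (embedding elided here).
        for doc in corpus:
            self.store.create(doc)

    def query(self, question: str) -> str:
        # context_string: the exact text the LLM would see.
        return "\n\n".join(hit["text"] for hit in self.store.search(question))

    def get_graph_snapshot(self) -> tuple[list[dict[str, Any]], list]:
        # No edges by default: entries become entities, tags become properties.
        entities = [
            {"id": e.get("id"), "text": e["text"], "tags": e.get("tags", [])}
            for e in self.store.entries
        ]
        return entities, []
```

The real adapter would swap `FakeStore` for the DuckDB+VSS-backed store and embed on ingest; the method shapes stay the same.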

Distillery is graph-light (entries + tags, no first-class edges), so several Cat 1–8 categories will score near the flat baseline by design. That is the finding worth publishing. Cat 9 (Handshake) is where Distillery's MCP/skills story should differentiate.

Proposed automation

  1. New package extra: `pip install -e ".[sme]"` pins SME (vendored or git ref).
  2. Adapter: `src/distillery/eval/sme_adapter.py` implementing `SMEAdapter`, wired to an in-memory DuckDB + `deterministic_embedding_provider` for hermetic CI runs and a real backend for nightly.
  3. Workflow: new job in `.github/workflows/eval-nightly.yml` (or a sibling `eval-sme.yml`):
    • Install SME from the pinned ref
    • Run `sme-eval retrieve analyze cat8 cat2c` against the Distillery adapter
    • Diff results against the baseline JSON committed in `tests/eval/sme_baseline.json`
    • Upload the full report as a workflow artifact; fail only on schema/error regressions, not score drift (per SME's diagnostic-not-benchmark posture)
  4. Local run: `make eval-sme` target mirroring the workflow for reproduction.
  5. PR gate (optional, later): lightweight Cat 7 token-budget check in `eval-pr.yml` if runtime stays under ~2 min.
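The baseline-diff step (fail on schema/error regressions, tolerate score drift) could look roughly like this. The report shape assumed here — a JSON object mapping category names to `{score, errors}` — is a guess at SME's output format, not its actual schema.

```python
# Sketch of the CI diff step. The {category: {"score": ..., "errors": [...]}}
# report shape is an assumption about SME's JSON output, not its real schema.
import json
from pathlib import Path


def diff_against_baseline(report_path: Path, baseline_path: Path) -> list[str]:
    """Return a list of failure messages; empty list means CI passes."""
    report = json.loads(report_path.read_text())
    baseline = json.loads(baseline_path.read_text())
    failures: list[str] = []

    # Schema regression: a category present in the baseline has vanished.
    missing = set(baseline) - set(report)
    if missing:
        failures.append(f"missing categories: {sorted(missing)}")

    # Error regression: a category that ran cleanly now reports errors.
    for cat, base in baseline.items():
        cur = report.get(cat)
        if cur is None:
            continue
        if not base.get("errors") and cur.get("errors"):
            failures.append(f"{cat}: new errors {cur['errors']}")
        # Score drift is deliberately NOT a failure here — it is logged in the
        # uploaded artifact instead, per the diagnostic-not-benchmark posture.

    return failures
```

This keeps the gate narrow: the nightly job stays green through normal score movement and only trips when the run shape itself degrades.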

Acceptance criteria

  • DistilleryAdapter(SMEAdapter) implemented with passing unit tests (in-memory store, deterministic embeddings)
  • Nightly workflow job runs SME end-to-end and uploads JSON report
  • Baseline report committed under tests/eval/sme_baseline.json with provenance comment (commit SHA + SME ref)
  • docs/ entry explaining what SME measures, what Distillery scores, and how to read the deltas
  • Cat 9 (Handshake) approach scoped — even if implementation deferred, decide whether MCP-stdio or MCP-HTTP transport is the canonical "production" surface to test against
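For the hermetic unit tests, the deterministic-embedding idea can be sketched as a hash-derived vector: same text in, same vector out, no model or network in CI. The function name and dimension below are illustrative, not the actual `deterministic_embedding_provider` API.

```python
# Illustrative only — not the real deterministic_embedding_provider interface.
import hashlib


def deterministic_embedding(text: str, dim: int = 8) -> list[float]:
    """Same text -> same vector, so eval runs are reproducible in CI."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map each byte to [0, 1]; repeat the digest if dim exceeds its length.
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]
```

Hash-derived vectors carry no semantics, so they only suit tests of plumbing (ingest, snapshot shape, determinism), not of retrieval quality — the nightly run against a real backend covers that.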

Open questions

  • Vendor SME or pin via git ref in pyproject.toml? (SME is MIT-licensed and pre-1.0; pinning a SHA is safer than a tag.)
  • Which corpus shapes? SME ships test corpora; do we also need a Distillery-native corpus (e.g. derived from a sanitized snapshot of real entries) to test on representative data?
  • Does the Cat 9 Handshake test require running the Claude Code CLI in CI, or can it be exercised via the MCP server directly with a stub LLM client?

Labels: enhancement (New feature or request)