## Goal
Integrate M0nkeyFl0wer/multipass-structural-memory-eval (SME) as an automated diagnostic against Distillery's DuckDB+VSS store. SME tests what a memory system knows about its own structure — not just whether it can retrieve. It complements the existing Ragas-based eval (`.github/workflows/eval-nightly.yml`), which measures answer quality.
## Why SME
- Nine-category test menu (Cat 1–8 = graph/retrieval structure, Cat 9 = harness integration)
- Multi-corpus, A/B/C ablation methodology — surfaces brittle defaults that hide on single-pass tests
- Beta-level instrumentation; framing is diagnostic deltas, not absolute leaderboard scores
- Directly comparable to a flat-baseline ChromaDB adapter shipped in-repo, which gives Distillery a defensible before/after readout against vanilla RAG on the same corpus
- Cat 9 (The Handshake) tests whether the model actually reaches the memory in production — relevant to Distillery's MCP transport (stdio vs HTTP) and skill orchestration
## Integration approach
SME requires implementing `sme.adapters.base.SMEAdapter` (a Python ABC). Three required methods, two optional — a small surface area that maps cleanly onto Distillery's `DistilleryStore` protocol:

| `SMEAdapter` method | Distillery mapping |
| --- | --- |
| `ingest_corpus(corpus)` | Loop `store.create()` over corpus dicts; embed via the configured `EmbeddingProvider` |
| `query(question)` | `store.search()` → assemble `context_string` (the exact text the LLM would see) |
| `get_graph_snapshot()` | Distillery has no edges by default — return `(entries_as_entities, [])`; tags become entity properties |
| `get_flat_retrieval()` | Pure VSS top-K — already what `store.search()` does without rerank |
| `get_ontology_source()` | Optional; could expose the tag tree as an ontology surface |
Distillery is graph-light (entries + tags, no first-class edges), so several Cat 1–8 categories will score near the flat baseline by design. That is the finding worth publishing. Cat 9 (Handshake) is where Distillery's MCP/skills story should differentiate.
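The mapping above can be sketched as a small adapter class. The method names follow the table; everything else here is an assumption — `InMemoryStore` is a hypothetical stand-in for `DistilleryStore` (token overlap instead of DuckDB+VSS), and the exact signatures and return shapes would come from `sme.adapters.base.SMEAdapter`, which the real adapter would subclass.

```python
class InMemoryStore:
    """Hypothetical stand-in for DistilleryStore: entries with text + tags."""
    def __init__(self):
        self.entries = []

    def create(self, text, tags):
        self.entries.append({"id": len(self.entries), "text": text, "tags": tags})

    def search(self, query, k=5):
        # Crude token-overlap ranking standing in for VSS top-K.
        q = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(q & set(e["text"].lower().split())),
                        reverse=True)
        return scored[:k]


class DistillerySMEAdapter:
    """Sketch of the SMEAdapter surface over a Distillery-like store."""
    def __init__(self, store):
        self.store = store

    def ingest_corpus(self, corpus):
        for doc in corpus:  # corpus assumed to be a list of dicts
            self.store.create(doc["text"], doc.get("tags", []))

    def query(self, question):
        # context_string: the exact text the LLM would see.
        hits = self.store.search(question)
        return "\n---\n".join(h["text"] for h in hits)

    def get_graph_snapshot(self):
        # No first-class edges: entries become entities, tags become
        # entity properties, and the edge list stays empty.
        entities = [{"id": e["id"], "properties": {"tags": e["tags"]}}
                    for e in self.store.entries]
        return entities, []

    def get_flat_retrieval(self, question, k=5):
        return self.store.search(question, k)  # pure top-K, no rerank

    def get_ontology_source(self):
        return None  # optional; the tag tree could be exposed here
```

The only method that does real work beyond delegation is `get_graph_snapshot()`, which makes the graph-light nature of the store explicit to SME rather than hiding it.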
## Proposed automation
- New package extra — `pip install -e ".[sme]"` pins SME (vendored or git ref).
- Adapter — `src/distillery/eval/sme_adapter.py` implementing `SMEAdapter`, wired to an in-memory DuckDB + `deterministic_embedding_provider` for hermetic CI runs and a real backend for nightly.
- Workflow — new job in `.github/workflows/eval-nightly.yml` (or a sibling `eval-sme.yml`):
  - Install SME from the pinned ref
  - Run `sme-eval retrieve` → `analyze` → `cat8` → `cat2c` against the Distillery adapter
  - Diff results against the baseline JSON committed at `tests/eval/sme_baseline.json`
  - Upload the full report as a workflow artifact; fail only on schema/error regressions, not score drift (per SME's diagnostic-not-benchmark posture)
- Local run — `make eval-sme` target mirroring the workflow for reproduction.
- PR gate (optional, later) — lightweight Cat 7 token-budget check in `eval-pr.yml` if runtime stays under ~2 min.
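The "fail only on schema/error regressions, not score drift" rule can be sketched as a small diff step. The report shape assumed here (`{"categories": {name: {"score": ..., "errors": [...]}}}`) is illustrative, not SME's actual output schema — the real check would read whatever `sme-eval` emits.

```python
import json


def diff_against_baseline(report: dict, baseline: dict) -> list:
    """Return CI-fatal regressions; score drift is never fatal."""
    failures = []

    # Schema regression: a category present in the baseline has vanished.
    missing = set(baseline["categories"]) - set(report["categories"])
    if missing:
        failures.append(f"categories missing from report: {sorted(missing)}")

    # Error regression: a category that previously ran cleanly now errors.
    for name, base in baseline["categories"].items():
        current = report["categories"].get(name)
        if current and current.get("errors") and not base.get("errors"):
            failures.append(f"{name}: new errors {current['errors']}")

    # Scores are deliberately ignored: per SME's diagnostic-not-benchmark
    # posture, drift is reported in the artifact, not gated on.
    return failures
```

In the workflow, a non-empty return value would exit non-zero; the full report still uploads as an artifact either way, so score drift remains visible without blocking the run.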
## Acceptance criteria
- `DistilleryAdapter(SMEAdapter)` implemented with passing unit tests (in-memory store, deterministic embeddings)
- `tests/eval/sme_baseline.json` committed with a provenance comment (commit SHA + SME ref)
- A `docs/` entry explaining what SME measures, what Distillery scores, and how to read the deltas

## Open questions
- Vendor SME or pin via a git ref in `pyproject.toml`? (SME is MIT-licensed and pre-1.0; pinning a SHA is safer than a tag.)
- Which corpus shapes? SME ships test corpora; do we also need a Distillery-native corpus (e.g. derived from a sanitized snapshot of real entries) to test on representative data?
- Does the Cat 9 Handshake test require running the Claude Code CLI in CI, or can it be exercised via the MCP server directly with a stub LLM client?

## References
- `docs/sme_spec_v8.md`
- `docs/ideas.md`
- `sme/adapters/base.py`
- `sme/adapters/flat_baseline.py` (ChromaDB) — the Distillery adapter mirrors this shape over DuckDB+VSS