
eval: integrate Multipass Structural Memory Eval (SME) for nightly diagnostic #292

@norrietaylor

Description

Goal

Integrate M0nkeyFl0wer/multipass-structural-memory-eval (SME) as an automated diagnostic against Distillery's DuckDB+VSS store. SME tests what a memory system knows about its own structure — not just whether it can retrieve. It complements the existing Ragas-based eval (.github/workflows/eval-nightly.yml), which measures answer quality.

Why SME

  • Nine-category test menu (Cat 1–8 = graph/retrieval structure, Cat 9 = harness integration)
  • Multi-corpus, A/B/C ablation methodology — surfaces brittle defaults that hide on single-pass tests
  • Beta-level instrumentation; framing is diagnostic deltas, not absolute leaderboard scores
  • Directly comparable to a flat-baseline ChromaDB adapter shipped in-repo, which gives Distillery a defensible before/after readout against vanilla RAG on the same corpus
  • Cat 9 (The Handshake) tests whether the model actually reaches the memory in production — relevant to Distillery's MCP transport (stdio vs HTTP) and skill orchestration

Integration approach

SME requires implementing sme.adapters.base.SMEAdapter (a Python ABC). Three required methods, two optional — a small surface area that maps cleanly onto Distillery's DistilleryStore protocol:

| SMEAdapter method | Distillery mapping |
| --- | --- |
| `ingest_corpus(corpus)` | Loop `store.create()` over corpus dicts; embed via the configured `EmbeddingProvider` |
| `query(question)` | `store.search()` → assemble `context_string` (the exact text the LLM would see) |
| `get_graph_snapshot()` | Distillery has no edges by default — return `(entries_as_entities, [])`; tags become entity properties |
| `get_flat_retrieval()` | Pure VSS top-K — already what `store.search()` does without rerank |
| `get_ontology_source()` | Optional; could expose the tag tree as an ontology surface |
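To make the mapping concrete, here is a minimal sketch of the three required methods. It is hypothetical throughout: `FakeStore` stands in for the real `DistilleryStore`, and the record fields (`id`, `text`, `tags`) are illustrative assumptions, not verified against either codebase.

```python
# Hypothetical sketch — FakeStore, method signatures, and record fields are
# assumptions, not the actual Distillery or SME APIs.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeStore:
    """Stand-in for DistilleryStore: entries + tags, no first-class edges."""
    entries: list[dict[str, Any]] = field(default_factory=list)

    def create(self, entry: dict[str, Any]) -> None:
        self.entries.append(entry)

    def search(self, query: str, top_k: int = 5) -> list[dict[str, Any]]:
        # The real store does VSS top-K; substring match keeps this hermetic.
        hits = [e for e in self.entries if query.lower() in e["text"].lower()]
        return hits[:top_k]


class DistilleryAdapter:
    """Maps the three required SMEAdapter methods onto the store protocol."""

    def __init__(self, store: FakeStore):
        self.store = store

    def ingest_corpus(self, corpus: list[dict[str, Any]]) -> None:
        # Loop store.create() over corpus dicts (embedding elided here).
        for doc in corpus:
            self.store.create(doc)

    def query(self, question: str) -> str:
        # context_string: the exact text the LLM would see.
        return "\n\n".join(hit["text"] for hit in self.store.search(question))

    def get_graph_snapshot(self) -> tuple[list[dict[str, Any]], list]:
        # No edges by default: entries become entities, tags become properties.
        entities = [
            {"id": e.get("id"), "text": e["text"], "tags": e.get("tags", [])}
            for e in self.store.entries
        ]
        return entities, []
```

The real adapter would swap `FakeStore` for the DuckDB+VSS-backed store and embed on ingest; the method shapes stay the same.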

Distillery is graph-light (entries + tags, no first-class edges), so several Cat 1–8 categories will score near the flat baseline by design. That is the finding worth publishing. Cat 9 (Handshake) is where Distillery's MCP/skills story should differentiate.

Proposed automation

  1. New package extra: `pip install -e ".[sme]"` pins SME (vendored or git ref).
  2. Adapter: `src/distillery/eval/sme_adapter.py` implementing `SMEAdapter`, wired to an in-memory DuckDB + `deterministic_embedding_provider` for hermetic CI runs and a real backend for nightly.
  3. Workflow: new job in `.github/workflows/eval-nightly.yml` (or a sibling `eval-sme.yml`):
    • Install SME from the pinned ref
    • Run `sme-eval retrieve analyze cat8 cat2c` against the Distillery adapter
    • Diff results against the baseline JSON committed in `tests/eval/sme_baseline.json`
    • Upload the full report as a workflow artifact; fail only on schema/error regressions, not score drift (per SME's diagnostic-not-benchmark posture)
  4. Local run: `make eval-sme` target mirroring the workflow for reproduction.
  5. PR gate (optional, later): lightweight Cat 7 token-budget check in `eval-pr.yml` if runtime stays under ~2 min.
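The baseline-diff step (fail on schema/error regressions, tolerate score drift) could look roughly like this. The report shape assumed here — a JSON object mapping category names to `{score, errors}` — is a guess at SME's output format, not its actual schema.

```python
# Sketch of the CI diff step. The {category: {"score": ..., "errors": [...]}}
# report shape is an assumption about SME's JSON output, not its real schema.
import json
from pathlib import Path


def diff_against_baseline(report_path: Path, baseline_path: Path) -> list[str]:
    """Return a list of failure messages; empty list means CI passes."""
    report = json.loads(report_path.read_text())
    baseline = json.loads(baseline_path.read_text())
    failures: list[str] = []

    # Schema regression: a category present in the baseline has vanished.
    missing = set(baseline) - set(report)
    if missing:
        failures.append(f"missing categories: {sorted(missing)}")

    # Error regression: a category that ran cleanly now reports errors.
    for cat, base in baseline.items():
        cur = report.get(cat)
        if cur is None:
            continue
        if not base.get("errors") and cur.get("errors"):
            failures.append(f"{cat}: new errors {cur['errors']}")
        # Score drift is deliberately NOT a failure here — it is logged in the
        # uploaded artifact instead, per the diagnostic-not-benchmark posture.

    return failures
```

This keeps the gate narrow: the nightly job stays green through normal score movement and only trips when the run shape itself degrades.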

Acceptance criteria

  • DistilleryAdapter(SMEAdapter) implemented with passing unit tests (in-memory store, deterministic embeddings)
  • Nightly workflow job runs SME end-to-end and uploads JSON report
  • Baseline report committed under tests/eval/sme_baseline.json with provenance comment (commit SHA + SME ref)
  • docs/ entry explaining what SME measures, what Distillery scores, and how to read the deltas
  • Cat 9 (Handshake) approach scoped — even if implementation deferred, decide whether MCP-stdio or MCP-HTTP transport is the canonical "production" surface to test against
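For the hermetic unit tests, the deterministic-embedding idea can be sketched as a hash-derived vector: same text in, same vector out, no model or network in CI. The function name and dimension below are illustrative, not the actual `deterministic_embedding_provider` API.

```python
# Illustrative only — not the real deterministic_embedding_provider interface.
import hashlib


def deterministic_embedding(text: str, dim: int = 8) -> list[float]:
    """Same text -> same vector, so eval runs are reproducible in CI."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map each byte to [0, 1]; repeat the digest if dim exceeds its length.
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]
```

Hash-derived vectors carry no semantics, so they only suit tests of plumbing (ingest, snapshot shape, determinism), not of retrieval quality — the nightly run against a real backend covers that.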

Open questions

  • Vendor SME or pin via git ref in pyproject.toml? (SME is MIT-licensed and pre-1.0; pinning a SHA is safer than a tag.)
  • Which corpus shapes? SME ships test corpora; do we also need a Distillery-native corpus (e.g. derived from a sanitized snapshot of real entries) to test on representative data?
  • Does the Cat 9 Handshake test require running the Claude Code CLI in CI, or can it be exercised via the MCP server directly with a stub LLM client?

Labels: enhancement (New feature or request)