Test Plan — Distillery CLI surface
Context
The MCP test plan #355 covers the in-protocol API and ambient-intel paths. This plan covers the distillery CLI, which is the operator/maintenance surface used in cron, container entrypoints, and one-off scripts. The CLI is its own dispatch path — argparse → handler functions in src/distillery/cli.py — not a thin wrapper over the MCP tool layer, so it warrants its own coverage.
Surface (commit 5e4f924 / staging/api-hardening): 9 top-level subcommands + 1 maintenance subgroup.
```
distillery [--config PATH] [--format {text,json}] [--version] {status,health,poll,retag,gh-backfill,export,import,eval,maintenance}
distillery maintenance {classify}
```
Existing pytest coverage (tests/test_cli.py, tests/test_cli_export_import.py): status, health, --config, export, import, retag, maintenance classify, --version. Gaps: poll, gh-backfill, eval, --format json shape across the full surface, exit-code matrix, env-var/config-file precedence, error paths.
This plan exercises the entire CLI from a real shell against ephemeral tmp DBs — verifying behaviour the way an operator's cron job or a CI deploy script does.
Surface map
| Subcommand | Side effect | Network | Existing pytest |
|---|---|---|---|
| `status` | none | none | yes |
| `health` | DB connect | none | yes |
| `poll [--source URL]` | embed + write | yes (poll API) | partial (`FeedPoller` class) |
| `retag [--dry-run] [--force]` | write tags | none | yes |
| `gh-backfill [--dry-run]` | write project/tags/author/metadata | none | none |
| `export --output PATH` | write file | none | yes |
| `import --input PATH [--mode {merge,replace}] [--yes]` | write DB rows | none | yes |
| `eval [--skill] [--scenarios-dir] [--save-baseline] [--baseline] [--model] [--compare-cost]` | calls Claude API | yes (Anthropic) | unknown |
| `maintenance classify` | reclassify entries | none | yes |
Fixtures
Each scenario uses an isolated tmp directory:
```shell
WORK=$(mktemp -d)
export DISTILLERY_DB_PATH="$WORK/test.db"
export DISTILLERY_EMBEDDING_PROVIDER=deterministic  # avoid Jina/OpenAI calls
# (For some scenarios: DISTILLERY_CONFIG=$WORK/config.yaml with a tiny stanza)
trap 'rm -rf "$WORK"' EXIT
```
A small helper to seed entries via the MCP store layer (CLI lacks a store subcommand):
```python
# seed.py
import asyncio

from distillery.store.duckdb import DuckDBStore
from distillery.config import load_config
from distillery.embedding.deterministic import DeterministicEmbeddingProvider
from distillery.models import Entry, EntryType


async def main():
    cfg = load_config()
    store = DuckDBStore(cfg.store_path, embedding_provider=DeterministicEmbeddingProvider())
    await store.initialize()
    for i in range(5):
        await store.store(Entry.create(
            content=f"Seed entry {i} with distinct content about topic {i}",
            entry_type=EntryType.INBOX, author="seed", tags=["cli-test", f"seed-{i}"]))
    await store.close()


asyncio.run(main())
```
Group key
- L1 — Plumbing: argparse contract; `--help`, `--version`, `--format json`, `--config`, env vars, invalid subcommand, exit codes. No DB needed for most.
- L2 — Read-only: `status`, `health` against fresh and populated DB.
- L3 — Maintenance / backfill: `retag --dry-run`, `gh-backfill --dry-run`, `maintenance classify`. All idempotent/dry-run paths.
- L4 — Data lifecycle: `export` → `import` roundtrip; merge vs replace; `--yes` confirmation gate; malformed input.
- L5 — Feeds: `poll`; `poll --source URL`; no-config error path.
- L6 — Eval harness: `eval --skill recall` (or another small scenario set); `--save-baseline` / `--baseline`; `--compare-cost`.
Spawn one subagent per group. They share the worktree at `/tmp/distillery-test` and the venv at `/tmp/distillery-test/.venv`. Each scenario must clean up its tmp data on exit.
Group L1 — Plumbing
| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L1.1 | `--version` | `distillery --version` | exits 0; stdout matches `\d+\.\d+\.\d+` |
| L1.2 | top-level `--help` | `distillery --help` | exits 0; lists all 9 subcommands |
| L1.3 | invalid subcommand | `distillery bogus` | exits ≠0; stderr mentions "invalid choice" or similar |
| L1.4 | no subcommand | `distillery` | exits ≠0 (or 0 with help); stderr/stdout shows usage |
| L1.5 | per-sub `--help` | for each of `status health poll retag gh-backfill export import eval maintenance`: `distillery <sub> --help` | each exits 0 |
| L1.6 | `--format json` shape — status | `distillery --format json status` against fresh DB | exits 0; stdout is valid JSON; top-level keys include at minimum `entry_count` (or equivalent) |
| L1.7 | `--format json` shape — health | `distillery --format json health` | exits 0; stdout is valid JSON; includes a status-of-store field |
| L1.8 | `--config` flag overrides search | write a minimal `cfg.yaml` with `store: {path: $WORK/cfgtest.db}`; `distillery --config $WORK/cfg.yaml status` | exits 0; uses the `cfg.yaml` DB path (verify by file size or by seeding `cfgtest.db`) |
| L1.9 | `DISTILLERY_CONFIG` env var | `export DISTILLERY_CONFIG=$WORK/cfg.yaml; distillery status` (no `--config`) | same as L1.8 — env var honored |
| L1.10 | `--config` beats env var | both set, pointing at different DBs | `--config` wins (verify by which DB has the seeded entries) |
| L1.11 | malformed config | write `cfg.yaml` containing `:::not yaml:::`; `distillery --config $WORK/cfg.yaml status` | exits ≠0; stderr names the file and the parse error (no traceback leak) |
| L1.12 | missing config file | `distillery --config /nonexistent.yaml status` | exits ≠0; stderr message readable |
Pass: all 12 rows green. Findings to flag: any unhandled exception traceback (vs structured error message), any subcommand whose --format json returns text, any inconsistent exit-code behaviour.
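For the `--format json` rows, a small shape check beats eyeballing output. The required key names (e.g. `entry_count`) come from the pass criteria above and are assumptions about the real schema:

```python
# json_shape.py — validate that CLI stdout is a JSON object with required keys
import json


def check_json_shape(stdout: str, required_keys: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the shape check passes."""
    try:
        doc = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(doc, dict):
        return [f"top level is {type(doc).__name__}, expected object"]
    missing = required_keys - doc.keys()
    return [f"missing keys: {sorted(missing)}"] if missing else []
```

A non-empty return value is the one-line evidence for a FAIL verdict.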
Group L2 — Read-only (status, health)
Setup: fresh tmp DB.
| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L2.1 | status (fresh) | `distillery status` | exits 0; reports 0 entries or equivalent empty state |
| L2.2 | status (populated) | seed 5 entries; `distillery status` | exits 0; entry count = 5; covers expected fields (counts by type/status if present) |
| L2.3 | status JSON shape | `distillery --format json status` after L2.2 | JSON parses; `entry_count` = 5; per-type breakdown if present |
| L2.4 | health on healthy DB | `distillery health` | exits 0; "OK" or equivalent |
| L2.5 | health on broken DB | corrupt the DB file (truncate, or set `DISTILLERY_DB_PATH=/dev/null`); `distillery health` | exits ≠0; stderr surfaces the failure cleanly (no Python traceback leak) |
| L2.6 | health JSON | `distillery --format json health` | JSON parses; status field present |
Pass: L2.1–4, L2.6 green. L2.5 must produce a structured error, not a raw traceback — flag if it does.
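One way to manufacture the broken-but-present DB for L2.5 — a sketch; truncating to zero bytes or pointing `DISTILLERY_DB_PATH` at `/dev/null` are the other options from the table:

```python
# break_db.py — overwrite the DB file with junk so `health` must fail, not traceback
from pathlib import Path


def corrupt_db(path: str, junk: bytes = b"definitely not a duckdb file") -> int:
    """Replace the file at path with junk bytes; returns the new size as a sanity check."""
    target = Path(path)
    target.write_bytes(junk)
    return target.stat().st_size
```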
Group L3 — Maintenance / backfill
Setup: tmp DB seeded with mixed entries (some inbox, some github, some feed). gh-backfill needs entries that look like pre-#312 GitHub entries (project=null, empty tags).
| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L3.1 | retag dry-run | seed 3 feed entries; `distillery retag --dry-run` | exits 0; reports N entries that WOULD be updated; DB unchanged (assert tag fields identical pre/post) |
| L3.2 | retag actual | repeat without `--dry-run` | exits 0; tags actually written; second invocation reports 0 changes (idempotent) |
| L3.3 | retag `--force` | re-run with `--force` after L3.2 | exits 0; reports all N updated regardless of empty-tags state |
| L3.4 | gh-backfill dry-run | seed pre-#312 GitHub entries (project=null, tags=[]); `distillery gh-backfill --dry-run` | exits 0; reports N entries that WOULD be updated; DB unchanged |
| L3.5 | gh-backfill actual | repeat without `--dry-run` | exits 0; entries now have non-null project, non-empty tags including `source/github` and `repo/<name>`; idempotent on second run |
| L3.6 | gh-backfill no-op DB | empty DB; `distillery gh-backfill` | exits 0; reports 0 candidates |
| L3.7 | maintenance classify (no entries) | `distillery maintenance classify` | exits 0; reports 0 processed |
| L3.8 | maintenance classify (with pending_review) | seed 3 entries, classify each with confidence 0.3 to manufacture pending_review; `distillery maintenance classify` | exits 0; reports N processed; counts by disposition |
| L3.9 | maintenance classify `--format json` | `distillery --format json maintenance classify` | JSON; includes per-disposition counts |
Pass: all 9 rows green. Flag: any non-idempotent path (re-running shouldn't double-tag), any traceback on empty DB.
Group L4 — Data lifecycle (export / import)
Setup: tmp DB seeded with 5 entries + 2 feed sources.
| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L4.1 | export | `distillery export --output $WORK/dump.json` | exits 0; file exists; valid JSON; contains `entries[]` and `feed_sources[]`; NO embedding vectors in payload (regression guard) |
| L4.2 | export overwrite | run twice to same path | second run overwrites cleanly, no error |
| L4.3 | import merge to fresh DB | new tmp DB; `distillery import --input $WORK/dump.json --mode merge` | exits 0; entry count matches dump; recomputes embeddings (verify by `distillery status` showing entries) |
| L4.4 | import merge skips existing | re-run import on same DB | exits 0; reports skipped count = N; total count unchanged |
| L4.5 | import replace requires `--yes` | `distillery import --input $WORK/dump.json --mode replace` (interactive stdin closed) | exits ≠0 OR prompts; must NOT silently delete |
| L4.6 | import replace `--yes` | `echo y \| distillery import --input ... --mode replace --yes` | exits 0; pre-existing entries dropped; new ones inserted |
| L4.7 | import malformed JSON | `echo "not json" > $WORK/bad.json && distillery import --input $WORK/bad.json --mode merge` | exits ≠0; stderr names parse error |
| L4.8 | import roundtrip fidelity | export → fresh DB → import → re-export → diff the two JSONs (sort keys) | only differences should be timestamps and embedding vectors; no semantic loss |
| L4.9 | import feed sources | dump has 2 sources; after import, list sources via `distillery_watch action=list` (in-process check or the MCP server if running) | both sources present with original `poll_interval_minutes` and `trust_weight` preserved |
Pass: all 9 rows green. L4.5 is critical — a silent destructive replace must be filed as a security/safety bug.
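The L4.8 diff can be mechanized by normalizing both dumps before comparing. The volatile field names below (`created_at`, `updated_at`, `embedding`) are assumptions about the dump schema; adjust to whatever the real export payload calls its timestamp/embedding fields:

```python
# roundtrip.py — L4.8 helper: compare two export dumps modulo volatile fields
import json

VOLATILE = {"created_at", "updated_at", "embedding"}  # assumed field names


def normalize(obj):
    """Recursively sort dict keys and drop volatile fields."""
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in sorted(obj.items()) if k not in VOLATILE}
    if isinstance(obj, list):
        return [normalize(item) for item in obj]
    return obj


def dumps_equal(a: str, b: str) -> bool:
    """True when two export dumps agree on everything except volatile fields."""
    return normalize(json.loads(a)) == normalize(json.loads(b))
```

This assumes entry ordering is stable across export runs; if it is not, sort `entries[]` by id first.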
Group L5 — Feeds (poll)
Setup: tmp DB with one watched RSS source. Use a stable tiny feed (https://github.com/<user>.atom or a vendored test feed file) — or a local stub HTTP server on :9101 returning a fixed RSS payload, to avoid network flake.
| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L5.1 | poll all sources (no sources) | empty DB; `distillery poll` | exits 0; reports 0 sources polled |
| L5.2 | poll all sources (1 source) | seed 1 RSS source; `distillery poll` | exits 0; reports source polled; entry count > 0 if feed has items |
| L5.3 | poll `--source URL` filter | seed 2 sources; `distillery poll --source https://example.test/a.rss` | exits 0; only that source polled (verify by which source's `last_polled_at` updated) |
| L5.4 | poll `--source` nonexistent | `distillery poll --source https://not-watched.test/x.rss` | exits ≠0 OR exits 0 with "no matching source" — document either; must not poll all sources by accident |
| L5.5 | poll JSON | `distillery --format json poll` | JSON; per-source result with `item_count` + status |
| L5.6 | poll persistence | after L5.2, run `distillery status` | reflects ingested entries; `last_polled_at` populated on the source |
Pass: all 6 green. Flag: L5.4 silently polling everything when --source doesn't match.
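A sketch of the local stub server mentioned in the setup; the port and RSS payload are illustrative:

```python
# stub_feed.py — serve a fixed RSS payload locally so poll tests avoid network flake
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

RSS = b"""<?xml version="1.0"?>
<rss version="2.0"><channel><title>stub feed</title>
<item><title>item one</title><link>http://127.0.0.1:9101/1</link></item>
</channel></rss>"""


class FeedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/rss+xml")
        self.send_header("Content-Length", str(len(RSS)))
        self.end_headers()
        self.wfile.write(RSS)

    def log_message(self, *args):  # keep scenario output clean
        pass


def serve(port: int = 9101) -> HTTPServer:
    """Start the stub in a daemon thread; caller uses .server_port and .shutdown()."""
    server = HTTPServer(("127.0.0.1", port), FeedHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    serve()
    threading.Event().wait()  # block until killed
```

Seed the watched source as `http://127.0.0.1:9101/feed.rss` and every `poll` run is then deterministic.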
Group L6 — Eval harness
Setup: tmp DB; requires Anthropic API key OR a stub. Likely runs against live API by default — gate with ANTHROPIC_API_KEY check; SKIP scenarios if unset.
| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L6.1 | eval `--help` | `distillery eval --help` | exits 0; lists all flags |
| L6.2 | eval default scenarios dir | `distillery eval --skill recall` (or smallest skill) | exits 0; reports per-scenario pass/fail; respects `--model` default haiku |
| L6.3 | eval `--save-baseline` | `distillery eval --skill recall --save-baseline $WORK/base.json` | exits 0; baseline file exists; valid JSON with scenario results + cost |
| L6.4 | eval `--baseline` compare | repeat with `--baseline $WORK/base.json` | exits 0; reports regressions (none expected on identical run) |
| L6.5 | eval `--compare-cost` | `distillery eval --skill recall --baseline $WORK/base.json --compare-cost` | exits 0; reports cost delta vs baseline |
| L6.6 | eval missing scenarios dir | `distillery eval --scenarios-dir /nonexistent` | exits ≠0; readable error |
| L6.7 | eval invalid skill | `distillery eval --skill totally-not-a-skill` | exits 0 with "0 scenarios run" OR exits ≠0 with clear error — document |
| L6.8 | eval `--format json` | `distillery --format json eval --skill recall` | JSON parses; per-scenario object schema documented |
Pass: L6.1, L6.6, L6.7, L6.8 green. L6.2–L6.5 SKIP if ANTHROPIC_API_KEY not set.
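The SKIP gate for the live-API rows can be as small as this sketch; the subagent calls it before each of L6.2–L6.5 and marks the row SKIPPED on a False return:

```python
# eval_gate.py — SKIP (not FAIL) the live-API rows when no key is present
import os


def eval_gate() -> bool:
    """True when the live Anthropic rows may run; prints a SKIP marker otherwise."""
    if not os.environ.get("ANTHROPIC_API_KEY"):
        print("SKIP: ANTHROPIC_API_KEY not set")
        return False
    return True
```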
Subagent dispatch
Spawn 6 parallel subagents (one per group). All share /tmp/distillery-test worktree + .venv. Each subagent uses its own mktemp -d for tmp data + DB, and cleans up on exit.
Per-subagent prompt template:
You are the Group L test runner for the distillery CLI test plan. Worktree at /tmp/distillery-test (commit 5e4f924). Use cd /tmp/distillery-test && .venv/bin/distillery … for every CLI invocation. For each scenario in your assigned group, execute exactly what's listed, capture stdout/stderr/exit code, decide PASS/FAIL/BLOCKED with one-line evidence. Do NOT modify source. Use mktemp -d for any DB or output file; clean up on exit. If a scenario needs a populated DB, use the seed.py snippet at the top of the plan. Do NOT touch the user's ~/.distillery/ or shared resources.
Final report: markdown table | # | Scenario | Result | Evidence | plus a Findings section for any real bug (not a coverage/test-naming gap).
Aggregator (this conversation): collects per-group tables, posts an aggregate report to a follow-up GitHub issue mirroring #355.
Findings to file as issues if observed
- Critical: any traceback leak instead of structured error (L1.11, L1.12, L2.5, L4.7).
- Critical: silent destructive replace without `--yes` confirmation (L4.5).
- Critical: `--source URL` filter falling through to all sources when no match (L5.4).
- Major: non-idempotent backfill (L3.5 second run mutating again).
- Major: `--format json` returning text on any subcommand (L1.6, L1.7, L3.9, L5.5, L6.8).
- Major: export payload including embedding vectors (L4.1 regression guard — embeddings should be recomputed on import, not transported).
- Minor: tool count or version drift in `--version` / `--help` output.
Verification before dispatch
```shell
cd /tmp/distillery-test
.venv/bin/distillery --help  # CLI installs and runs
.venv/bin/python -c "from distillery.cli import main; print('ok')"  # importable
.venv/bin/pytest tests/test_cli.py tests/test_cli_export_import.py -q --no-header 2>&1 | tail -3
```
If pytest baseline (~50 cases) is green, dispatch the 6 group subagents in parallel.