
Test plan: Distillery CLI surface coverage #369

@norrietaylor

Description


Test Plan — Distillery CLI surface

Context

The MCP test plan #355 covers the in-protocol API and ambient-intel paths. This plan covers the distillery CLI, which is the operator/maintenance surface used in cron, container entrypoints, and one-off scripts. The CLI is its own dispatch path — argparse → handler functions in src/distillery/cli.py — not a thin wrapper over the MCP tool layer, so it warrants its own coverage.

Surface (commit 5e4f924 / staging/api-hardening): 9 top-level subcommands + 1 maintenance subgroup.

distillery [--config PATH] [--format {text,json}] [--version] {status,health,poll,retag,gh-backfill,export,import,eval,maintenance}
distillery maintenance {classify}

Existing pytest coverage (tests/test_cli.py, tests/test_cli_export_import.py): status, health, --config, export, import, retag, maintenance classify, --version. Gaps: poll, gh-backfill, eval, --format json shape across the full surface, exit-code matrix, env-var/config-file precedence, error paths.

This plan exercises the entire CLI from a real shell against ephemeral tmp DBs — verifying behaviour the way an operator's cron job or a CI deploy script does.

Surface map

| Subcommand | Side effect | Net | Existing pytest |
|---|---|---|---|
| status | none | none | yes |
| health | DB connect | none | yes |
| poll [--source URL] | embed + write | yes (poll API) | partial (FeedPoller class) |
| retag [--dry-run] [--force] | write tags | none | yes |
| gh-backfill [--dry-run] | write project/tags/author/metadata | none | none |
| export --output PATH | write file | none | yes |
| import --input PATH [--mode {merge,replace}] [--yes] | write DB rows | none | yes |
| eval [--skill] [--scenarios-dir] [--save-baseline] [--baseline] [--model] [--compare-cost] | calls Claude API | yes (Anthropic) | unknown |
| maintenance classify | reclassify entries | none | yes |

Fixtures

Each scenario uses an isolated tmp directory:

WORK=$(mktemp -d)
trap 'rm -rf "$WORK"' EXIT                           # register cleanup before anything else can fail
export DISTILLERY_DB_PATH="$WORK/test.db"
export DISTILLERY_EMBEDDING_PROVIDER=deterministic   # avoid Jina/OpenAI calls
# (For some scenarios: DISTILLERY_CONFIG=$WORK/config.yaml with a tiny stanza)
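For the scenarios that need a config file (L1.8 through L1.11), a sketch of the "tiny stanza" might look like the following. Only the store.path key is confirmed by the plan itself (scenario L1.8); any other structure is an assumption about the config schema.

```shell
WORK=${WORK:-$(mktemp -d)}
# hypothetical minimal stanza; adjust keys to the real config schema
cat > "$WORK/cfg.yaml" <<EOF
store:
  path: $WORK/cfgtest.db
EOF
```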

A small helper to seed entries via the MCP store layer (CLI lacks a store subcommand):

# seed.py
import asyncio
from distillery.store.duckdb import DuckDBStore
from distillery.config import load_config
from distillery.embedding.deterministic import DeterministicEmbeddingProvider
from distillery.models import Entry, EntryType

async def main():
    cfg = load_config()
    store = DuckDBStore(cfg.store_path, embedding_provider=DeterministicEmbeddingProvider())
    await store.initialize()
    for i in range(5):
        await store.store(Entry.create(
            content=f"Seed entry {i} with distinct content about topic {i}",
            entry_type=EntryType.INBOX, author="seed", tags=["cli-test", f"seed-{i}"]))
    await store.close()

asyncio.run(main())

Group key

  • L1 — Plumbing: argparse contract; --help, --version, --format json, --config, env vars, invalid subcommand, exit codes. No DB needed for most.
  • L2 — Read-only: status, health against fresh and populated DB.
  • L3 — Maintenance / backfill: retag --dry-run, gh-backfill --dry-run, maintenance classify. All idempotent/dry-run paths.
  • L4 — Data lifecycle: export → import roundtrip; merge vs replace; --yes confirmation gate; malformed input.
  • L5 — Feeds: poll; poll --source URL; no-config error path.
  • L6 — Eval harness: eval --skill recall (or another small scenario set); --save-baseline / --baseline; --compare-cost.

Spawn one subagent per group. They share the worktree at /tmp/distillery-test and the venv at /tmp/distillery-test/.venv. Each scenario must clean up its tmp data on exit.


Group L1 — Plumbing

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L1.1 | --version | distillery --version | exits 0; stdout matches \d+\.\d+\.\d+ |
| L1.2 | top-level --help | distillery --help | exits 0; lists all 9 subcommands |
| L1.3 | invalid subcommand | distillery bogus | exits ≠0; stderr mentions "invalid choice" or similar |
| L1.4 | no subcommand | distillery | exits ≠0 (or 0 with help); stderr/stdout shows usage |
| L1.5 | per-sub --help | for each of status health poll retag gh-backfill export import eval maintenance: distillery <sub> --help | each exits 0 |
| L1.6 | --format json shape — status | distillery --format json status against fresh DB | exits 0; stdout is valid JSON; top-level keys include at minimum entry_count (or equivalent) |
| L1.7 | --format json shape — health | distillery --format json health | exits 0; stdout is valid JSON; includes a status-of-store field |
| L1.8 | --config flag overrides search | write a minimal cfg.yaml with store: {path: $WORK/cfgtest.db}; distillery --config $WORK/cfg.yaml status | exits 0; uses the cfg.yaml DB path (verify by file size or by seeding cfgtest.db) |
| L1.9 | DISTILLERY_CONFIG env var | export DISTILLERY_CONFIG=$WORK/cfg.yaml; distillery status (no --config) | same as L1.8 — env var honored |
| L1.10 | --config beats env var | both set, pointing at different DBs | --config wins (verify by which DB has the seeded entries) |
| L1.11 | malformed config | write cfg.yaml containing :::not yaml:::; distillery --config $WORK/cfg.yaml status | exits ≠0; stderr names the file and the parse error (no traceback leak) |
| L1.12 | missing config file | distillery --config /nonexistent.yaml status | exits ≠0; stderr message readable |

Pass: all 12 rows green. Findings to flag: any unhandled exception traceback (vs structured error message), any subcommand whose --format json returns text, any inconsistent exit-code behaviour.
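Several rows across the groups share the same "exits 0 and stdout is valid JSON" check (L1.6, L1.7, and the --format json scenarios in later groups). A small helper the subagents could source; assert_json is our name for an illustrative wrapper, not something in the repo:

```shell
# assert_json CMD ARGS... -- PASS iff the command exits 0 AND prints valid JSON
assert_json() {
  local out
  out=$("$@") || { echo "FAIL (exit code): $*"; return 1; }
  printf '%s' "$out" | python3 -m json.tool >/dev/null 2>&1 \
    || { echo "FAIL (not JSON): $*"; return 1; }
  echo "PASS: $*"
}
```

Usage: assert_json distillery --format json status
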


Group L2 — Read-only (status, health)

Setup: fresh tmp DB.

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L2.1 | status (fresh) | distillery status | exits 0; reports 0 entries or equivalent empty state |
| L2.2 | status (populated) | seed 5 entries; distillery status | exits 0; entry count = 5; covers expected fields (counts by type/status if present) |
| L2.3 | status JSON shape | distillery --format json status after L2.2 | JSON parses; entry_count = 5; per-type breakdown if present |
| L2.4 | health on healthy DB | distillery health | exits 0; "OK" or equivalent |
| L2.5 | health on broken DB | corrupt the DB file (truncate, or set DISTILLERY_DB_PATH=/dev/null); distillery health | exits ≠0; stderr surfaces the failure cleanly (no Python traceback leak) |
| L2.6 | health JSON | distillery --format json health | JSON parses; status field present |

Pass: L2.1–4, L2.6 green. L2.5 must produce a structured error, not a raw traceback — flag if it does.
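One way to manufacture L2.5's broken DB, sketched on a throwaway file. Whether truncation alone makes DuckDB report corruption is an assumption; DISTILLERY_DB_PATH=/dev/null is the fallback the row already names.

```shell
DB=$(mktemp)
printf 'stand-in DB bytes' > "$DB"   # pretend this is a real DuckDB file
: > "$DB"                            # truncate in place
[ ! -s "$DB" ] && echo "truncated: $DB"
# then point the CLI at it and expect a structured error, not a traceback:
# DISTILLERY_DB_PATH="$DB" distillery health; echo "exit=$?"
```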


Group L3 — Maintenance / backfill

Setup: tmp DB seeded with mixed entries (some inbox, some github, some feed). gh-backfill needs entries that look like pre-#312 GitHub entries (project=null, empty tags).

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L3.1 | retag dry-run | seed 3 feed entries; distillery retag --dry-run | exits 0; reports N entries that WOULD be updated; DB unchanged (assert tag fields identical pre/post) |
| L3.2 | retag actual | repeat without --dry-run | exits 0; tags actually written; second invocation reports 0 changes (idempotent) |
| L3.3 | retag --force | re-run with --force after L3.2 | exits 0; reports all N updated regardless of empty-tags state |
| L3.4 | gh-backfill dry-run | seed pre-#312 GitHub entries (project=null, tags=[]); distillery gh-backfill --dry-run | exits 0; reports N entries that WOULD be updated; DB unchanged |
| L3.5 | gh-backfill actual | repeat without --dry-run | exits 0; entries now have non-null project, non-empty tags including source/github and repo/<name>; idempotent on second run |
| L3.6 | gh-backfill no-op DB | empty DB; distillery gh-backfill | exits 0; reports 0 candidates |
| L3.7 | maintenance classify (no entries) | distillery maintenance classify | exits 0; reports 0 processed |
| L3.8 | maintenance classify (with pending_review) | seed 3 entries, classify each with confidence 0.3 to manufacture pending_review; distillery maintenance classify | exits 0; reports N processed; counts by disposition |
| L3.9 | maintenance classify --format json | distillery --format json maintenance classify | JSON; includes per-disposition counts |

Pass: all 9 rows green. Flag: any non-idempotent path (re-running shouldn't double-tag), any traceback on empty DB.


Group L4 — Data lifecycle (export / import)

Setup: tmp DB seeded with 5 entries + 2 feed sources.

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L4.1 | export | distillery export --output $WORK/dump.json | exits 0; file exists; valid JSON; contains entries[] and feed_sources[]; NO embedding vectors in payload (regression guard) |
| L4.2 | export overwrite | run twice to same path | second run overwrites cleanly, no error |
| L4.3 | import merge to fresh DB | new tmp DB; distillery import --input $WORK/dump.json --mode merge | exits 0; entry count matches dump; recomputes embeddings (verify by distillery status showing entries) |
| L4.4 | import merge skips existing | re-run import on same DB | exits 0; reports skipped count = N; total count unchanged |
| L4.5 | import replace requires --yes | distillery import --input $WORK/dump.json --mode replace (interactive stdin closed) | exits ≠0 OR prompts; must NOT silently delete |
| L4.6 | import replace --yes | echo y \| distillery import --input ... --mode replace --yes | exits 0; pre-existing entries dropped; new ones inserted |
| L4.7 | import malformed JSON | echo "not json" > $WORK/bad.json && distillery import --input $WORK/bad.json --mode merge | exits ≠0; stderr names parse error |
| L4.8 | import roundtrip fidelity | export → fresh DB → import → re-export → diff the two JSONs (sort keys) | only differences should be timestamps and embedding vectors; no semantic loss |
| L4.9 | import feed sources | dump has 2 sources; after import, list sources via distillery_watch action=list (in-process check or the MCP server if running) | both sources present with original poll_interval_minutes and trust_weight preserved |

Pass: all 9 rows green. L4.5 is critical — a silent destructive replace should be filed as a security/safety bug.
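L4.8's "diff the two JSONs (sort keys)" step can be sketched as below. The volatile field names (created_at, updated_at, embedding) are assumptions about the dump schema; adjust them to whatever distillery export actually emits.

```python
import json

VOLATILE = {"created_at", "updated_at", "embedding"}  # assumed field names

def strip_volatile(obj):
    """Recursively drop volatile keys and sort dict keys for a stable comparison."""
    if isinstance(obj, dict):
        return {k: strip_volatile(v) for k, v in sorted(obj.items())
                if k not in VOLATILE}
    if isinstance(obj, list):
        return [strip_volatile(v) for v in obj]
    return obj

def dumps_match(path_a, path_b):
    """True iff the two export dumps agree on everything non-volatile."""
    with open(path_a) as fa, open(path_b) as fb:
        return strip_volatile(json.load(fa)) == strip_volatile(json.load(fb))
```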


Group L5 — Feeds (poll)

Setup: tmp DB with one watched RSS source. Use a stable tiny feed (https://github.com/<user>.atom or a vendored test feed file) — or a local stub HTTP server on :9101 returning a fixed RSS payload, to avoid network flake.
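The "local stub HTTP server returning a fixed RSS payload" is a few lines of stdlib. This sketch assumes the feed poller only needs a 200 response with an RSS body; port 9101 just mirrors the setup note above.

```python
import http.server
import threading

RSS = b"""<?xml version="1.0"?>
<rss version="2.0"><channel><title>stub</title>
<item><title>one</title><link>http://127.0.0.1:9101/1</link></item>
</channel></rss>"""

class StubRSSHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # every path gets the same fixed payload -- enough for poll scenarios
        self.send_response(200)
        self.send_header("Content-Type", "application/rss+xml")
        self.send_header("Content-Length", str(len(RSS)))
        self.end_headers()
        self.wfile.write(RSS)

    def log_message(self, *args):  # keep scenario output clean
        pass

def serve(port=9101):
    """Start the stub in a daemon thread; call .shutdown() on the result when done."""
    srv = http.server.HTTPServer(("127.0.0.1", port), StubRSSHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```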

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L5.1 | poll all sources (no sources) | empty DB; distillery poll | exits 0; reports 0 sources polled |
| L5.2 | poll all sources (1 source) | seed 1 RSS source; distillery poll | exits 0; reports source polled; entry count > 0 if feed has items |
| L5.3 | poll --source URL filter | seed 2 sources; distillery poll --source https://example.test/a.rss | exits 0; only that source polled (verify by which source's last_polled_at updated) |
| L5.4 | poll --source nonexistent | distillery poll --source https://not-watched.test/x.rss | exits ≠0 OR exits 0 with "no matching source" — document either; must not poll all sources by accident |
| L5.5 | poll JSON | distillery --format json poll | JSON; per-source result with item_count + status |
| L5.6 | poll persistence | after L5.2, run distillery status | reflects ingested entries; last_polled_at populated on the source |

Pass: all 6 green. Flag: L5.4 silently polling everything when --source doesn't match.


Group L6 — Eval harness

Setup: tmp DB; requires Anthropic API key OR a stub. Likely runs against live API by default — gate with ANTHROPIC_API_KEY check; SKIP scenarios if unset.

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L6.1 | eval --help | distillery eval --help | exits 0; lists all flags |
| L6.2 | eval default scenarios dir | distillery eval --skill recall (or smallest skill) | exits 0; reports per-scenario pass/fail; respects --model default haiku |
| L6.3 | eval --save-baseline | distillery eval --skill recall --save-baseline $WORK/base.json | exits 0; baseline file exists; valid JSON with scenario results + cost |
| L6.4 | eval --baseline compare | repeat with --baseline $WORK/base.json | exits 0; reports regressions (none expected on identical run) |
| L6.5 | eval --compare-cost | distillery eval --skill recall --baseline $WORK/base.json --compare-cost | exits 0; reports cost delta vs baseline |
| L6.6 | eval missing scenarios dir | distillery eval --scenarios-dir /nonexistent | exits ≠0; readable error |
| L6.7 | eval invalid skill | distillery eval --skill totally-not-a-skill | exits 0 with "0 scenarios run" OR exits ≠0 with clear error — document |
| L6.8 | eval --format json | distillery --format json eval --skill recall | JSON parses; per-scenario object schema documented |

Pass: L6.1, L6.6, L6.7, L6.8 green. L6.2–L6.5 SKIP if ANTHROPIC_API_KEY not set.


Subagent dispatch

Spawn 6 parallel subagents (one per group). All share /tmp/distillery-test worktree + .venv. Each subagent uses its own mktemp -d for tmp data + DB, and cleans up on exit.

Per-subagent prompt template:

You are the Group L test runner for the distillery CLI test plan. Worktree at /tmp/distillery-test (commit 5e4f924). Use cd /tmp/distillery-test && .venv/bin/distillery … for every CLI invocation. For each scenario in your assigned group, execute exactly what's listed, capture stdout/stderr/exit code, decide PASS/FAIL/BLOCKED with one-line evidence. Do NOT modify source. Use mktemp -d for any DB or output file; clean up on exit. If a scenario needs a populated DB, use the seed.py snippet at the top of the plan. Do NOT touch the user's ~/.distillery/ or shared resources.

Final report: a markdown table with columns | # | Scenario | Result | Evidence |, plus a Findings section for any real bug (not a coverage/test-naming gap).

Aggregator (this conversation): collects per-group tables, posts an aggregate report to a follow-up GitHub issue mirroring #355.


Findings to file as issues if observed

  • Critical: any traceback leak instead of structured error (L1.11, L1.12, L2.5, L4.7).
  • Critical: silent destructive replace without --yes confirmation (L4.5).
  • Critical: --source URL filter falling through to all sources when no match (L5.4).
  • Major: non-idempotent backfill (L3.5 second run mutating again).
  • Major: --format json returning text on any subcommand (L1.6, L1.7, L3.9, L5.5, L6.8).
  • Major: export payload including embedding vectors (L4.1 regression guard — embeddings should be recomputed on import, not transported).
  • Minor: tool count or version drift in --version / --help output.

Verification before dispatch

cd /tmp/distillery-test
.venv/bin/distillery --help                    # CLI installs and runs
.venv/bin/python -c "from distillery.cli import main; print('ok')"   # importable
.venv/bin/pytest tests/test_cli.py tests/test_cli_export_import.py -q --no-header 2>&1 | tail -3

If pytest baseline (~50 cases) is green, dispatch the 6 group subagents in parallel.

Metadata

Labels: enhancement (New feature or request)