
Test plan: Distillery CLI surface coverage #369

@norrietaylor

Description


Test Plan — Distillery CLI surface

Context

The MCP test plan #355 covers the in-protocol API and ambient-intel paths. This plan covers the distillery CLI, which is the operator/maintenance surface used in cron, container entrypoints, and one-off scripts. The CLI is its own dispatch path — argparse → handler functions in src/distillery/cli.py — not a thin wrapper over the MCP tool layer, so it warrants its own coverage.

Surface (commit 5e4f924 / staging/api-hardening): 9 top-level subcommands + 1 maintenance subgroup.

distillery [--config PATH] [--format {text,json}] [--version] {status,health,poll,retag,gh-backfill,export,import,eval,maintenance}
distillery maintenance {classify}

Existing pytest coverage (tests/test_cli.py, tests/test_cli_export_import.py): status, health, --config, export, import, retag, maintenance classify, --version. Gaps: poll, gh-backfill, eval, --format json shape across the full surface, exit-code matrix, env-var/config-file precedence, error paths.

This plan exercises the entire CLI from a real shell against ephemeral tmp DBs — verifying behaviour the way an operator's cron job or a CI deploy script does.

Surface map

| Subcommand | Side effect | Net | Existing pytest |
|---|---|---|---|
| status | none | none | yes |
| health | DB connect | none | yes |
| poll [--source URL] | embed + write | yes (poll API) | partial (FeedPoller class) |
| retag [--dry-run] [--force] | write tags | none | yes |
| gh-backfill [--dry-run] | write project/tags/author/metadata | none | none |
| export --output PATH | write file | none | yes |
| import --input PATH [--mode {merge,replace}] [--yes] | write DB rows | none | yes |
| eval [--skill] [--scenarios-dir] [--save-baseline] [--baseline] [--model] [--compare-cost] | calls Claude API | yes (Anthropic) | unknown |
| maintenance classify | reclassify entries | none | yes |

Fixtures

Each scenario uses an isolated tmp directory:

WORK=$(mktemp -d)
trap 'rm -rf "$WORK"' EXIT                           # register cleanup before anything else can fail
export DISTILLERY_DB_PATH="$WORK/test.db"
export DISTILLERY_EMBEDDING_PROVIDER=deterministic   # avoid Jina/OpenAI calls
# (For some scenarios: DISTILLERY_CONFIG=$WORK/config.yaml with a tiny stanza)
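For the scenarios that need a config file (L1.8 through L1.11), a sketch of the "tiny stanza" might look like the following. Only the store.path key is confirmed by the plan itself (scenario L1.8); any other structure is an assumption about the config schema.

```shell
WORK=${WORK:-$(mktemp -d)}
# hypothetical minimal stanza; adjust keys to the real config schema
cat > "$WORK/cfg.yaml" <<EOF
store:
  path: $WORK/cfgtest.db
EOF
```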

A small helper to seed entries via the MCP store layer (CLI lacks a store subcommand):

# seed.py
import asyncio
from distillery.store.duckdb import DuckDBStore
from distillery.config import load_config
from distillery.embedding.deterministic import DeterministicEmbeddingProvider
from distillery.models import Entry, EntryType

async def main():
    cfg = load_config()
    store = DuckDBStore(cfg.store_path, embedding_provider=DeterministicEmbeddingProvider())
    await store.initialize()
    for i in range(5):
        await store.store(Entry.create(
            content=f"Seed entry {i} with distinct content about topic {i}",
            entry_type=EntryType.INBOX, author="seed", tags=["cli-test", f"seed-{i}"]))
    await store.close()

asyncio.run(main())

Group key

  • L1 — Plumbing: argparse contract; --help, --version, --format json, --config, env vars, invalid subcommand, exit codes. No DB needed for most.
  • L2 — Read-only: status, health against fresh and populated DB.
  • L3 — Maintenance / backfill: retag --dry-run, gh-backfill --dry-run, maintenance classify. All idempotent/dry-run paths.
  • L4 — Data lifecycle: export → import roundtrip; merge vs replace; --yes confirmation gate; malformed input.
  • L5 — Feeds: poll; poll --source URL; no-config error path.
  • L6 — Eval harness: eval --skill recall (or another small scenario set); --save-baseline / --baseline; --compare-cost.

Spawn one subagent per group. They share the worktree at /tmp/distillery-test and the venv at /tmp/distillery-test/.venv. Each scenario must clean up its tmp data on exit.


Group L1 — Plumbing

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L1.1 | --version | distillery --version | exits 0; stdout matches \d+\.\d+\.\d+ |
| L1.2 | top-level --help | distillery --help | exits 0; lists all 9 subcommands |
| L1.3 | invalid subcommand | distillery bogus | exits ≠0; stderr mentions "invalid choice" or similar |
| L1.4 | no subcommand | distillery | exits ≠0 (or 0 with help); stderr/stdout shows usage |
| L1.5 | per-sub --help | for each of status health poll retag gh-backfill export import eval maintenance: distillery <sub> --help | each exits 0 |
| L1.6 | --format json shape — status | distillery --format json status against fresh DB | exits 0; stdout is valid JSON; top-level keys include at minimum entry_count (or equivalent) |
| L1.7 | --format json shape — health | distillery --format json health | exits 0; stdout is valid JSON; includes a status-of-store field |
| L1.8 | --config flag overrides search | write a minimal cfg.yaml with store: {path: $WORK/cfgtest.db}; distillery --config $WORK/cfg.yaml status | exits 0; uses the cfg.yaml DB path (verify by file size or by seeding cfgtest.db) |
| L1.9 | DISTILLERY_CONFIG env var | export DISTILLERY_CONFIG=$WORK/cfg.yaml; distillery status (no --config) | same as L1.8 — env var honored |
| L1.10 | --config beats env var | both set, pointing at different DBs | --config wins (verify by which DB has the seeded entries) |
| L1.11 | malformed config | write cfg.yaml containing :::not yaml:::; distillery --config $WORK/cfg.yaml status | exits ≠0; stderr names the file and the parse error (no traceback leak) |
| L1.12 | missing config file | distillery --config /nonexistent.yaml status | exits ≠0; stderr message readable |

Pass: all 12 rows green. Findings to flag: any unhandled exception traceback (vs structured error message), any subcommand whose --format json returns text, any inconsistent exit-code behaviour.
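Several rows across the groups share the same "exits 0 and stdout is valid JSON" check (L1.6, L1.7, and the --format json scenarios in later groups). A small helper the subagents could source; assert_json is our name for an illustrative wrapper, not something in the repo:

```shell
# assert_json CMD ARGS... -- PASS iff the command exits 0 AND prints valid JSON
assert_json() {
  local out
  out=$("$@") || { echo "FAIL (exit code): $*"; return 1; }
  printf '%s' "$out" | python3 -m json.tool >/dev/null 2>&1 \
    || { echo "FAIL (not JSON): $*"; return 1; }
  echo "PASS: $*"
}
```

Usage: assert_json distillery --format json status
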


Group L2 — Read-only (status, health)

Setup: fresh tmp DB.

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L2.1 | status (fresh) | distillery status | exits 0; reports 0 entries or equivalent empty state |
| L2.2 | status (populated) | seed 5 entries; distillery status | exits 0; entry count = 5; covers expected fields (counts by type/status if present) |
| L2.3 | status JSON shape | distillery --format json status after L2.2 | JSON parses; entry_count = 5; per-type breakdown if present |
| L2.4 | health on healthy DB | distillery health | exits 0; "OK" or equivalent |
| L2.5 | health on broken DB | corrupt the DB file (truncate, or set DISTILLERY_DB_PATH=/dev/null); distillery health | exits ≠0; stderr surfaces the failure cleanly (no Python traceback leak) |
| L2.6 | health JSON | distillery --format json health | JSON parses; status field present |

Pass: L2.1–4, L2.6 green. L2.5 must produce a structured error, not a raw traceback — flag if it does.
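One way to manufacture L2.5's broken DB, sketched on a throwaway file. Whether truncation alone makes DuckDB report corruption is an assumption; DISTILLERY_DB_PATH=/dev/null is the fallback the row already names.

```shell
DB=$(mktemp)
printf 'stand-in DB bytes' > "$DB"   # pretend this is a real DuckDB file
: > "$DB"                            # truncate in place
[ ! -s "$DB" ] && echo "truncated: $DB"
# then point the CLI at it and expect a structured error, not a traceback:
# DISTILLERY_DB_PATH="$DB" distillery health; echo "exit=$?"
```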


Group L3 — Maintenance / backfill

Setup: tmp DB seeded with mixed entries (some inbox, some github, some feed). gh-backfill needs entries that look like pre-#312 GitHub entries (project=null, empty tags).

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L3.1 | retag dry-run | seed 3 feed entries; distillery retag --dry-run | exits 0; reports N entries that WOULD be updated; DB unchanged (assert tag fields identical pre/post) |
| L3.2 | retag actual | repeat without --dry-run | exits 0; tags actually written; second invocation reports 0 changes (idempotent) |
| L3.3 | retag --force | re-run with --force after L3.2 | exits 0; reports all N updated regardless of empty-tags state |
| L3.4 | gh-backfill dry-run | seed pre-#312 GitHub entries (project=null, tags=[]); distillery gh-backfill --dry-run | exits 0; reports N entries that WOULD be updated; DB unchanged |
| L3.5 | gh-backfill actual | repeat without --dry-run | exits 0; entries now have non-null project, non-empty tags including source/github and repo/<name>; idempotent on second run |
| L3.6 | gh-backfill no-op DB | empty DB; distillery gh-backfill | exits 0; reports 0 candidates |
| L3.7 | maintenance classify (no entries) | distillery maintenance classify | exits 0; reports 0 processed |
| L3.8 | maintenance classify (with pending_review) | seed 3 entries, classify each with confidence 0.3 to manufacture pending_review; distillery maintenance classify | exits 0; reports N processed; counts by disposition |
| L3.9 | maintenance classify --format json | distillery --format json maintenance classify | JSON; includes per-disposition counts |

Pass: all 9 rows green. Flag: any non-idempotent path (re-running shouldn't double-tag), any traceback on empty DB.


Group L4 — Data lifecycle (export / import)

Setup: tmp DB seeded with 5 entries + 2 feed sources.

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L4.1 | export | distillery export --output $WORK/dump.json | exits 0; file exists; valid JSON; contains entries[] and feed_sources[]; NO embedding vectors in payload (regression guard) |
| L4.2 | export overwrite | run twice to same path | second run overwrites cleanly, no error |
| L4.3 | import merge to fresh DB | new tmp DB; distillery import --input $WORK/dump.json --mode merge | exits 0; entry count matches dump; recomputes embeddings (verify by distillery status showing entries) |
| L4.4 | import merge skips existing | re-run import on same DB | exits 0; reports skipped count = N; total count unchanged |
| L4.5 | import replace requires --yes | distillery import --input $WORK/dump.json --mode replace (interactive stdin closed) | exits ≠0 OR prompts; must NOT silently delete |
| L4.6 | import replace --yes | echo y \| distillery import --input ... --mode replace --yes | exits 0; pre-existing entries dropped; new ones inserted |
| L4.7 | import malformed JSON | echo "not json" > $WORK/bad.json && distillery import --input $WORK/bad.json --mode merge | exits ≠0; stderr names parse error |
| L4.8 | import roundtrip fidelity | export → fresh DB → import → re-export → diff the two JSONs (sort keys) | only differences should be timestamps and embedding vectors; no semantic loss |
| L4.9 | import feed sources | dump has 2 sources; after import, list sources via distillery_watch action=list (in-process check or the MCP server if running) | both sources present with original poll_interval_minutes and trust_weight preserved |

Pass: all 9 rows green. L4.5 is critical — a silent destructive replace should be filed as a security/safety bug.
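L4.8's "diff the two JSONs (sort keys)" step can be sketched as below. The volatile field names (created_at, updated_at, embedding) are assumptions about the dump schema; adjust them to whatever distillery export actually emits.

```python
import json

VOLATILE = {"created_at", "updated_at", "embedding"}  # assumed field names

def strip_volatile(obj):
    """Recursively drop volatile keys and sort dict keys for a stable comparison."""
    if isinstance(obj, dict):
        return {k: strip_volatile(v) for k, v in sorted(obj.items())
                if k not in VOLATILE}
    if isinstance(obj, list):
        return [strip_volatile(v) for v in obj]
    return obj

def dumps_match(path_a, path_b):
    """True iff the two export dumps agree on everything non-volatile."""
    with open(path_a) as fa, open(path_b) as fb:
        return strip_volatile(json.load(fa)) == strip_volatile(json.load(fb))
```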


Group L5 — Feeds (poll)

Setup: tmp DB with one watched RSS source. Use a stable tiny feed (https://github.com/<user>.atom or a vendored test feed file) — or a local stub HTTP server on :9101 returning a fixed RSS payload, to avoid network flake.
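The "local stub HTTP server returning a fixed RSS payload" is a few lines of stdlib. This sketch assumes the feed poller only needs a 200 response with an RSS body; port 9101 just mirrors the setup note above.

```python
import http.server
import threading

RSS = b"""<?xml version="1.0"?>
<rss version="2.0"><channel><title>stub</title>
<item><title>one</title><link>http://127.0.0.1:9101/1</link></item>
</channel></rss>"""

class StubRSSHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # every path gets the same fixed payload -- enough for poll scenarios
        self.send_response(200)
        self.send_header("Content-Type", "application/rss+xml")
        self.send_header("Content-Length", str(len(RSS)))
        self.end_headers()
        self.wfile.write(RSS)

    def log_message(self, *args):  # keep scenario output clean
        pass

def serve(port=9101):
    """Start the stub in a daemon thread; call .shutdown() on the result when done."""
    srv = http.server.HTTPServer(("127.0.0.1", port), StubRSSHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```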

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L5.1 | poll all sources (no sources) | empty DB; distillery poll | exits 0; reports 0 sources polled |
| L5.2 | poll all sources (1 source) | seed 1 RSS source; distillery poll | exits 0; reports source polled; entry count > 0 if feed has items |
| L5.3 | poll --source URL filter | seed 2 sources; distillery poll --source https://example.test/a.rss | exits 0; only that source polled (verify by which source's last_polled_at updated) |
| L5.4 | poll --source nonexistent | distillery poll --source https://not-watched.test/x.rss | exits ≠0 OR exits 0 with "no matching source" — document either; must not poll all sources by accident |
| L5.5 | poll JSON | distillery --format json poll | JSON; per-source result with item_count + status |
| L5.6 | poll persistence | after L5.2, run distillery status | reflects ingested entries; last_polled_at populated on the source |

Pass: all 6 green. Flag: L5.4 silently polling everything when --source doesn't match.


Group L6 — Eval harness

Setup: tmp DB; requires Anthropic API key OR a stub. Likely runs against live API by default — gate with ANTHROPIC_API_KEY check; SKIP scenarios if unset.

| # | Scenario | Command | Pass criteria |
|---|---|---|---|
| L6.1 | eval --help | distillery eval --help | exits 0; lists all flags |
| L6.2 | eval default scenarios dir | distillery eval --skill recall (or smallest skill) | exits 0; reports per-scenario pass/fail; respects --model default haiku |
| L6.3 | eval --save-baseline | distillery eval --skill recall --save-baseline $WORK/base.json | exits 0; baseline file exists; valid JSON with scenario results + cost |
| L6.4 | eval --baseline compare | repeat with --baseline $WORK/base.json | exits 0; reports regressions (none expected on identical run) |
| L6.5 | eval --compare-cost | distillery eval --skill recall --baseline $WORK/base.json --compare-cost | exits 0; reports cost delta vs baseline |
| L6.6 | eval missing scenarios dir | distillery eval --scenarios-dir /nonexistent | exits ≠0; readable error |
| L6.7 | eval invalid skill | distillery eval --skill totally-not-a-skill | exits 0 with "0 scenarios run" OR exits ≠0 with clear error — document |
| L6.8 | eval --format json | distillery --format json eval --skill recall | JSON parses; per-scenario object schema documented |

Pass: L6.1, L6.6, L6.7, L6.8 green. L6.2–L6.5 SKIP if ANTHROPIC_API_KEY not set.


Subagent dispatch

Spawn 6 parallel subagents (one per group). All share /tmp/distillery-test worktree + .venv. Each subagent uses its own mktemp -d for tmp data + DB, and cleans up on exit.

Per-subagent prompt template:

You are the Group L test runner for the distillery CLI test plan. Worktree at /tmp/distillery-test (commit 5e4f924). Use cd /tmp/distillery-test && .venv/bin/distillery … for every CLI invocation. For each scenario in your assigned group, execute exactly what's listed, capture stdout/stderr/exit code, decide PASS/FAIL/BLOCKED with one-line evidence. Do NOT modify source. Use mktemp -d for any DB or output file; clean up on exit. If a scenario needs a populated DB, use the seed.py snippet at the top of the plan. Do NOT touch the user's ~/.distillery/ or shared resources.

Final report: a markdown table with columns | # | Scenario | Result | Evidence |, plus a Findings section for any real bug (not a coverage/test-naming gap).

Aggregator (this conversation): collects per-group tables, posts an aggregate report to a follow-up GitHub issue mirroring #355.


Findings to file as issues if observed

  • Critical: any traceback leak instead of structured error (L1.11, L1.12, L2.5, L4.7).
  • Critical: silent destructive replace without --yes confirmation (L4.5).
  • Critical: --source URL filter falling through to all sources when no match (L5.4).
  • Major: non-idempotent backfill (L3.5 second run mutating again).
  • Major: --format json returning text on any subcommand (L1.6, L1.7, L3.9, L5.5, L6.8).
  • Major: export payload including embedding vectors (L4.1 regression guard — embeddings should be recomputed on import, not transported).
  • Minor: tool count or version drift in --version / --help output.

Verification before dispatch

cd /tmp/distillery-test
.venv/bin/distillery --help                    # CLI installs and runs
.venv/bin/python -c "from distillery.cli import main; print('ok')"   # importable
.venv/bin/pytest tests/test_cli.py tests/test_cli_export_import.py -q --no-header 2>&1 | tail -3

If pytest baseline (~50 cases) is green, dispatch the 6 group subagents in parallel.

Metadata

Labels: enhancement (New feature or request)