feat(plugin): skill-evolver — DSPy + GEPA evolution for SKILL.md#344
Conversation
Bundled plugin at plugins/skill-evolver that evolves SKILL.md files along two distinct optimization targets:

- description: trigger-classification F1 via a judge LM picking among the real installed skill pool
- body: LLM-judge scoring (correctness/procedure/conciseness) of a follower LM executing the skill against synthesized or curated tasks

Evaluation data is assembled from three sources (synthetic, golden, traces) matching the hermes-agent pattern, with traces mined from the HybridClaw SQLite db (skill_observations) joined with session transcripts.

TS/Node plugin surface registers a skill-evolver command (list, extract, evolve, preview, show, watch, tui) plus two tools (skill_evolver_list and skill_evolver_extract). Evolution candidates pass a constraint gate (size/shape/growth cap + optional test run) before a branch is created and a PR is optionally opened via gh.

Python sidecar runs DSPy + GEPA in a per-plugin venv (auto-provisioned via the existing pipDependencies mechanism). Rich-based TUI offers an interactive skill browser plus live-refresh dashboards for the show and watch subcommands.

Docs live at docs/content/extensibility/skill-evolver-plugin.md. Tests cover skill-locator parsing, plugin registration shape, and the SKILL.md frontmatter round-trip in Python.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Adds a new bundled “skill-evolver” plugin that evolves a skill’s SKILL.md (description and/or body) using a Python sidecar (DSPy + GEPA), builds evaluation datasets from synthetic/golden/trace sources, and optionally applies the best variant and opens a PR.
Changes:
- Introduces the `plugins/skill-evolver` plugin: TS command + tools plus a Python evolution runner (targets, datasets, constraints, Rich TUI/reporting).
- Implements trace extraction from the HybridClaw SQLite DB plus session transcripts to build trace-based datasets.
- Adds docs and unit tests covering TS skill location/plugin registration and Python SKILL.md parsing/reassembly.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| tests/skill-evolver-plugin.test.ts | Vitest coverage for TS skill locator + plugin registration. |
| plugins/skill-evolver/tests/test_skill_module.py | Python tests for SKILL.md parse/reassemble round-trips. |
| plugins/skill-evolver/src/trace-extractor.js | SQLite + transcript scanning to build trace datasets. |
| plugins/skill-evolver/src/skill-locator.js | Locates and parses skills across repo roots. |
| plugins/skill-evolver/src/python-bridge.js | Spawns Python module, captures output, and bootstraps local package. |
| plugins/skill-evolver/src/index.js | Plugin entrypoint: command dispatcher + tool registrations. |
| plugins/skill-evolver/src/apply-variant.js | Applies an evolved SKILL.md, runs tests, commits, optionally opens PR. |
| plugins/skill-evolver/python/skill_evolver/tui.py | Rich TUI: show/watch/browse runs and skills. |
| plugins/skill-evolver/python/skill_evolver/targets/joint.py | Joint evolution strategy (body then description + cross-validation). |
| plugins/skill-evolver/python/skill_evolver/targets/description.py | Description evolution target (trigger routing fitness). |
| plugins/skill-evolver/python/skill_evolver/targets/body.py | Body evolution target (executor/judge fitness). |
| plugins/skill-evolver/python/skill_evolver/targets/__init__.py | Target package marker. |
| plugins/skill-evolver/python/skill_evolver/skill_module.py | SKILL.md parsing + frontmatter-preserving reassembly. |
| plugins/skill-evolver/python/skill_evolver/report.py | Markdown report rendering (diffs + scores + gates). |
| plugins/skill-evolver/python/skill_evolver/fitness/executor.py | Body fitness: follower LM + judge rubric scoring. |
| plugins/skill-evolver/python/skill_evolver/fitness/classifier.py | Description fitness: routing classifier scoring (F1). |
| plugins/skill-evolver/python/skill_evolver/fitness/__init__.py | Fitness package marker. |
| plugins/skill-evolver/python/skill_evolver/evolve.py | Top-level evolution orchestrator writing result.json. |
| plugins/skill-evolver/python/skill_evolver/dataset/traces.py | Builds trigger/task examples from extracted trace payloads. |
| plugins/skill-evolver/python/skill_evolver/dataset/synthetic.py | LLM-generated synthetic triggers/tasks. |
| plugins/skill-evolver/python/skill_evolver/dataset/merge.py | Dedupe + train/val split utilities. |
| plugins/skill-evolver/python/skill_evolver/dataset/golden.py | Loads committed “golden” datasets from datasets/skills/.... |
| plugins/skill-evolver/python/skill_evolver/dataset/__init__.py | Dataset package marker. |
| plugins/skill-evolver/python/skill_evolver/constraints.py | Hard constraint gates (size/shape/growth/test suite). |
| plugins/skill-evolver/python/skill_evolver/cli.py | Python CLI (python -m skill_evolver ...). |
| plugins/skill-evolver/python/skill_evolver/main.py | Python module entrypoint. |
| plugins/skill-evolver/python/skill_evolver/__init__.py | Python package metadata/version. |
| plugins/skill-evolver/pyproject.toml | Python packaging metadata/dependencies. |
| plugins/skill-evolver/package.json | Plugin package metadata (Node engine + module type). |
| plugins/skill-evolver/hybridclaw.plugin.yaml | Plugin manifest: creds, pip deps, external deps, config schema. |
| docs/content/extensibility/skill-evolver-plugin.md | User documentation for the new plugin and workflow. |
```js
api?.logger?.info(
  {
    skill: skill.name,
    observations: traces.observations.length,
    required: config.minTraceObservations,
```
When trace observations are below minTraceObservations, the code logs that it is “dropping traces source”, but sources is not updated (only tracesDatasetPath remains unset). This makes the run metadata/reporting misleading because --sources still includes traces. Consider removing 'traces' from sources in this branch so the recorded sources reflect what was actually used.
```js
if (envDir && path.isAbsolute(envDir)) {
  return path.join(envDir, 'data');
}
return path.join(os.homedir(), '.hybridclaw', 'data');
```
HYBRIDCLAW_DATA_DIR is treated as a runtime home dir in core code and must be absolute (core throws otherwise). Here, a relative value is silently ignored and defaults to ~/.hybridclaw/data, which can cause extraction to target an unexpected location. Consider matching core behavior and throwing when set but not absolute.
Suggested change:

```diff
-if (envDir && path.isAbsolute(envDir)) {
-  return path.join(envDir, 'data');
-}
-return path.join(os.homedir(), '.hybridclaw', 'data');
+if (!envDir) {
+  return path.join(os.homedir(), '.hybridclaw', 'data');
+}
+if (!path.isAbsolute(envDir)) {
+  throw new Error('HYBRIDCLAW_DATA_DIR must be an absolute path');
+}
+return path.join(envDir, 'data');
```
```js
  skillName,
  observations: enriched,
  otherSkillObservations: enrichedOthers,
  transcripts,
};
```
extractTraces() includes raw transcripts in the returned payload. Since handleExtract() persists the entire payload to datasets/skills/<skill>/traces.json, this can write sensitive user transcript content/PII into a repo-local file (easy to accidentally commit). Consider omitting transcripts from the serialized dataset by default (persist only derived user_prompt + minimal observation fields), or gating transcript persistence behind an explicit opt-in.
```js
const [command, ...args] = cmd.split(/\s+/).filter(Boolean);
if (!command) throw new Error('Empty shell command');
const result = spawnSync(command, args, {
  cwd,
  encoding: 'utf-8',
  env: process.env,
```
runShell() splits testCommand on whitespace, which breaks quoted arguments and paths with spaces. Since testCommand is user-configurable, parse it with a shell-aware splitter or execute via spawnSync(..., { shell: true }) to preserve quoting semantics.
Suggested change:

```diff
-const [command, ...args] = cmd.split(/\s+/).filter(Boolean);
-if (!command) throw new Error('Empty shell command');
-const result = spawnSync(command, args, {
-  cwd,
-  encoding: 'utf-8',
-  env: process.env,
+const command = String(cmd || '').trim();
+if (!command) throw new Error('Empty shell command');
+const result = spawnSync(command, {
+  cwd,
+  encoding: 'utf-8',
+  env: process.env,
+  shell: true,
```
```js
function resolvePython() {
  const venv = venvPythonPath();
  if (fs.existsSync(venv)) return venv;
  const sys = process.platform === 'win32' ? 'py' : 'python3';
  return sys;
```
When the plugin .venv is missing, resolvePython() falls back to system python3/py. In that mode, ensurePackageInstalled() installs the local package with --no-deps, so required deps (dspy, gepa, etc.) will often be missing and the subsequent python -m skill_evolver run will fail in a confusing way. Consider treating missing .venv as an actionable error (e.g., instruct to run hybridclaw plugin install ./plugins/skill-evolver --yes), or proactively create the venv and install declared dependencies before running.
Suggested change:

```diff
-function resolvePython() {
-  const venv = venvPythonPath();
-  if (fs.existsSync(venv)) return venv;
-  const sys = process.platform === 'win32' ? 'py' : 'python3';
-  return sys;
+function missingPluginVenvError(venv) {
+  return new Error(
+    `Skill Evolver plugin environment is missing: ${venv}\n` +
+      'Run `hybridclaw plugin install ./plugins/skill-evolver --yes` to create the plugin virtual environment and install its dependencies.',
+  );
+}
+
+function resolvePython() {
+  const venv = venvPythonPath();
+  if (fs.existsSync(venv)) return venv;
+  throw missingPluginVenvError(venv);
```
```js
  skillName: skill.name,
  repoRoot,
  limit: 500,
  includeOtherSkills: false,
```
handleList() computes failure rates by calling extractTraces() for each skill. extractTraces() opens SQLite and scans transcript files, so this becomes very slow on real installs. Consider a lightweight aggregate query grouped by skill_name (no transcript loading), or add an extractTraces({ includeTranscripts: false }) mode and use it here.
Suggested change:

```diff
-  includeOtherSkills: false,
+  includeOtherSkills: false,
+  includeTranscripts: false,
```
```python
best_raw = result.get("bestVariantRaw")
if best_raw:
    from skill_evolver.skill_module import load_skill
    from io import StringIO  # noqa: F401 - kept for parity with SKILL parser imports
```
Report generation parses bestVariantRaw with string splitting and a naive description: line extraction. This will mis-handle valid YAML (quoted values, |- blocks, multiline descriptions, reordered keys) and can produce misleading diffs in the PR/report. Prefer parsing bestVariantRaw with the same SKILL.md parser used elsewhere (e.g., factor a load_skill_from_string() helper in skill_module.py) and diff parsed description/body.
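One way to realize that suggestion is a small string-level entry point that reuses a YAML parser instead of line splitting. A minimal sketch, assuming PyYAML (which the manifest declares) is available; `load_skill_from_string` and the returned dict shape are hypothetical here, and the real fix would factor this out of the plugin's existing `skill_module.load_skill`:

```python
import yaml  # PyYAML — declared in the plugin's pip deps per the manifest


def load_skill_from_string(raw: str) -> dict:
    """Parse a SKILL.md document from an in-memory string.

    Handles quoted values, |- block scalars, and reordered keys via the
    YAML parser, unlike a naive `description:` line extraction.
    """
    text = raw.replace("\r\n", "\n")  # normalize CRLF before splitting
    if not text.startswith("---\n"):
        return {"frontmatter": {}, "body": text}
    # maxsplit=2: anything after the closing delimiter stays in the body,
    # even if the body itself contains a "---" line.
    parts = text.split("---\n", 2)
    if len(parts) < 3:  # unterminated frontmatter: treat whole doc as body
        return {"frontmatter": {}, "body": text}
    _, frontmatter, body = parts
    return {
        "frontmatter": yaml.safe_load(frontmatter) or {},
        "body": body.lstrip("\n"),
    }
```

With this in place, the report can diff `frontmatter["description"]` and `body` directly rather than guessing at line boundaries.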
```python
pos_val = max(1, int(len(positives) * val_fraction)) if positives else 0
neg_val = max(1, int(len(negatives) * val_fraction)) if negatives else 0
val = positives[:pos_val] + negatives[:neg_val]
train = positives[pos_val:] + negatives[neg_val:]
rng.shuffle(val)
```
split_triggers() can allocate all positives/negatives to val when counts are small (e.g., 1 positive => pos_val=1, leaving no positives in train). The GEPA targets use train_examples as the optimizer trainset, so an empty/degenerate train split can cause optimizer failures or no-op evolution. Adjust the split to guarantee a non-empty train set when examples exist (e.g., clamp *_val to len-1 and/or reuse train as val when too small).
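One clamping scheme that satisfies both guarantees might look like the sketch below; the function signature and names follow the review comment rather than the actual plugin source:

```python
import random


def split_triggers(positives, negatives, val_fraction=0.2, seed=0):
    """Train/val split that never leaves the train set empty.

    Hypothetical sketch of the suggested fix: clamp each class's val
    allocation to len - 1, and fall back to reusing train as val when
    there are too few examples to hold any out.
    """
    rng = random.Random(seed)

    def val_count(items):
        if not items:
            return 0
        n = max(1, int(len(items) * val_fraction))
        return min(n, len(items) - 1)  # keep at least one example in train

    pos_val = val_count(positives)
    neg_val = val_count(negatives)
    val = positives[:pos_val] + negatives[:neg_val]
    train = positives[pos_val:] + negatives[neg_val:]
    if not val:
        # Too few examples to hold any out: reuse train as val
        # rather than handing GEPA an empty evaluation set.
        val = list(train)
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val
```

With one positive and one negative, both land in train and val is a copy, so the optimizer trainset is never degenerate.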
```python
parts = [p for p in command.split() if p]
if not parts:
    return ConstraintResult("test_suite", True, "test command empty — skipped")
try:
    result = subprocess.run(
```
run_test_suite() splits the configured command with command.split(), which breaks quoted args and paths with spaces. Use shlex.split(command) (and a Windows-safe equivalent) or run with shell=True if shell semantics are acceptable.
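A minimal illustration of the shlex-based alternative (the helper name is hypothetical; only the splitting behavior is the point):

```python
import os
import shlex


def split_test_command(command: str) -> list[str]:
    # shlex.split preserves quoted arguments and paths with spaces,
    # unlike a naive str.split(). posix=False keeps backslashes (and
    # quote characters) intact, which is closer to Windows semantics.
    return shlex.split(command, posix=(os.name != "nt"))
```

For example, `pytest "tests/my tests" -k smoke` splits into four argv entries with the quoted path kept whole, whereas `str.split()` would break it into five.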
```markdown
The plugin declares its Python deps (`dspy`, `gepa`, `click`, `rich`,
`pyyaml`, `pydantic`) in `hybridclaw.plugin.yaml`. HybridClaw's plugin
loader provisions a per-plugin `.venv` on first use — preferring `uv` if
available, otherwise `python3 -m venv`. No global install required.
```
The docs state that the plugin loader provisions a per-plugin .venv “on first use”, but the current core flow provisions plugin venvs / installs pipDependencies as part of hybridclaw plugin install ... (with approvals), not automatically on first use. To avoid confusing setup failures, consider updating this section to explicitly instruct users to run hybridclaw plugin install ./plugins/skill-evolver (or plugin check + plugin install --yes) to provision the venv and install pip deps before running hybridclaw skill-evolver ....
- drop `traces` from sources array when observation count below threshold
- enforce absolute HYBRIDCLAW_DATA_DIR (matches core behavior)
- strip raw transcripts from persisted traces.json to avoid PII leak
- use shell:true for testCommand; repo-relative rollback path
- actionable error when plugin .venv is missing, pointing at install flow
- normalize CRLF in SKILL.md frontmatter parser
- fix transcript lookup to include `workspace/` segment
- add includeTranscripts:false fast path so `list` skips JSONL scans
- extract load_skill_from_string helper; use it in report diffs
- clamp val split to len-1 so GEPA trainset is never empty
- shlex.split for test command (safer than naive whitespace split)
- docs: clarify plugin install flow provisions the per-plugin venv

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
New bundled plugin at `plugins/skill-evolver/` that wraps DSPy + GEPA for two optimization targets (description / body / joint), three dataset sources (synthetic / golden / traces), a Rich-based TUI, a constraint-gated apply-and-PR pipeline, docs, and tests.

Change Type
Linked Context
Architecture
Two evolution targets

- `description` — evolves the `description:` field (trigger-routing fitness)
- `body` — evolves the skill body (executor/judge fitness)
- `both` — joint evolution (body then description)

Three dataset ingredients

- `synthetic` — LLM generates positives + adversarial negatives from the skill's own text
- `golden` — human-curated `datasets/skills/<skill>/{triggers,tasks}.json`
- `traces` — `skill_observations` rows in the HybridClaw SQLite DB joined with session transcripts to recover the originating user prompt

Safety gates
Before any variant is committed:
- `npm test` run against the applied variant, rolled back on failure
- `gh pr create` in `--open-pr` mode

Commands
Implementation notes
- Python deps are provisioned via the existing `pipDependencies` mechanism — matches how other Python-using plugins already work. No global install required.
- Two tools are registered (`skill_evolver_list`, `skill_evolver_extract`). Evolve calls out to the Python bridge, which supports both captured (`stdout`/`stderr` collection) and `stdio: 'inherit'` passthrough modes for interactive TUI commands.
- Lazy-loads `better-sqlite3` via `createRequire`, so plugin registration doesn't hard-fail in environments where the DB isn't built.

Validation
- `load_skill`/`reassemble` tests under `plugins/skill-evolver/tests/test_skill_module.py`

Docs And Config Impact
- `docs/content/extensibility/skill-evolver-plugin.md`

Risk Notes
- `better-sqlite3` → lazy `require` throws a clear error only when `extractTraces` is called; (3) test failure on applied variant → variant is rolled back before commit

Evidence
- `tests/skill-evolver-plugin.test.ts` (3 tests) + `plugins/skill-evolver/tests/test_skill_module.py` (3 tests)
- `docs/content/extensibility/skill-evolver-plugin.md`