
feat(plugin): skill-evolver — DSPy + GEPA evolution for SKILL.md #344

Open
furukama wants to merge 3 commits into main from feat/skill-evolver-plugin

Conversation

@furukama
Contributor

Summary

  • Problem: Existing adaptive-skills flow can only make conservative single-point edits. There's no offline, multi-iteration evolution loop that evaluates a SKILL.md against synthetic + golden + real-trace data and opens a reviewable PR.
  • Why it matters: Skill quality compounds — poor descriptions cause routing misses and poor bodies cause execution failures. A dedicated evolver lets operators run targeted optimization runs against concrete fitness signals before committing.
  • What changed: New bundled plugin at plugins/skill-evolver/ that wraps DSPy + GEPA for two optimization targets (description / body) plus a joint mode, three dataset sources (synthetic / golden / traces), a Rich-based TUI, a constraint-gated apply-and-PR pipeline, docs, and tests.
  • What did not change: Adaptive skills, the skill runtime, SKILL.md format, and existing plugin SDK are untouched. The new plugin opts in per-skill, per-run.

Change Type

  • Bug fix
  • Feature
  • Docs
  • Tests
  • Refactor required for the fix
  • Tooling or workflow
  • Security hardening

Linked Context

  • Closes #
  • Related #

Architecture

Two evolution targets

  • description — optimizable surface: frontmatter description: field; fitness: judge-LM trigger-classification F1 against the real installed skill pool
  • body — optimizable surface: SKILL.md markdown body; fitness: LLM-judge scoring (correctness/procedure/conciseness) of a follower LM executing tasks
  • both — optimizable surface: both; evolves body first, then description, then cross-validates for drift
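The description-target fitness can be sketched as a per-skill F1 over routed triggers. A minimal sketch, assuming a `route` callable standing in for the judge-LM call; the names and example shape here are illustrative, not the plugin's actual API:

```python
# Sketch: score how well a description routes triggers to this skill.
# `route(prompt)` stands in for the judge LM picking one installed skill.
def trigger_f1(examples, skill_name, route):
    """examples: list of (prompt, expected_skill) pairs."""
    tp = fp = fn = 0
    for prompt, expected in examples:
        predicted = route(prompt)
        if predicted == skill_name and expected == skill_name:
            tp += 1
        elif predicted == skill_name:
            fp += 1  # routed here, but belonged to another skill
        elif expected == skill_name:
            fn += 1  # belonged here, but routed elsewhere
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Adversarial negatives in the dataset matter here: without them, a maximally greedy description scores perfect recall and the F1 never penalizes over-triggering.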

Three dataset ingredients

  • synthetic — LLM generates positives + adversarial negatives from the skill's own text
  • golden — human-curated datasets/skills/<skill>/{triggers,tasks}.json
  • traces — skill_observations rows in the HybridClaw SQLite DB joined with session transcripts to recover the originating user prompt
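However the three sources are mixed, the merge step has to dedupe overlapping prompts (a golden trigger often reappears in traces). A minimal sketch of that idea, with an illustrative `prompt` field name that may not match the plugin's actual schema:

```python
# Sketch: concatenate synthetic/golden/trace examples in priority order,
# deduping by normalized prompt text so earlier sources win.
def merge_examples(*sources):
    seen, merged = set(), []
    for source in sources:
        for ex in source:
            key = ex["prompt"].strip().lower()
            if key in seen:
                continue
            seen.add(key)
            merged.append(ex)
    return merged
```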

Safety gates

Before any variant is committed:

  • body size cap (default 15 KB), description cap (default 1024 chars)
  • description non-empty and short-form
  • body keeps a top-level heading, no frontmatter leak
  • body may not grow more than 1.5× baseline
  • optional npm test run against the applied variant, rolled back on failure
  • optional gh pr create in --open-pr mode
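As a rough sketch, the gate amounts to a pure function from a candidate variant to a list of failures. Thresholds mirror the defaults listed above; the real logic lives in plugins/skill-evolver/python/skill_evolver/constraints.py and this is only an illustration of its shape:

```python
# Sketch of the constraint gate: return the list of failed checks,
# empty list meaning the variant may proceed to commit.
def gate_variant(description, body, baseline_body,
                 body_cap=15 * 1024, desc_cap=1024, growth_cap=1.5):
    failures = []
    if not description.strip():
        failures.append("description empty")
    if len(description) > desc_cap:
        failures.append("description over cap")
    if len(body.encode("utf-8")) > body_cap:
        failures.append("body over size cap")
    if not body.lstrip().startswith("# "):
        failures.append("body missing top-level heading")
    if body.lstrip().startswith("---"):
        failures.append("frontmatter leaked into body")
    if len(body) > growth_cap * max(1, len(baseline_body)):
        failures.append("body grew more than 1.5x baseline")
    return failures
```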

Commands

hybridclaw skill-evolver list                                 # rank skills by failure rate
hybridclaw skill-evolver extract <skill>                      # write datasets/skills/<skill>/traces.json
hybridclaw skill-evolver evolve <skill> --target description  # (target required, no default)
hybridclaw skill-evolver evolve <skill> --target body --iterations 20
hybridclaw skill-evolver evolve <skill> --target both --open-pr
hybridclaw skill-evolver show <skill>                         # Rich-rendered report of last run
hybridclaw skill-evolver watch <skill>                        # live dashboard while running
hybridclaw skill-evolver tui                                  # interactive skill browser

Implementation notes

  • Python sidecar runs in a per-plugin venv auto-provisioned via existing pipDependencies mechanism — matches how other Python-using plugins already work. No global install required.
  • TS plugin exposes a command + two tools (skill_evolver_list, skill_evolver_extract). Evolve calls out to the Python bridge, which supports both captured (stdout/stderr collection) and stdio: 'inherit' passthrough modes for interactive TUI commands.
  • Trace extraction lazy-loads better-sqlite3 via createRequire, so plugin registration doesn't hard-fail in environments where the DB isn't built.
  • Frontmatter is preserved on reassembly (tags, category, etc. survive); only the targeted field(s) are mutated.
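The frontmatter-preserving reassembly can be illustrated with a toy parser that only understands flat `key: value` lines. The real skill_module.py handles full YAML; this is a sketch of the contract (mutate only the targeted field, re-emit the rest byte-for-byte), not the implementation:

```python
# Sketch: split SKILL.md, swap only the targeted surface, re-emit the rest.
def split_skill(raw):
    """Split SKILL.md into (frontmatter_lines, body)."""
    assert raw.startswith("---\n"), "expected frontmatter delimiter"
    fm, body = raw[4:].split("\n---\n", 1)
    return fm.split("\n"), body.lstrip("\n")

def reassemble(raw, description=None, body=None):
    fm_lines, old_body = split_skill(raw)
    if description is not None:
        # Replace only the description line; tags, category, etc. pass through.
        fm_lines = [
            f"description: {description}" if ln.startswith("description:") else ln
            for ln in fm_lines
        ]
    new_body = body if body is not None else old_body
    return "---\n" + "\n".join(fm_lines) + "\n---\n\n" + new_body
```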

Validation

npx vitest run tests/skill-evolver-plugin.test.ts
  • 3 TS tests pass (skill-locator parses the three search roots, returns null for missing, plugin registers a command and two tools)
  • Python round-trip tests for load_skill / reassemble under plugins/skill-evolver/tests/test_skill_module.py
  • Verified manually: plugin file tree layout, Python module import graph, dispatcher subcommand coverage
  • Edge cases checked: frontmatter with nested YAML lists round-trips, reassemble with override description/body preserves unrelated frontmatter
  • Skipped checks and why: full evolve run not exercised in CI (requires LLM keys + GEPA) — gated by the plugin venv + config

Docs And Config Impact

  • README, docs, or examples updated → docs/content/extensibility/skill-evolver-plugin.md
  • Config or environment behavior changed
  • Templates or workspace bootstrap files changed
  • No docs or config impact

Risk Notes

  • Security-sensitive paths touched? No — plugin runs in the existing gateway sandbox; PR creation uses gh CLI with the operator's own auth
  • Gateway, audit, approval, or container boundaries touched? No — new plugin only; does not modify core plugin loader, sandbox, or audit flow
  • Failure modes: (1) missing API keys → DSPy/GEPA errors surface in the feedback trail; (2) missing better-sqlite3 → lazy require throws a clear error only when extractTraces is called; (3) test failure on applied variant → variant is rolled back before commit

Evidence

  • New test coverage: tests/skill-evolver-plugin.test.ts (3 tests) + plugins/skill-evolver/tests/test_skill_module.py (3 tests)
  • Docs page: docs/content/extensibility/skill-evolver-plugin.md

Benedikt Koehler and others added 2 commits April 17, 2026 14:28
Bundled plugin at plugins/skill-evolver that evolves SKILL.md files along
two distinct optimization targets:

- description: trigger-classification F1 via a judge LM picking among the
  real installed skill pool
- body: LLM-judge scoring (correctness/procedure/conciseness) of a
  follower LM executing the skill against synthesized or curated tasks

Evaluation data is assembled from three sources (synthetic, golden, traces)
matching the hermes-agent pattern, with traces mined from the HybridClaw
SQLite db (skill_observations) joined with session transcripts.

TS/Node plugin surface registers a skill-evolver command (list, extract,
evolve, preview, show, watch, tui) plus two tools (skill_evolver_list and
skill_evolver_extract). Evolution candidates pass a constraint gate
(size/shape/growth cap + optional test run) before a branch is created and
a PR is optionally opened via gh.

Python sidecar runs DSPy + GEPA in a per-plugin venv (auto-provisioned via
the existing pipDependencies mechanism). Rich-based TUI offers an
interactive skill browser plus live-refresh dashboards for the show and
watch subcommands.

Docs live at docs/content/extensibility/skill-evolver-plugin.md.
Tests cover skill-locator parsing, plugin registration shape, and the
SKILL.md frontmatter round-trip in Python.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 12:30

Copilot AI left a comment


Pull request overview

Adds a new bundled “skill-evolver” plugin that evolves a skill’s SKILL.md (description and/or body) using a Python sidecar (DSPy + GEPA), builds evaluation datasets from synthetic/golden/trace sources, and optionally applies the best variant and opens a PR.

Changes:

  • Introduces the plugins/skill-evolver/ plugin: TS command + tools plus a Python evolution runner (targets, datasets, constraints, Rich TUI/reporting).
  • Implements trace extraction from the HybridClaw SQLite DB plus session transcripts to build trace-based datasets.
  • Adds docs and unit tests covering TS skill location/plugin registration and Python SKILL.md parsing/reassembly.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 13 comments.

Summary per file:

  • tests/skill-evolver-plugin.test.ts — Vitest coverage for TS skill locator + plugin registration.
  • plugins/skill-evolver/tests/test_skill_module.py — Python tests for SKILL.md parse/reassemble round-trips.
  • plugins/skill-evolver/src/trace-extractor.js — SQLite + transcript scanning to build trace datasets.
  • plugins/skill-evolver/src/skill-locator.js — Locates and parses skills across repo roots.
  • plugins/skill-evolver/src/python-bridge.js — Spawns Python module, captures output, and bootstraps local package.
  • plugins/skill-evolver/src/index.js — Plugin entrypoint: command dispatcher + tool registrations.
  • plugins/skill-evolver/src/apply-variant.js — Applies an evolved SKILL.md, runs tests, commits, optionally opens PR.
  • plugins/skill-evolver/python/skill_evolver/tui.py — Rich TUI: show/watch/browse runs and skills.
  • plugins/skill-evolver/python/skill_evolver/targets/joint.py — Joint evolution strategy (body then description + cross-validation).
  • plugins/skill-evolver/python/skill_evolver/targets/description.py — Description evolution target (trigger routing fitness).
  • plugins/skill-evolver/python/skill_evolver/targets/body.py — Body evolution target (executor/judge fitness).
  • plugins/skill-evolver/python/skill_evolver/targets/__init__.py — Target package marker.
  • plugins/skill-evolver/python/skill_evolver/skill_module.py — SKILL.md parsing + frontmatter-preserving reassembly.
  • plugins/skill-evolver/python/skill_evolver/report.py — Markdown report rendering (diffs + scores + gates).
  • plugins/skill-evolver/python/skill_evolver/fitness/executor.py — Body fitness: follower LM + judge rubric scoring.
  • plugins/skill-evolver/python/skill_evolver/fitness/classifier.py — Description fitness: routing classifier scoring (F1).
  • plugins/skill-evolver/python/skill_evolver/fitness/__init__.py — Fitness package marker.
  • plugins/skill-evolver/python/skill_evolver/evolve.py — Top-level evolution orchestrator writing result.json.
  • plugins/skill-evolver/python/skill_evolver/dataset/traces.py — Builds trigger/task examples from extracted trace payloads.
  • plugins/skill-evolver/python/skill_evolver/dataset/synthetic.py — LLM-generated synthetic triggers/tasks.
  • plugins/skill-evolver/python/skill_evolver/dataset/merge.py — Dedupe + train/val split utilities.
  • plugins/skill-evolver/python/skill_evolver/dataset/golden.py — Loads committed “golden” datasets from datasets/skills/....
  • plugins/skill-evolver/python/skill_evolver/dataset/__init__.py — Dataset package marker.
  • plugins/skill-evolver/python/skill_evolver/constraints.py — Hard constraint gates (size/shape/growth/test suite).
  • plugins/skill-evolver/python/skill_evolver/cli.py — Python CLI (python -m skill_evolver ...).
  • plugins/skill-evolver/python/skill_evolver/__main__.py — Python module entrypoint.
  • plugins/skill-evolver/python/skill_evolver/__init__.py — Python package metadata/version.
  • plugins/skill-evolver/pyproject.toml — Python packaging metadata/dependencies.
  • plugins/skill-evolver/package.json — Plugin package metadata (Node engine + module type).
  • plugins/skill-evolver/hybridclaw.plugin.yaml — Plugin manifest: creds, pip deps, external deps, config schema.
  • docs/content/extensibility/skill-evolver-plugin.md — User documentation for the new plugin and workflow.


Comment on lines +190 to +194
api?.logger?.info(
{
skill: skill.name,
observations: traces.observations.length,
required: config.minTraceObservations,

Copilot AI Apr 17, 2026


When trace observations are below minTraceObservations, the code logs that it is “dropping traces source”, but sources is not updated (only tracesDatasetPath remains unset). This makes the run metadata/reporting misleading because --sources still includes traces. Consider removing 'traces' from sources in this branch so the recorded sources reflect what was actually used.

Comment on lines +10 to +13
if (envDir && path.isAbsolute(envDir)) {
return path.join(envDir, 'data');
}
return path.join(os.homedir(), '.hybridclaw', 'data');

Copilot AI Apr 17, 2026


HYBRIDCLAW_DATA_DIR is treated as a runtime home dir in core code and must be absolute (core throws otherwise). Here, a relative value is silently ignored and defaults to ~/.hybridclaw/data, which can cause extraction to target an unexpected location. Consider matching core behavior and throwing when set but not absolute.

Suggested change:

- if (envDir && path.isAbsolute(envDir)) {
-   return path.join(envDir, 'data');
- }
- return path.join(os.homedir(), '.hybridclaw', 'data');
+ if (!envDir) {
+   return path.join(os.homedir(), '.hybridclaw', 'data');
+ }
+ if (!path.isAbsolute(envDir)) {
+   throw new Error('HYBRIDCLAW_DATA_DIR must be an absolute path');
+ }
+ return path.join(envDir, 'data');

Comment on lines +194 to +198
skillName,
observations: enriched,
otherSkillObservations: enrichedOthers,
transcripts,
};

Copilot AI Apr 17, 2026


extractTraces() includes raw transcripts in the returned payload. Since handleExtract() persists the entire payload to datasets/skills/<skill>/traces.json, this can write sensitive user transcript content/PII into a repo-local file (easy to accidentally commit). Consider omitting transcripts from the serialized dataset by default (persist only derived user_prompt + minimal observation fields), or gating transcript persistence behind an explicit opt-in.

Comment on lines +16 to +21
const [command, ...args] = cmd.split(/\s+/).filter(Boolean);
if (!command) throw new Error('Empty shell command');
const result = spawnSync(command, args, {
cwd,
encoding: 'utf-8',
env: process.env,

Copilot AI Apr 17, 2026


runShell() splits testCommand on whitespace, which breaks quoted arguments and paths with spaces. Since testCommand is user-configurable, parse it with a shell-aware splitter or execute via spawnSync(..., { shell: true }) to preserve quoting semantics.

Suggested change:

- const [command, ...args] = cmd.split(/\s+/).filter(Boolean);
+ const command = String(cmd || '').trim();
  if (!command) throw new Error('Empty shell command');
- const result = spawnSync(command, args, {
+ const result = spawnSync(command, {
    cwd,
    encoding: 'utf-8',
    env: process.env,
+   shell: true,

Comment on lines +18 to +22
function resolvePython() {
const venv = venvPythonPath();
if (fs.existsSync(venv)) return venv;
const sys = process.platform === 'win32' ? 'py' : 'python3';
return sys;

Copilot AI Apr 17, 2026


When the plugin .venv is missing, resolvePython() falls back to system python3/py. In that mode, ensurePackageInstalled() installs the local package with --no-deps, so required deps (dspy, gepa, etc.) will often be missing and the subsequent python -m skill_evolver run will fail in a confusing way. Consider treating missing .venv as an actionable error (e.g., instruct to run hybridclaw plugin install ./plugins/skill-evolver --yes), or proactively create the venv and install declared dependencies before running.

Suggested change:

- function resolvePython() {
-   const venv = venvPythonPath();
-   if (fs.existsSync(venv)) return venv;
-   const sys = process.platform === 'win32' ? 'py' : 'python3';
-   return sys;
+ function missingPluginVenvError(venv) {
+   return new Error(
+     `Skill Evolver plugin environment is missing: ${venv}\n` +
+       'Run `hybridclaw plugin install ./plugins/skill-evolver --yes` to create the plugin virtual environment and install its dependencies.',
+   );
+ }
+ function resolvePython() {
+   const venv = venvPythonPath();
+   if (fs.existsSync(venv)) return venv;
+   throw missingPluginVenvError(venv);

skillName: skill.name,
repoRoot,
limit: 500,
includeOtherSkills: false,

Copilot AI Apr 17, 2026


handleList() computes failure rates by calling extractTraces() for each skill. extractTraces() opens SQLite and scans transcript files, so this becomes very slow on real installs. Consider a lightweight aggregate query grouped by skill_name (no transcript loading), or add an extractTraces({ includeTranscripts: false }) mode and use it here.

Suggested change:

  includeOtherSkills: false,
+ includeTranscripts: false,

Comment on lines +64 to +68
best_raw = result.get("bestVariantRaw")
if best_raw:
from skill_evolver.skill_module import load_skill
from io import StringIO # noqa: F401 - kept for parity with SKILL parser imports


Copilot AI Apr 17, 2026


Report generation parses bestVariantRaw with string splitting and a naive description: line extraction. This will mis-handle valid YAML (quoted values, |- blocks, multiline descriptions, reordered keys) and can produce misleading diffs in the PR/report. Prefer parsing bestVariantRaw with the same SKILL.md parser used elsewhere (e.g., factor a load_skill_from_string() helper in skill_module.py) and diff parsed description/body.
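A sketch of the factored helper this comment asks for. The stand-in below only strips surrounding quotes, enough to show why naive `description:` line splitting breaks on quoted values; a production version would reuse the plugin's real YAML parsing (names here are illustrative):

```python
# Sketch: one string-level SKILL.md parser shared by file loading and
# report diffing, instead of ad-hoc line splitting.
def load_skill_from_string(raw):
    """Return (frontmatter_dict, body) from raw SKILL.md text."""
    if not raw.startswith("---\n"):
        return {}, raw
    fm_text, body = raw[4:].split("\n---\n", 1)
    fm = {}
    for line in fm_text.split("\n"):
        # Top-level `key: value` only; indented continuation lines are skipped.
        if ":" in line and not line.startswith((" ", "\t")):
            key, _, value = line.partition(":")
            fm[key.strip()] = value.strip().strip("'\"")
    return fm, body.lstrip("\n")
```

The report can then diff `parsed["description"]` for baseline vs. variant rather than scanning raw lines.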

Comment on lines +64 to +68
pos_val = max(1, int(len(positives) * val_fraction)) if positives else 0
neg_val = max(1, int(len(negatives) * val_fraction)) if negatives else 0
val = positives[:pos_val] + negatives[:neg_val]
train = positives[pos_val:] + negatives[neg_val:]
rng.shuffle(val)

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split_triggers() can allocate all positives/negatives to val when counts are small (e.g., 1 positive => pos_val=1, leaving no positives in train). The GEPA targets use train_examples as the optimizer trainset, so an empty/degenerate train split can cause optimizer failures or no-op evolution. Adjust the split to guarantee a non-empty train set when examples exist (e.g., clamp *_val to len-1 and/or reuse train as val when too small).
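The suggested clamp can be sketched per pool (an illustrative rewrite, not the plugin's actual split_triggers):

```python
# Sketch: never let the validation slice consume the whole pool, so the
# GEPA trainset stays non-empty whenever examples exist.
def split_pool(items, val_fraction=0.2):
    """Return (train, val); val is clamped so train keeps >= 1 item."""
    if not items:
        return [], []
    n_val = max(1, int(len(items) * val_fraction))
    n_val = min(n_val, len(items) - 1)  # leave at least one for train
    if n_val == 0:
        # Single example: reuse it for both splits rather than starve train.
        return list(items), list(items)
    return items[n_val:], items[:n_val]
```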

Comment on lines +99 to +103
parts = [p for p in command.split() if p]
if not parts:
return ConstraintResult("test_suite", True, "test command empty — skipped")
try:
result = subprocess.run(

Copilot AI Apr 17, 2026


run_test_suite() splits the configured command with command.split(), which breaks quoted args and paths with spaces. Use shlex.split(command) (and a Windows-safe equivalent) or run with shell=True if shell semantics are acceptable.
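A sketch of the shlex-based fix (shlex is stdlib; posix=True keeps Unix quoting semantics, so a Windows build would still need its own handling):

```python
import shlex

# Sketch: quote-aware parsing for the configured test command, so
# `npm test -- --grep "slow suite"` keeps its quoted argument intact.
def parse_test_command(command):
    parts = shlex.split(command, posix=True)
    if not parts:
        return None  # empty command: caller skips the test-suite gate
    return parts
```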

Comment on lines +111 to +114
The plugin declares its Python deps (`dspy`, `gepa`, `click`, `rich`,
`pyyaml`, `pydantic`) in `hybridclaw.plugin.yaml`. HybridClaw's plugin
loader provisions a per-plugin `.venv` on first use — preferring `uv` if
available, otherwise `python3 -m venv`. No global install required.

Copilot AI Apr 17, 2026


The docs state that the plugin loader provisions a per-plugin .venv “on first use”, but the current core flow provisions plugin venvs / installs pipDependencies as part of hybridclaw plugin install ... (with approvals), not automatically on first use. To avoid confusing setup failures, consider updating this section to explicitly instruct users to run hybridclaw plugin install ./plugins/skill-evolver (or plugin check + plugin install --yes) to provision the venv and install pip deps before running hybridclaw skill-evolver ....

- drop `traces` from sources array when observation count below threshold
- enforce absolute HYBRIDCLAW_DATA_DIR (matches core behavior)
- strip raw transcripts from persisted traces.json to avoid PII leak
- use shell:true for testCommand; repo-relative rollback path
- actionable error when plugin .venv is missing, pointing at install flow
- normalize CRLF in SKILL.md frontmatter parser
- fix transcript lookup to include `workspace/` segment
- add includeTranscripts:false fast path so `list` skips JSONL scans
- extract load_skill_from_string helper; use it in report diffs
- clamp val split to len-1 so GEPA trainset is never empty
- shlex.split for test command (safer than naive whitespace split)
- docs: clarify plugin install flow provisions the per-plugin venv

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>