
feat(plugin): skill-evolver — DSPy + GEPA evolution for SKILL.md #344

Open
furukama wants to merge 3 commits into main from feat/skill-evolver-plugin

Conversation

@furukama
Contributor

Summary

  • Problem: Existing adaptive-skills flow can only make conservative single-point edits. There's no offline, multi-iteration evolution loop that evaluates a SKILL.md against synthetic + golden + real-trace data and opens a reviewable PR.
  • Why it matters: Skill quality compounds — poor descriptions cause routing misses and poor bodies cause execution failures. A dedicated evolver lets operators run targeted optimization runs against concrete fitness signals before committing.
  • What changed: New bundled plugin at plugins/skill-evolver/ that wraps DSPy + GEPA for two optimization targets (description / body) plus a joint mode, three dataset sources (synthetic / golden / traces), a Rich-based TUI, a constraint-gated apply-and-PR pipeline, docs, and tests.
  • What did not change: Adaptive skills, the skill runtime, SKILL.md format, and existing plugin SDK are untouched. The new plugin opts in per-skill, per-run.

Change Type

  • Bug fix
  • Feature
  • Docs
  • Tests
  • Refactor required for the fix
  • Tooling or workflow
  • Security hardening

Linked Context

  • Closes #
  • Related #

Architecture

Two evolution targets

  • description — optimizable surface: frontmatter description: field; fitness: judge-LM trigger-classification F1 against the real installed skill pool
  • body — optimizable surface: SKILL.md markdown body; fitness: LLM-judge scoring (correctness/procedure/conciseness) of a follower LM executing tasks
  • both — optimizable surface: both; evolves body first, then description, then cross-validates for drift
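The description-target fitness can be sketched as a per-skill F1 over routed triggers. A minimal sketch, assuming a `route` callable standing in for the judge-LM call; the names and example shape here are illustrative, not the plugin's actual API:

```python
# Sketch: score how well a description routes triggers to this skill.
# `route(prompt)` stands in for the judge LM picking one installed skill.
def trigger_f1(examples, skill_name, route):
    """examples: list of (prompt, expected_skill) pairs."""
    tp = fp = fn = 0
    for prompt, expected in examples:
        predicted = route(prompt)
        if predicted == skill_name and expected == skill_name:
            tp += 1
        elif predicted == skill_name:
            fp += 1  # routed here, but belonged to another skill
        elif expected == skill_name:
            fn += 1  # belonged here, but routed elsewhere
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Adversarial negatives in the dataset matter here: without them, a maximally greedy description scores perfect recall and the F1 never penalizes over-triggering.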

Three dataset ingredients

  • synthetic — LLM generates positives + adversarial negatives from the skill's own text
  • golden — human-curated datasets/skills/<skill>/{triggers,tasks}.json
  • traces — skill_observations rows in the HybridClaw SQLite DB joined with session transcripts to recover the originating user prompt
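However the three sources are mixed, the merge step has to dedupe overlapping prompts (a golden trigger often reappears in traces). A minimal sketch of that idea, with an illustrative `prompt` field name that may not match the plugin's actual schema:

```python
# Sketch: concatenate synthetic/golden/trace examples in priority order,
# deduping by normalized prompt text so earlier sources win.
def merge_examples(*sources):
    seen, merged = set(), []
    for source in sources:
        for ex in source:
            key = ex["prompt"].strip().lower()
            if key in seen:
                continue
            seen.add(key)
            merged.append(ex)
    return merged
```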

Safety gates

Before any variant is committed:

  • body size cap (default 15 KB), description cap (default 1024 chars)
  • description non-empty and short-form
  • body keeps a top-level heading, no frontmatter leak
  • body may not grow more than 1.5× baseline
  • optional npm test run against the applied variant, rolled back on failure
  • optional gh pr create in --open-pr mode
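As a rough sketch, the gate amounts to a pure function from a candidate variant to a list of failures. Thresholds mirror the defaults listed above; the real logic lives in plugins/skill-evolver/python/skill_evolver/constraints.py and this is only an illustration of its shape:

```python
# Sketch of the constraint gate: return the list of failed checks,
# empty list meaning the variant may proceed to commit.
def gate_variant(description, body, baseline_body,
                 body_cap=15 * 1024, desc_cap=1024, growth_cap=1.5):
    failures = []
    if not description.strip():
        failures.append("description empty")
    if len(description) > desc_cap:
        failures.append("description over cap")
    if len(body.encode("utf-8")) > body_cap:
        failures.append("body over size cap")
    if not body.lstrip().startswith("# "):
        failures.append("body missing top-level heading")
    if body.lstrip().startswith("---"):
        failures.append("frontmatter leaked into body")
    if len(body) > growth_cap * max(1, len(baseline_body)):
        failures.append("body grew more than 1.5x baseline")
    return failures
```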

Commands

hybridclaw skill-evolver list                                 # rank skills by failure rate
hybridclaw skill-evolver extract <skill>                      # write datasets/skills/<skill>/traces.json
hybridclaw skill-evolver evolve <skill> --target description  # (target required, no default)
hybridclaw skill-evolver evolve <skill> --target body --iterations 20
hybridclaw skill-evolver evolve <skill> --target both --open-pr
hybridclaw skill-evolver show <skill>                         # Rich-rendered report of last run
hybridclaw skill-evolver watch <skill>                        # live dashboard while running
hybridclaw skill-evolver tui                                  # interactive skill browser

Implementation notes

  • Python sidecar runs in a per-plugin venv auto-provisioned via existing pipDependencies mechanism — matches how other Python-using plugins already work. No global install required.
  • TS plugin exposes a command + two tools (skill_evolver_list, skill_evolver_extract). Evolve calls out to the Python bridge, which supports both captured (stdout/stderr collection) and stdio: 'inherit' passthrough modes for interactive TUI commands.
  • Trace extraction lazy-loads better-sqlite3 via createRequire, so plugin registration doesn't hard-fail in environments where the DB isn't built.
  • Frontmatter is preserved on reassembly (tags, category, etc. survive); only the targeted field(s) are mutated.
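The frontmatter-preserving reassembly can be illustrated with a toy parser that only understands flat `key: value` lines. The real skill_module.py handles full YAML; this is a sketch of the contract (mutate only the targeted field, re-emit the rest byte-for-byte), not the implementation:

```python
# Sketch: split SKILL.md, swap only the targeted surface, re-emit the rest.
def split_skill(raw):
    """Split SKILL.md into (frontmatter_lines, body)."""
    assert raw.startswith("---\n"), "expected frontmatter delimiter"
    fm, body = raw[4:].split("\n---\n", 1)
    return fm.split("\n"), body.lstrip("\n")

def reassemble(raw, description=None, body=None):
    fm_lines, old_body = split_skill(raw)
    if description is not None:
        # Replace only the description line; tags, category, etc. pass through.
        fm_lines = [
            f"description: {description}" if ln.startswith("description:") else ln
            for ln in fm_lines
        ]
    new_body = body if body is not None else old_body
    return "---\n" + "\n".join(fm_lines) + "\n---\n\n" + new_body
```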

Validation

npx vitest run tests/skill-evolver-plugin.test.ts
  • 3 TS tests pass (skill-locator parses the three search roots, returns null for missing, plugin registers a command and two tools)
  • Python round-trip tests for load_skill / reassemble under plugins/skill-evolver/tests/test_skill_module.py
  • Verified manually: plugin file tree layout, Python module import graph, dispatcher subcommand coverage
  • Edge cases checked: frontmatter with nested YAML lists round-trips, reassemble with override description/body preserves unrelated frontmatter
  • Skipped checks and why: full evolve run not exercised in CI (requires LLM keys + GEPA) — gated by the plugin venv + config

Docs And Config Impact

  • README, docs, or examples updated → docs/content/extensibility/skill-evolver-plugin.md
  • Config or environment behavior changed
  • Templates or workspace bootstrap files changed
  • No docs or config impact

Risk Notes

  • Security-sensitive paths touched? No — plugin runs in the existing gateway sandbox; PR creation uses gh CLI with the operator's own auth
  • Gateway, audit, approval, or container boundaries touched? No — new plugin only; does not modify core plugin loader, sandbox, or audit flow
  • Failure modes: (1) missing API keys → DSPy/GEPA errors surface in the feedback trail; (2) missing better-sqlite3 → lazy require throws a clear error only when extractTraces is called; (3) test failure on applied variant → variant is rolled back before commit

Evidence

  • New test coverage: tests/skill-evolver-plugin.test.ts (3 tests) + plugins/skill-evolver/tests/test_skill_module.py (3 tests)
  • Docs page: docs/content/extensibility/skill-evolver-plugin.md

Benedikt Koehler and others added 2 commits April 17, 2026 14:28
Bundled plugin at plugins/skill-evolver that evolves SKILL.md files along
two distinct optimization targets:

- description: trigger-classification F1 via a judge LM picking among the
  real installed skill pool
- body: LLM-judge scoring (correctness/procedure/conciseness) of a
  follower LM executing the skill against synthesized or curated tasks

Evaluation data is assembled from three sources (synthetic, golden, traces)
matching the hermes-agent pattern, with traces mined from the HybridClaw
SQLite db (skill_observations) joined with session transcripts.

TS/Node plugin surface registers a skill-evolver command (list, extract,
evolve, preview, show, watch, tui) plus two tools (skill_evolver_list and
skill_evolver_extract). Evolution candidates pass a constraint gate
(size/shape/growth cap + optional test run) before a branch is created and
a PR is optionally opened via gh.

Python sidecar runs DSPy + GEPA in a per-plugin venv (auto-provisioned via
the existing pipDependencies mechanism). Rich-based TUI offers an
interactive skill browser plus live-refresh dashboards for the show and
watch subcommands.

Docs live at docs/content/extensibility/skill-evolver-plugin.md.
Tests cover skill-locator parsing, plugin registration shape, and the
SKILL.md frontmatter round-trip in Python.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 12:30

Copilot AI left a comment


Pull request overview

Adds a new bundled “skill-evolver” plugin that evolves a skill’s SKILL.md (description and/or body) using a Python sidecar (DSPy + GEPA), builds evaluation datasets from synthetic/golden/trace sources, and optionally applies the best variant and opens a PR.

Changes:

  • Introduces the plugins/skill-evolver/ plugin: TS command + tools plus a Python evolution runner (targets, datasets, constraints, Rich TUI/reporting).
  • Implements trace extraction from the HybridClaw SQLite DB plus session transcripts to build trace-based datasets.
  • Adds docs and unit tests covering TS skill location/plugin registration and Python SKILL.md parsing/reassembly.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 13 comments.

Summary per file:

  • tests/skill-evolver-plugin.test.ts — Vitest coverage for TS skill locator + plugin registration.
  • plugins/skill-evolver/tests/test_skill_module.py — Python tests for SKILL.md parse/reassemble round-trips.
  • plugins/skill-evolver/src/trace-extractor.js — SQLite + transcript scanning to build trace datasets.
  • plugins/skill-evolver/src/skill-locator.js — Locates and parses skills across repo roots.
  • plugins/skill-evolver/src/python-bridge.js — Spawns Python module, captures output, and bootstraps local package.
  • plugins/skill-evolver/src/index.js — Plugin entrypoint: command dispatcher + tool registrations.
  • plugins/skill-evolver/src/apply-variant.js — Applies an evolved SKILL.md, runs tests, commits, optionally opens PR.
  • plugins/skill-evolver/python/skill_evolver/tui.py — Rich TUI: show/watch/browse runs and skills.
  • plugins/skill-evolver/python/skill_evolver/targets/joint.py — Joint evolution strategy (body then description + cross-validation).
  • plugins/skill-evolver/python/skill_evolver/targets/description.py — Description evolution target (trigger routing fitness).
  • plugins/skill-evolver/python/skill_evolver/targets/body.py — Body evolution target (executor/judge fitness).
  • plugins/skill-evolver/python/skill_evolver/targets/__init__.py — Target package marker.
  • plugins/skill-evolver/python/skill_evolver/skill_module.py — SKILL.md parsing + frontmatter-preserving reassembly.
  • plugins/skill-evolver/python/skill_evolver/report.py — Markdown report rendering (diffs + scores + gates).
  • plugins/skill-evolver/python/skill_evolver/fitness/executor.py — Body fitness: follower LM + judge rubric scoring.
  • plugins/skill-evolver/python/skill_evolver/fitness/classifier.py — Description fitness: routing classifier scoring (F1).
  • plugins/skill-evolver/python/skill_evolver/fitness/__init__.py — Fitness package marker.
  • plugins/skill-evolver/python/skill_evolver/evolve.py — Top-level evolution orchestrator writing result.json.
  • plugins/skill-evolver/python/skill_evolver/dataset/traces.py — Builds trigger/task examples from extracted trace payloads.
  • plugins/skill-evolver/python/skill_evolver/dataset/synthetic.py — LLM-generated synthetic triggers/tasks.
  • plugins/skill-evolver/python/skill_evolver/dataset/merge.py — Dedupe + train/val split utilities.
  • plugins/skill-evolver/python/skill_evolver/dataset/golden.py — Loads committed “golden” datasets from datasets/skills/....
  • plugins/skill-evolver/python/skill_evolver/dataset/__init__.py — Dataset package marker.
  • plugins/skill-evolver/python/skill_evolver/constraints.py — Hard constraint gates (size/shape/growth/test suite).
  • plugins/skill-evolver/python/skill_evolver/cli.py — Python CLI (python -m skill_evolver ...).
  • plugins/skill-evolver/python/skill_evolver/__main__.py — Python module entrypoint.
  • plugins/skill-evolver/python/skill_evolver/__init__.py — Python package metadata/version.
  • plugins/skill-evolver/pyproject.toml — Python packaging metadata/dependencies.
  • plugins/skill-evolver/package.json — Plugin package metadata (Node engine + module type).
  • plugins/skill-evolver/hybridclaw.plugin.yaml — Plugin manifest: creds, pip deps, external deps, config schema.
  • docs/content/extensibility/skill-evolver-plugin.md — User documentation for the new plugin and workflow.


Comment on lines +190 to +194
api?.logger?.info(
{
skill: skill.name,
observations: traces.observations.length,
required: config.minTraceObservations,

Copilot AI Apr 17, 2026


When trace observations are below minTraceObservations, the code logs that it is “dropping traces source”, but sources is not updated (only tracesDatasetPath remains unset). This makes the run metadata/reporting misleading because --sources still includes traces. Consider removing 'traces' from sources in this branch so the recorded sources reflect what was actually used.

Comment on lines +10 to +13
if (envDir && path.isAbsolute(envDir)) {
return path.join(envDir, 'data');
}
return path.join(os.homedir(), '.hybridclaw', 'data');

Copilot AI Apr 17, 2026


HYBRIDCLAW_DATA_DIR is treated as a runtime home dir in core code and must be absolute (core throws otherwise). Here, a relative value is silently ignored and defaults to ~/.hybridclaw/data, which can cause extraction to target an unexpected location. Consider matching core behavior and throwing when set but not absolute.

Suggested change:

- if (envDir && path.isAbsolute(envDir)) {
-   return path.join(envDir, 'data');
- }
- return path.join(os.homedir(), '.hybridclaw', 'data');
+ if (!envDir) {
+   return path.join(os.homedir(), '.hybridclaw', 'data');
+ }
+ if (!path.isAbsolute(envDir)) {
+   throw new Error('HYBRIDCLAW_DATA_DIR must be an absolute path');
+ }
+ return path.join(envDir, 'data');

Comment on lines +194 to +198
skillName,
observations: enriched,
otherSkillObservations: enrichedOthers,
transcripts,
};

Copilot AI Apr 17, 2026


extractTraces() includes raw transcripts in the returned payload. Since handleExtract() persists the entire payload to datasets/skills/<skill>/traces.json, this can write sensitive user transcript content/PII into a repo-local file (easy to accidentally commit). Consider omitting transcripts from the serialized dataset by default (persist only derived user_prompt + minimal observation fields), or gating transcript persistence behind an explicit opt-in.

Comment on lines +16 to +21
const [command, ...args] = cmd.split(/\s+/).filter(Boolean);
if (!command) throw new Error('Empty shell command');
const result = spawnSync(command, args, {
cwd,
encoding: 'utf-8',
env: process.env,

Copilot AI Apr 17, 2026


runShell() splits testCommand on whitespace, which breaks quoted arguments and paths with spaces. Since testCommand is user-configurable, parse it with a shell-aware splitter or execute via spawnSync(..., { shell: true }) to preserve quoting semantics.

Suggested change:

- const [command, ...args] = cmd.split(/\s+/).filter(Boolean);
+ const command = String(cmd || '').trim();
  if (!command) throw new Error('Empty shell command');
- const result = spawnSync(command, args, {
+ const result = spawnSync(command, {
    cwd,
    encoding: 'utf-8',
    env: process.env,
+   shell: true,

Comment on lines +18 to +22
function resolvePython() {
const venv = venvPythonPath();
if (fs.existsSync(venv)) return venv;
const sys = process.platform === 'win32' ? 'py' : 'python3';
return sys;

Copilot AI Apr 17, 2026


When the plugin .venv is missing, resolvePython() falls back to system python3/py. In that mode, ensurePackageInstalled() installs the local package with --no-deps, so required deps (dspy, gepa, etc.) will often be missing and the subsequent python -m skill_evolver run will fail in a confusing way. Consider treating missing .venv as an actionable error (e.g., instruct to run hybridclaw plugin install ./plugins/skill-evolver --yes), or proactively create the venv and install declared dependencies before running.

Suggested change:

- function resolvePython() {
-   const venv = venvPythonPath();
-   if (fs.existsSync(venv)) return venv;
-   const sys = process.platform === 'win32' ? 'py' : 'python3';
-   return sys;
+ function missingPluginVenvError(venv) {
+   return new Error(
+     `Skill Evolver plugin environment is missing: ${venv}\n` +
+       'Run `hybridclaw plugin install ./plugins/skill-evolver --yes` to create the plugin virtual environment and install its dependencies.',
+   );
+ }
+ function resolvePython() {
+   const venv = venvPythonPath();
+   if (fs.existsSync(venv)) return venv;
+   throw missingPluginVenvError(venv);

skillName: skill.name,
repoRoot,
limit: 500,
includeOtherSkills: false,

Copilot AI Apr 17, 2026


handleList() computes failure rates by calling extractTraces() for each skill. extractTraces() opens SQLite and scans transcript files, so this becomes very slow on real installs. Consider a lightweight aggregate query grouped by skill_name (no transcript loading), or add an extractTraces({ includeTranscripts: false }) mode and use it here.

Suggested change:

  includeOtherSkills: false,
+ includeTranscripts: false,

Comment on lines +64 to +68
best_raw = result.get("bestVariantRaw")
if best_raw:
from skill_evolver.skill_module import load_skill
from io import StringIO # noqa: F401 - kept for parity with SKILL parser imports


Copilot AI Apr 17, 2026


Report generation parses bestVariantRaw with string splitting and a naive description: line extraction. This will mis-handle valid YAML (quoted values, |- blocks, multiline descriptions, reordered keys) and can produce misleading diffs in the PR/report. Prefer parsing bestVariantRaw with the same SKILL.md parser used elsewhere (e.g., factor a load_skill_from_string() helper in skill_module.py) and diff parsed description/body.
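A sketch of the factored helper this comment asks for. The stand-in below only strips surrounding quotes, enough to show why naive `description:` line splitting breaks on quoted values; a production version would reuse the plugin's real YAML parsing (names here are illustrative):

```python
# Sketch: one string-level SKILL.md parser shared by file loading and
# report diffing, instead of ad-hoc line splitting.
def load_skill_from_string(raw):
    """Return (frontmatter_dict, body) from raw SKILL.md text."""
    if not raw.startswith("---\n"):
        return {}, raw
    fm_text, body = raw[4:].split("\n---\n", 1)
    fm = {}
    for line in fm_text.split("\n"):
        # Top-level `key: value` only; indented continuation lines are skipped.
        if ":" in line and not line.startswith((" ", "\t")):
            key, _, value = line.partition(":")
            fm[key.strip()] = value.strip().strip("'\"")
    return fm, body.lstrip("\n")
```

The report can then diff `parsed["description"]` for baseline vs. variant rather than scanning raw lines.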

Comment on lines +64 to +68
pos_val = max(1, int(len(positives) * val_fraction)) if positives else 0
neg_val = max(1, int(len(negatives) * val_fraction)) if negatives else 0
val = positives[:pos_val] + negatives[:neg_val]
train = positives[pos_val:] + negatives[neg_val:]
rng.shuffle(val)

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split_triggers() can allocate all positives/negatives to val when counts are small (e.g., 1 positive => pos_val=1, leaving no positives in train). The GEPA targets use train_examples as the optimizer trainset, so an empty/degenerate train split can cause optimizer failures or no-op evolution. Adjust the split to guarantee a non-empty train set when examples exist (e.g., clamp *_val to len-1 and/or reuse train as val when too small).
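The suggested clamp can be sketched per pool (an illustrative rewrite, not the plugin's actual split_triggers):

```python
# Sketch: never let the validation slice consume the whole pool, so the
# GEPA trainset stays non-empty whenever examples exist.
def split_pool(items, val_fraction=0.2):
    """Return (train, val); val is clamped so train keeps >= 1 item."""
    if not items:
        return [], []
    n_val = max(1, int(len(items) * val_fraction))
    n_val = min(n_val, len(items) - 1)  # leave at least one for train
    if n_val == 0:
        # Single example: reuse it for both splits rather than starve train.
        return list(items), list(items)
    return items[n_val:], items[:n_val]
```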

Comment on lines +99 to +103
parts = [p for p in command.split() if p]
if not parts:
return ConstraintResult("test_suite", True, "test command empty — skipped")
try:
result = subprocess.run(

Copilot AI Apr 17, 2026


run_test_suite() splits the configured command with command.split(), which breaks quoted args and paths with spaces. Use shlex.split(command) (and a Windows-safe equivalent) or run with shell=True if shell semantics are acceptable.
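A sketch of the shlex-based fix (shlex is stdlib; posix=True keeps Unix quoting semantics, so a Windows build would still need its own handling):

```python
import shlex

# Sketch: quote-aware parsing for the configured test command, so
# `npm test -- --grep "slow suite"` keeps its quoted argument intact.
def parse_test_command(command):
    parts = shlex.split(command, posix=True)
    if not parts:
        return None  # empty command: caller skips the test-suite gate
    return parts
```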

Comment on lines +111 to +114
The plugin declares its Python deps (`dspy`, `gepa`, `click`, `rich`,
`pyyaml`, `pydantic`) in `hybridclaw.plugin.yaml`. HybridClaw's plugin
loader provisions a per-plugin `.venv` on first use — preferring `uv` if
available, otherwise `python3 -m venv`. No global install required.

Copilot AI Apr 17, 2026


The docs state that the plugin loader provisions a per-plugin .venv “on first use”, but the current core flow provisions plugin venvs / installs pipDependencies as part of hybridclaw plugin install ... (with approvals), not automatically on first use. To avoid confusing setup failures, consider updating this section to explicitly instruct users to run hybridclaw plugin install ./plugins/skill-evolver (or plugin check + plugin install --yes) to provision the venv and install pip deps before running hybridclaw skill-evolver ....

- drop `traces` from sources array when observation count below threshold
- enforce absolute HYBRIDCLAW_DATA_DIR (matches core behavior)
- strip raw transcripts from persisted traces.json to avoid PII leak
- use shell:true for testCommand; repo-relative rollback path
- actionable error when plugin .venv is missing, pointing at install flow
- normalize CRLF in SKILL.md frontmatter parser
- fix transcript lookup to include `workspace/` segment
- add includeTranscripts:false fast path so `list` skips JSONL scans
- extract load_skill_from_string helper; use it in report diffs
- clamp val split to len-1 so GEPA trainset is never empty
- shlex.split for test command (safer than naive whitespace split)
- docs: clarify plugin install flow provisions the per-plugin venv

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>