A canonical skill library and quality system for AI coding agents — Codex, Claude Code, and Gemini.
Edit a skill once. Run `just sync`. It propagates everywhere.
- Why this exists
- What you get
- Quickstart
- How it works
- Repository layout
- Creating a skill
- Skill quality system
- Governance and safety
- Managed asset lifecycle
- Limits and constraints
- Documentation
Running the same skill across three AI runtimes without a shared source means they diverge. Evaluating skill quality by hand doesn't scale. Letting agents self-modify skills without human gates is unsafe.
This repository solves all three:
- One source → skills authored here, projected to all runtimes via `just sync`
- Automated quality → tiered CI gates score every skill change against a benchmark baseline
- Human-gated improvement → the Skill Genome Loop proposes changes; humans approve every promotion
One canonical library, symlinked to every runtime on sync:
| Runtime | Install location | Index format |
|---|---|---|
| Codex | `~/.codex/skills/` | Native skill folders |
| Claude Code | `~/.claude/skills/` | Native skill folders |
| Gemini / Antigravity | `~/.gemini/antigravity/skills/` | Folder + `skills.txt` index |
Skills are organized by domain: `auth/`, `backend/`, `frontend/`, `github/`, `ops/`, `product/`, `utilities/`. Running `just sync` also projects MCP server configs from `~/.codex/config.toml` into Antigravity via `scripts/sync_mcp.py`.
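For intuition, the projection step can be sketched like this (a simplified illustration with hypothetical helper names; the real behavior lives in `just sync` and `scripts/sync_mcp.py`):

```python
from pathlib import Path

# Hypothetical sketch: symlink each domain skill folder into a runtime
# directory and optionally write the flat skills.txt index that the
# Gemini/Antigravity projection uses.
DOMAINS = ["auth", "backend", "frontend", "github", "ops", "product", "utilities"]

def project_skills(repo: Path, runtime_dir: Path, write_index: bool = False) -> list[str]:
    runtime_dir.mkdir(parents=True, exist_ok=True)
    names = []
    for domain in DOMAINS:
        # Every folder containing a SKILL.md is a skill
        for skill_md in sorted((repo / domain).glob("*/SKILL.md")):
            src = skill_md.parent
            link = runtime_dir / src.name
            if link.is_symlink():
                link.unlink()  # re-sync: replace stale links
            link.symlink_to(src, target_is_directory=True)
            names.append(src.name)
    if write_index:  # Gemini/Antigravity wants a flat index file
        (runtime_dir / "skills.txt").write_text("\n".join(names) + "\n")
    return names
```

The `write_index` branch mirrors the Gemini/Antigravity row in the table above; the other runtimes consume the symlinked folders directly.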
`utilities/skill-builder/scripts/skill_router.py` routes natural-language queries to the most relevant skill:

```bash
python3 utilities/skill-builder/scripts/skill_router.py \
  --query "set up Better Auth for my Next.js app" \
  --top-k 3 \
  --json
```

The router:
- Scores by token overlap, path context, and explicit name mention
- Emits confidence scores and human-readable rationale
- Runs OpenClaw readiness + security checks on high-risk skills before routing
- Appends routing telemetry to `artifacts/skill-graphs/telemetry/route-events.jsonl`
- Respects kill-switch and rollout-mode control files
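A toy version of the lexical scoring described above might look like this (function name and weights are illustrative assumptions, not the router's actual implementation):

```python
import re

def score_skill(query: str, skill_name: str, skill_path: str, description: str) -> float:
    """Toy relevance score: token overlap + path context + explicit name mention."""
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    desc_tokens = set(re.findall(r"[a-z0-9]+", description.lower()))
    overlap = len(tokens & desc_tokens) / max(len(tokens), 1)   # token overlap
    path_hit = any(part in tokens for part in skill_path.lower().split("/"))
    name_hit = skill_name.lower() in query.lower()              # explicit mention
    return round(overlap + 0.25 * path_hit + 0.5 * name_hit, 3)
```

Ranking candidates by such a score and attaching the per-term contributions is one simple way to get both a confidence number and a human-readable rationale.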
Every SKILL.md change runs two gate tiers:
Tier 1 — Structure gate (always runs):
- Validates YAML frontmatter fields, category, description length
- Compares against a baseline JSON; fails on regressions
- Benchmarks portfolio coverage against `benchmark-policy.json`
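A Tier-1 structure check can be sketched roughly as follows (simplified; the field rules are inferred from the frontmatter template shown later in this README, and the real gate lives in the repo's quality scripts):

```python
import re

REQUIRED_FIELDS = ("name", "description")
CATEGORIES = {"auth", "backend", "frontend", "github", "ops", "product", "utilities"}
MAX_DESC = 80  # per the frontmatter template

def check_frontmatter(skill_md: str) -> list[str]:
    """Return a list of Tier-1 structure violations (empty list = pass)."""
    errors = []
    m = re.match(r"^---\n(.*?)\n---", skill_md, re.DOTALL)
    if not m:
        return ["missing YAML frontmatter block"]
    # Naive key: value parse, enough for a flat illustration
    fields = dict(line.split(":", 1) for line in m.group(1).splitlines() if ":" in line)
    fields = {k.strip(): v.strip().strip('"') for k, v in fields.items()}
    for field in REQUIRED_FIELDS:
        if not fields.get(field):
            errors.append(f"missing required field: {field}")
    if len(fields.get("description", "")) > MAX_DESC:
        errors.append(f"description exceeds {MAX_DESC} chars")
    if "category" in fields and fields["category"] not in CATEGORIES:
        errors.append(f"unknown category: {fields['category']}")
    return errors
```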
Tier 2 — Eval baseline (on workflow_dispatch):
- Runs `run_skill_evals.py` per skill with Codex and/or Claude-Kimi/ZAI in dual-run mode
- Captures JSONL traces and builds a scorecard dashboard (`artifacts/reports/skills/dashboard.json`)
```bash
# Run locally (Tier 1 equivalent)
just diagnose

# Full quality suite
just validate
```

The loop runs every Monday (or on demand) against a set of pilot profiles:
generate → evaluate → diagnose → improve → re-score
Each run writes canonical artifacts under `artifacts/skill-graphs/runs/<run_id>/`:
| Artifact | What it records |
|---|---|
| `run.json` | Run metadata, status, stop reason |
| `iteration_journal.jsonl` | Per-iteration scores and rationale |
| `promotion_decision.json` | Whether the run meets promotion criteria |
| `lesson_candidates.json` | Generalizable lessons from this run |
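For illustration, a promotion-readiness check over these artifacts might read like this (field names and thresholds are assumptions; the actual criteria live in the loop's promotion scripts):

```python
import json
from pathlib import Path

MIN_SCORE = 0.82    # assumed composite-score gate
MIN_WINDOWS = 2     # assumed minimum evaluation windows

def promotion_ready(run_dir: Path) -> bool:
    """Hypothetical read of a run's artifacts to decide promotion readiness."""
    run = json.loads((run_dir / "run.json").read_text())
    if run.get("status") != "completed":
        return False  # incomplete runs never promote
    decision = json.loads((run_dir / "promotion_decision.json").read_text())
    return (
        decision.get("composite_score", 0.0) >= MIN_SCORE
        and decision.get("window_count", 0) >= MIN_WINDOWS
    )
```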
Human promotion gate: before any lesson or skill change is merged, a human operator runs `scripts/human_promote_recursive_run.sh`. The script validates the run ID, enforces the approver-allowlist policy (`docs/skill-graphs/governance/recursive-loop-approvers.yaml`), checks the policy signature, and guards against path-traversal attacks.
```bash
# Dry-run: see what the genome loop would propose
just genome-loop

# Live run (stages candidates for review)
just genome-loop-live

# Review and approve a candidate
python3 scripts/review_candidates.py --list
python3 scripts/review_candidates.py --approve <candidate_id>
```

Safety controls:
| Control | How to use |
|---|---|
| Kill-switch | `touch artifacts/skill-graphs/controls/kill-switch.txt` |
| Rollout mode | `echo active > artifacts/skill-graphs/controls/rollout-mode.txt` |
| Rollback required | `touch artifacts/skill-graphs/controls/rollback-required.txt` |
| Confidence gate | `composite_score ≥ 0.82`, `window_count ≥ 2` |
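A guard over these control files could be sketched as follows (illustrative only; the loop's own checks are authoritative):

```python
from pathlib import Path

CONTROLS = Path("artifacts/skill-graphs/controls")

def loop_allowed(controls: Path = CONTROLS) -> tuple[bool, str]:
    """Return (allowed, reason) based on the control files' presence and contents."""
    if (controls / "kill-switch.txt").exists():
        return False, "kill-switch engaged"
    if (controls / "rollback-required.txt").exists():
        return False, "rollback required before new runs"
    mode_file = controls / "rollout-mode.txt"
    mode = mode_file.read_text().strip() if mode_file.exists() else "inactive"
    if mode != "active":
        return False, f"rollout mode is {mode!r}"
    return True, "ok"
```

Checking file presence rather than config values keeps the kill-switch a one-command operation with no parser in the hot path.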
```bash
# Test kill-switch and rollback behavior
just rollout-drill

# View router telemetry summary
just router-metrics
```

```bash
# Convert any plan/doc → browser-native HTML visual
just smoke-slides

# Daily skill health spotlight
just spotlight

# Domain quality scoreboard (ui / backend / security / …)
just subject-scoreboard
```

Registered visual skills: `visual-explainer` (self-contained HTML pages), `diagram-cli` (Mermaid + context packs), `slides` (PPTX from markdown). If a table reaches 4+ rows or 3+ columns, render it as HTML, not ASCII.
```bash
# Show system status
just status

# Run all validations
just validate

# Count active skills
just count-skills

# Sync to all runtime directories (including MCP config projection)
just sync

# Run full CI bundle locally
just ci-local

# Create a skill from template
mkdir -p frontend/my-skill
cp templates/SKILL.md.template frontend/my-skill/SKILL.md
```

```bash
just --list             # All available recipes
just status             # System health overview
just validate           # Full validation suite
just diagnose           # Skill diagnostics (all skills)
just sync               # Project skills + MCP config to runtimes
just genome-loop        # Dry-run improvement loop
just genome-loop-live   # Live improvement loop
just spotlight          # Daily health spotlight (one skill needing attention)
just subject-scoreboard # Domain-level quality metrics
just rollout-drill      # Kill-switch + rollback resilience test
just router-metrics     # Routing telemetry analysis
just watch-readiness    # Agentation watch-mode readiness check
just smoke-slides       # Visual explainer smoke test
just docs-lint          # Documentation policy check
just harness-check      # coding-harness preflight gate (strict)
just install-cron       # Set up nightly genome loop cron
```

```mermaid
graph TD
  A["Skill authored in\ndomain folder"] -->|"just sync"| B["Symlinked to\nCodex / Claude / Gemini"]
  A -->|"PR opened"| C["CI quality gates\n12 workflows"]
  C --> D["Tier 1: structure gate\n+ benchmark check"]
  C --> E["Tier 2: dual-run evals\nCodex + Claude-Kimi"]
  C --> F["Security scan\nCodeQL + Semgrep + Trivy"]
  C --> G["Greptile AI review\ncode analysis"]
  A -->|"Monday 1am UTC"| H["Shadow cycle\nrun recursive loop"]
  H --> I["Telemetry JSONL\n+ failure candidates"]
  I -->|"genome loop"| J["Draft candidates\nconfidence ≥ 0.82"]
  J -->|"human reviews"| K["promotion gate\napprover allowlist + sig"]
  K -->|"approved"| L["Lesson merged\nto main"]
  C --> M["Recursive promotion gate\nvalidates promotion artifacts"]
  M -->|"high-risk files"| N["evidence-verify\nstage required"]
```
```text
skill-name/
├── SKILL.md              # Required: YAML frontmatter + instructions
├── references/
│   ├── evals.yaml        # Optional: eval cases for Tier 2 scoring
│   └── contract.yaml     # Optional: harness contract overrides
└── scripts/              # Optional: supporting Python/shell scripts
```
```yaml
---
name: skill-name
description: "One-line description, max 80 chars"
metadata:
  category: frontend | backend | product | utilities | auth | ops | github
  tags: [tag1, tag2]
---
```

Four skills support a structured pre-change dialogue:
| Posture | What happens |
|---|---|
| `learn` | Agent explains alternatives, assumptions, and risks first |
| `guided` | Agent proposes concrete changes and waits for confirmation |
| `execute` | Agent applies agreed changes after safety gates pass |
Pilot skills: `skill-builder`, `agentation`, `systematic-debugging`, `interview-me`
```text
~/dev/agent-skills/
├── auth/                   # Authentication skills
├── backend/                # Backend, architecture, CLI skills
├── frontend/               # Frontend + UI + graphics + tools
├── github/                 # GitHub and DevOps workflow skills
├── interview/              # Interview and requirements workflows
├── ops/                    # Deployment and operational skills
├── product/                # Planning, specs, docs skills
├── utilities/              # General-purpose skills + skill-builder tooling
│   └── skill-builder/
│       └── scripts/        # Router, quality gates, eval runner, dashboard
├── .agents/skills/         # Flat symlink view (agent entrypoint)
├── skills-antigravity/     # Antigravity-specific projection
├── skills-system/          # System skills (not in flat view)
├── scripts/                # Repo-level tooling (sync, genome loop, promote)
├── references/             # Shared contracts (evals.yaml, contract.yaml)
├── templates/              # SKILL.md and eval templates
├── artifacts/              # Generated outputs (benchmarks, telemetry, reports)
├── docs/                   # Contributor documentation
│   └── skill-graphs/       # Loop governance, runbooks, pilot summaries
└── harness.contract.json   # Risk-tier and merge policy contract
```
| Workflow | Trigger | What it enforces |
|---|---|---|
| `pr-pipeline` | Every PR | PR template, repo validate, harness preflight gate |
| `ci-tests` | Push to main + PR | Docs lint, skill diagnostics |
| `skill-quality` | SKILL.md changes | Tier-1 structure + benchmark; Tier-2 dual-run evals |
| `recursive-promotion-gate` | Promotion artifact changes | Validates promotion decisions, strict-runs check |
| `recursive-skill-shadow` | Monday 1am UTC + dispatch | Runs shadow cycle, uploads failure-pattern candidates |
| `benchmark-policy-refresh` | Monday 7am UTC + dispatch | Context7-backed threshold ratchet, auto-opens PR |
| `greptile-review` | Every PR | AI-assisted code review checks |
| `security-scan` | Every PR | Semgrep + Trivy CVE scanning |
| `codeql` | Push to main + PR | CodeQL static analysis (Python, TypeScript) |
| `secret-scan` | Every PR | Gitleaks secret detection |
| `docs-governance` | Docs/governance changes | Link integrity, policy conformance |
| `gov-security-gates` | Governance/compliance changes | Policy file integrity checks |
The contract (v1.2.0) defines concrete policies applied on every PR:
- Risk tiers: `scripts/**` and `.github/workflows/**` → high; `**/SKILL.md` → medium; `README.md` → low
- Merge policy by tier: high requires `review-gate` + `evidence-verify`; medium requires `review-gate`
- Diff budget: max 10 files, max 400 net LOC (overridable with the `diff-budget-override` label)
- Memory policy: sessions require `repo`, `area`, and `type` tags; forbids credentials in stored observations
- Branch protection: PRs to `main`, `master`, and `release/*` are blocked by default
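The risk-tier rules can be sketched with Python's `fnmatch` (a simplified first-match model; the contract's real matcher may differ in glob semantics and defaults):

```python
from fnmatch import fnmatch

# Ordered rules mirroring the contract: first match wins.
RISK_RULES = [
    ("scripts/**", "high"),
    (".github/workflows/**", "high"),
    ("**/SKILL.md", "medium"),
    ("README.md", "low"),
]

def risk_tier(path: str) -> str:
    for pattern, tier in RISK_RULES:
        if fnmatch(path, pattern):
            return tier
    return "low"  # assumed default for unlisted paths

def pr_tier(changed_paths: list[str]) -> str:
    """A PR takes the highest tier among its changed files."""
    order = {"low": 0, "medium": 1, "high": 2}
    return max((risk_tier(p) for p in changed_paths), key=order.get, default="low")
```

Under the merge policy above, a PR whose `pr_tier` is high would then require both the `review-gate` and `evidence-verify` stages.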
- Approver allowlist: `docs/skill-graphs/governance/recursive-loop-approvers.yaml` (signature-verified)
- Run confinement: promotion scripts enforce `confine_run_dir()`; run directories must stay within `artifacts/skill-graphs/runs/`
- Secret redaction: `run_skill_genome_loop.py` scrubs OpenAI keys, GitHub PATs, Slack tokens, SSH keys, AWS keys, JWTs, and IP addresses before writing any candidate
- Kill-switch: one file write halts the genome loop immediately
- OpenClaw guard: skill router runs readiness and security checks before routing to high-risk skills
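The secret-redaction pass might be approximated like this (illustrative patterns only; the scrubber in `run_skill_genome_loop.py` covers more token formats than shown here):

```python
import re

# Illustrative patterns; the real scrubber is broader.
SECRET_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "[REDACTED_OPENAI_KEY]"),
    (re.compile(r"ghp_[A-Za-z0-9]{36}"), "[REDACTED_GITHUB_PAT]"),
    (re.compile(r"xox[baprs]-[A-Za-z0-9-]{10,}"), "[REDACTED_SLACK_TOKEN]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
]

def scrub(text: str) -> str:
    """Replace anything matching a secret pattern before persisting a candidate."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Running redaction before any candidate is written means telemetry and draft artifacts never contain raw credentials, even transiently.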
Phase-one managed asset governance keeps one lifecycle contract across:
- canonical skills
- packaged skills
- plugin packages
Phase-one defaults:
- authoritative lifecycle metadata stays in-file
- Markdown-governed assets use canonical `SKILL.md` frontmatter
- plugin packages use `.codex-plugin/plugin.json`
- packaged skills inherit lifecycle metadata from the canonical source skill when a one-to-one mapping exists
- `docs/solutions/` entries need linked assets, concrete evidence, ownership context, and freshness markers
Reference:
```bash
# Check for nested .git (most common cause)
just check-nested-git

# Re-run sync
just sync

# Diagnose specific skill
python3 scripts/diagnose_skill.py <skill-name>

# Check YAML frontmatter has both 'name:' and 'description:'
head -5 <skill-dir>/SKILL.md
```

```bash
# Docs lint
python3 scripts/docs_lint.py --mode warn --config docs-policy.json

# Plan graph validation
python3 ~/.codex/scripts/plan-graph-lint.py .agent/PLANS.md

# Skill router schema check
python3 scripts/verify_router_schema.py
```

```bash
# Check active control files
ls artifacts/skill-graphs/controls/

# Check watermark (last processed offset)
cat artifacts/skill-graphs/telemetry/.genome-watermark

# Run rollout drill to confirm kill-switch works
just rollout-drill
```

| Capability | Current state |
|---|---|
| Skill isolation | Per-folder (no sandboxing between skills) |
| Versioning | Repo-level only (no per-skill semver) |
| Language | English only |
| Sync | Local symlinks — use git for cross-machine distribution |
| Eval runner auth | Tier-2 evals require codex and/or claude CLI + auth on the runner |
- Skills index — auto-generated list of the current surfaced skills with descriptions
- Contributor docs — how to add, validate, and ship skills
- Governed solutions — reusable fixes and decisions linked to governed assets
- Skill Genome runbook — operating the improvement loop
- Agent governance — security policy and audit trail
- License: Apache 2.0 (LICENSE)
- Contributing: CONTRIBUTING.md
- Security: SECURITY.md
- Code of Conduct: CODE_OF_CONDUCT.md
brAInwav — from demo to duty
- Create or update a plan in `.agent/PLANS.md`
- Validate: `python3 ~/.codex/scripts/plan-graph-lint.py .agent/PLANS.md`
- Verify: `bash ~/.codex/scripts/verify-work.sh`