The complete 7-phase pipeline specification for automated skill generation.
Adapted from CLI-Anything's HARNESS.md methodology, applied to skill generation rather than CLI generation.
Three non-negotiable principles:
- Use the real target — don't approximate. When analyzing a CLI tool, actually run `--help`. When analyzing an API, fetch the real spec. Never guess at capabilities.
- Verify outputs programmatically. Don't assume a generated SKILL.md is correct because the generation succeeded. Validate structure, run trigger evals, benchmark against baselines.
- Fail with clear messages. When analysis can't determine the target type, when a required dependency is missing, when packaging fails — provide actionable error messages that enable self-correction.
SkillAnything auto-detects the target type using these heuristics (in order):
- URL with OpenAPI/Swagger indicators → API
- Executable found in PATH → CLI tool
- Package name resolvable via pip/npm → Library
- File with step-by-step structure → Workflow
- URL with web interface → Service
- Free-text description → Ask user to clarify
When confidence < 0.7, fall back to interactive mode and ask the user.
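A minimal sketch of that detection order; `looks_like_openapi`, `resolvable_package`, and `has_step_structure` are placeholder helpers, not functions from the repo:

```python
import shutil
from pathlib import Path

CONFIDENCE_THRESHOLD = 0.7  # below this, the caller falls back to interactive mode

def detect_target_type(target: str) -> tuple[str, float]:
    """Apply the heuristics in order and return (target_type, confidence)."""
    if target.startswith(("http://", "https://")):
        if looks_like_openapi(target):       # hypothetical: probe /openapi.json, /swagger.json
            return "api", 0.9
        return "service", 0.65               # URL with a generic web interface
    if shutil.which(target):                 # executable found in PATH
        return "cli", 0.95
    if resolvable_package(target):           # hypothetical: pip/npm index lookup
        return "library", 0.85
    path = Path(target)
    if path.is_file() and has_step_structure(path.read_text()):  # hypothetical parser
        return "workflow", 0.75
    return "unknown", 0.0                    # free-text description: ask the user
```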
API targets:
- Fetch OpenAPI spec (or scrape REST docs)
- Extract: endpoints, methods, auth patterns, request/response schemas, error codes
- Identify: CRUD patterns, resource relationships, pagination
- Priority: most-used endpoints, unique capabilities
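For API targets, endpoint extraction can lean on the standard OpenAPI document shape (paths → methods → operation objects). A hedged sketch:

```python
import json
from urllib.request import urlopen

def extract_endpoints(spec_url: str) -> list[dict]:
    """Enumerate endpoints from an OpenAPI document."""
    spec = json.load(urlopen(spec_url))
    endpoints = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            if method not in ("get", "post", "put", "patch", "delete"):
                continue  # skip non-method keys (parameters, servers) on the path object
            endpoints.append({
                "path": path,
                "method": method.upper(),
                "summary": op.get("summary", ""),
                "auth": bool(op.get("security", spec.get("security"))),
                "responses": sorted(op.get("responses", {})),  # status codes
            })
    return endpoints
```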
CLI targets:
- Run: `<tool> --help`, `<tool> <subcommand> --help`, `man <tool>`
- Parse: subcommands, flags, options, output formats
- Identify: stateful vs stateless operations, piping patterns
- Priority: most common operations, unique capabilities
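A sketch of the help-capture step; note that many tools print help to stderr or exit non-zero, so both streams are kept:

```python
import subprocess

def capture_help(tool: str, subcommand: str | None = None) -> str:
    """Run `<tool> [<subcommand>] --help` and return whatever the tool prints."""
    cmd = [tool] + ([subcommand] if subcommand else []) + ["--help"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # Fail with a clear, actionable message (principle 3).
        raise RuntimeError(f"cannot analyze {tool!r}: {exc}") from exc
    return result.stdout or result.stderr
```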
Library targets:
- Read: README, API docs, docstrings, type hints
- Parse: public functions, classes, configuration options
- Identify: initialization patterns, common workflows
- Priority: frequently imported functions, core capabilities
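For libraries, `importlib` plus `inspect` covers most of the parsing, as sketched below (docstrings and signatures only; type-hint analysis omitted):

```python
import importlib
import inspect

def public_api(package_name: str) -> list[dict]:
    """List public callables with their signatures and first docstring line."""
    module = importlib.import_module(package_name)
    api = []
    for name, obj in inspect.getmembers(module):
        if name.startswith("_") or not callable(obj):
            continue
        try:
            sig = str(inspect.signature(obj))
        except (TypeError, ValueError):  # some builtins lack introspectable signatures
            sig = "(...)"
        doc = (inspect.getdoc(obj) or "").split("\n")[0]
        api.append({"name": name, "signature": sig, "doc": doc})
    return api
```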
Workflow targets:
- Parse: step descriptions, tool references, data flow
- Identify: sequential vs parallel steps, decision points, error handling
- Map: inputs → transformations → outputs at each step
- Priority: critical path steps, error-prone steps
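A rough sketch of step extraction, assuming steps appear as numbered or bulleted lines and tool references use backticks (both are assumptions, not guarantees of the real parser):

```python
import re

STEP_RE = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+(.*)", re.MULTILINE)
TOOL_RE = re.compile(r"`([^`]+)`")  # tool references conventionally in backticks

def parse_steps(doc: str) -> list[dict]:
    """Extract ordered steps and the tools each step mentions."""
    steps = []
    for order, match in enumerate(STEP_RE.finditer(doc), start=1):
        text = match.group(1)
        steps.append({"order": order, "text": text, "tools": TOOL_RE.findall(text)})
    return steps
```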
Service targets:
- Scrape: documentation, getting started guides, API reference
- Identify: authentication, core actions, webhooks/events
- Map: user journey → API calls → outcomes
- Priority: onboarding flow, most-used features
Input: Target identifier (URL, name, path, or description)
Output: analysis.json
```json
{
  "target_name": "jq",
  "target_type": "cli",
  "confidence": 0.95,
  "capabilities": [
    {"name": "filter", "description": "Filter JSON with expressions", "complexity": "core"},
    {"name": "transform", "description": "Transform JSON structure", "complexity": "core"}
  ],
  "inputs": ["JSON from stdin", "JSON files"],
  "outputs": ["Filtered/transformed JSON"],
  "auth_requirements": null,
  "dependencies": ["jq >= 1.6"],
  "error_patterns": ["parse error", "null output"],
  "raw_help": "...",
  "documentation_urls": []
}
```

Script: scripts/analyze_target.py
Agent: agents/analyzer.md — handles ambiguous cases requiring intelligence
Input: analysis.json
Output: architecture.json
```json
{
  "skill_name": "jq-helper",
  "skill_type": "tool-augmentation",
  "structure": {
    "skill_md_sections": ["overview", "usage", "examples", "advanced"],
    "scripts": ["validate_jq.py"],
    "references": ["jq-patterns.md"],
    "assets": []
  },
  "triggers": ["jq", "json filter", "json transform", "parse json"],
  "platforms": {
    "claude-code": {"hooks": []},
    "openclaw": {},
    "codex": {"interface": {"display_name": "jq Helper"}}
  }
}
```

Script: scripts/design_skill.py
Agent: agents/designer.md — makes architectural decisions
Input: architecture.json + analysis.json
Output: Complete skill directory
The implementer generates:
- SKILL.md with proper frontmatter and instructions (< 500 lines)
- Helper scripts for deterministic operations
- Reference files for detailed documentation
- Configuration defaults
Script: scripts/init_skill.py (scaffolding) + agents/implementer.md (content)
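A sketch of what the scaffolding step might do with the architecture.json example above; the frontmatter layout (name/description fields) is assumed rather than specified here:

```python
import json
from pathlib import Path

def scaffold(arch_path: str, out_dir: str = ".") -> Path:
    """Create the skill directory skeleton described by architecture.json."""
    arch = json.loads(Path(arch_path).read_text())
    root = Path(out_dir) / arch["skill_name"]
    for sub in ("scripts", "references", "assets"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    frontmatter = (
        "---\n"
        f"name: {arch['skill_name']}\n"
        "description: TODO - filled in by the implementer agent\n"
        "---\n\n"
    )
    sections = "".join(
        f"## {s.title()}\n\nTODO\n\n" for s in arch["structure"]["skill_md_sections"]
    )
    (root / "SKILL.md").write_text(frontmatter + sections)
    return root
```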
Writing principles (from Anthropic skill-creator):
- Explain the WHY, not just the WHAT
- Use imperative form in instructions
- Avoid heavy-handed MUSTs — use theory of mind
- Keep SKILL.md lean with progressive disclosure
- Bundle repeated work as scripts
Input: analysis.json + generated skill
Output: evals/evals.json + trigger-evals.json
Generates two types of test cases:
- Functional evals (5-8 cases): Realistic user prompts with expected outcomes and assertions
- Trigger evals (20 cases): 10 should-trigger + 10 should-not-trigger for description optimization
Quality standards for trigger evals (from Anthropic):
- Realistic, detailed, with file paths and personal context
- Mix of lengths, formality levels, and phrasings
- Include abbreviations, typos, casual speech
- Focus on edge cases over clear-cut cases
- Should-not-trigger cases are near-misses, not obviously irrelevant
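Illustrative cases only (field names and queries below are invented for this example, not taken from a real trigger-evals.json):

```python
# Should-trigger: realistic, varied phrasing, with personal context and typos.
should_trigger = [
    {"query": "can you use jq to pull the 'errors' array out of logs/app.json?"},
    {"query": "huge JSON blob from the API, i just need the user ids - json filter pls"},
]
# Should-not-trigger: near-misses that are JSON-adjacent but out of scope.
should_not_trigger = [
    {"query": "why is my JSON.parse throwing in this JavaScript snippet?"},
    {"query": "write me a JSON schema for the config file format we discussed"},
]
```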
Script: scripts/generate_tests.py
Input: Generated skill + eval cases
Output: benchmark.json, grading.json per run, HTML viewer
Full evaluation flow:
- Spawn with-skill and baseline runs in parallel
- Grade each run with agents/grader.md
- Aggregate into benchmark statistics
- Run analyst pass with agents/analyzer.md
- Launch eval-viewer for user review
- Collect feedback
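A sketch of the spawn-and-grade step using a thread pool; `run_agent` and `grade` stand in for the real run and grading machinery:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pair(case: dict) -> dict:
    """Run one eval case with and without the skill, then grade both transcripts."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        with_skill = pool.submit(run_agent, case, skill=True)   # hypothetical runner
        baseline = pool.submit(run_agent, case, skill=False)
        return {
            "case": case["id"],
            "with_skill": grade(with_skill.result()),  # grading via agents/grader.md
            "baseline": grade(baseline.result()),
        }
```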
Scripts: run_eval.py, aggregate_benchmark.py, generate_report.py
Viewer: eval-viewer/generate_review.py
Input: Eval results + feedback
Output: Optimized SKILL.md
Two optimization loops:
- Content optimization: Improve skill instructions based on functional eval feedback
- Description optimization: Improve triggering accuracy via train/test split loop
Description optimization process:
- Split trigger evals 60% train / 40% test
- Evaluate current description (3 runs per query)
- Call Claude with extended thinking to propose improvements
- Re-evaluate on train + test
- Iterate up to 5 times
- Select best by TEST score (not train) to prevent overfitting
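A sketch of the loop, with `trigger_score` and `propose_improvement` as stand-ins for the real evaluation and Claude calls:

```python
import random

def optimize_description(description: str, cases: list[dict], max_iters: int = 5) -> str:
    """Train/test split loop; the winner is picked by TEST score to avoid overfitting."""
    random.shuffle(cases)
    cut = int(len(cases) * 0.6)                        # 60% train / 40% test
    train, test = cases[:cut], cases[cut:]
    best, best_test = description, trigger_score(description, test)
    for _ in range(max_iters):
        candidate = propose_improvement(best, train)   # hypothetical: Claude + extended thinking
        train_score = trigger_score(candidate, train)  # reported, but not used for selection
        test_score = trigger_score(candidate, test)
        if test_score > best_test:                     # select by held-out score only
            best, best_test = candidate, test_score
    return best
```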
Scripts: improve_description.py, run_loop.py
Agent: agents/optimizer.md
Input: Optimized skill
Output: Platform-specific packages in dist/
```
dist/
├── claude-code/<skill-name>/
│   ├── SKILL.md (with hooks in frontmatter)
│   └── scripts/, references/, assets/
├── openclaw/<skill-name>/
│   ├── SKILL.md
│   └── scripts/, references/, assets/
├── codex/<skill-name>/
│   ├── SKILL.md
│   ├── agents/openai.yaml
│   └── scripts/, references/, assets/
├── generic/<skill-name>.skill
└── manifest.json
```
Script: scripts/package_multiplatform.py
Agent: agents/packager.md
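A hedged sketch of the packaging pass; `apply_platform_extras` stands in for the per-platform tweaks (hooks frontmatter for claude-code, agents/openai.yaml for codex):

```python
import shutil
from pathlib import Path

PLATFORMS = ("claude-code", "openclaw", "codex")

def package(skill_dir: Path, dist: Path = Path("dist")) -> None:
    """Copy the optimized skill into per-platform trees, then zip a generic bundle."""
    name = skill_dir.name
    for platform in PLATFORMS:
        target = dist / platform / name
        shutil.copytree(skill_dir, target, dirs_exist_ok=True)
        apply_platform_extras(platform, target)  # hypothetical per-platform adjustments
    (dist / "generic").mkdir(parents=True, exist_ok=True)
    archive = shutil.make_archive(str(dist / "generic" / name), "zip", skill_dir)
    Path(archive).rename(dist / "generic" / f"{name}.skill")
```

Writing manifest.json (aggregate metadata for all packages) would follow; its schema isn't specified here.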
A well-generated skill should:
- Pass >80% of functional eval assertions
- Trigger correctly for >90% of trigger eval queries
- Have a description under 1024 characters
- Have SKILL.md under 500 lines
- Include at least 3 functional test cases
- Work on all enabled target platforms
- Properly attribute any bundled third-party content
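These checks are mechanical, so they can be enforced in code. A sketch, with the benchmark field names and the frontmatter helper assumed:

```python
from pathlib import Path

def check_quality_bar(skill_dir: Path, benchmark: dict) -> list[str]:
    """Return the list of failed quality-bar checks (empty means the skill passes)."""
    skill_md = (skill_dir / "SKILL.md").read_text()
    description = extract_frontmatter_field(skill_md, "description")  # hypothetical helper
    failures = []
    if benchmark["functional_pass_rate"] <= 0.80:
        failures.append("functional eval pass rate <= 80%")
    if benchmark["trigger_accuracy"] <= 0.90:
        failures.append("trigger accuracy <= 90%")
    if len(description) >= 1024:
        failures.append("description is 1024+ characters")
    if len(skill_md.splitlines()) >= 500:
        failures.append("SKILL.md is 500+ lines")
    if benchmark["functional_case_count"] < 3:
        failures.append("fewer than 3 functional test cases")
    return failures
```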