
feat: Skill Eval GitHub Action — test skills in CI #29

Draft
kmelve wants to merge 1 commit into main from feat/skill-eval-action

Conversation


kmelve (Member) commented Mar 4, 2026

Skill Eval GitHub Action

A reusable GitHub Action that tests agent skills in CI — run eval test cases against an LLM, grade responses against expectations, and catch regressions before merge.

What it does

  1. Discovers skills in the repo (scans for SKILL.md files)
  2. Loads test cases from tests/*.yml alongside each skill
  3. Runs each test — sends the prompt to an LLM with the skill loaded as system context
  4. Grades responses — binary pass/fail per expectation using LLM-as-judge (with retry on parse failure)
  5. Reports results as a GitHub Actions job summary with pass rates, regressions, and cost estimates
  6. Stores results in a Sanity dataset (optional) for baseline tracking and trend analysis
  7. Fails the build if pass rate drops below threshold or regressions are detected
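Step 7 above reduces to a small pure function over the graded results. A sketch of that gate logic follows; the names and shapes here are illustrative, not the Action's actual internals:

```typescript
// Sketch of the build gate: fail if the pass rate is below the threshold,
// or if any previously-passing test now fails (a regression).
// Names and shapes are illustrative, not the Action's real API.

interface EvalResult {
  testId: string;
  passed: boolean;
}

function shouldFailBuild(
  results: EvalResult[],
  passThreshold: number,           // e.g. 0.8, per the pass-threshold input
  baseline?: Map<string, boolean>, // testId -> passed, from a prior run
): { fail: boolean; passRate: number; regressions: string[] } {
  const passed = results.filter((r) => r.passed).length;
  const passRate = results.length === 0 ? 1 : passed / results.length;

  // A regression is a test that passed in the baseline but fails now.
  const regressions = results
    .filter((r) => !r.passed && baseline?.get(r.testId) === true)
    .map((r) => r.testId);

  return {
    fail: passRate < passThreshold || regressions.length > 0,
    passRate,
    regressions,
  };
}
```

Note that a regression fails the build even when the overall pass rate still clears the threshold, which matches the two independent failure conditions in step 7.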

Usage

Basic (single provider)

name: Skill Eval
on:
  pull_request:
    paths: ['skills/**']

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: sanity-io/agent-toolkit/actions/skill-eval@v1
        with:
          provider: anthropic
          api-key: ${{ secrets.ANTHROPIC_API_KEY }}

Multi-model testing (matrix strategy)

jobs:
  eval:
    strategy:
      matrix:
        include:
          - provider: anthropic
            model: claude-sonnet-4-20250514
            key-secret: ANTHROPIC_API_KEY
          - provider: openai
            model: gpt-4o
            key-secret: OPENAI_API_KEY
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: sanity-io/agent-toolkit/actions/skill-eval@v1
        with:
          provider: ${{ matrix.provider }}
          model: ${{ matrix.model }}
          api-key: ${{ secrets[matrix.key-secret] }}

With Sanity baseline tracking

      - uses: sanity-io/agent-toolkit/actions/skill-eval@v1
        with:
          api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          sanity-token: ${{ secrets.SANITY_TOKEN }}
          sanity-project-id: 'your-project-id'
          fail-on-regression: true

Test case format

Test cases live in tests/ alongside each skill as simple YAML:

skills/
  sanity-best-practices/
    SKILL.md
    references/
      schema.md
      groq.md
    tests/           ← NEW
      schema-advice.yml
      groq-query.yml

# tests/schema-advice.yml
prompt: "I need to create a blog post schema with title, author, and body content"
expectations:
  - "Uses defineType and defineField from sanity"
  - "Includes a slug field with source set to title"
  - "Uses reference type for author relationship"
  - "Uses array of block type for body/content field"
  - "Includes validation rules on required fields"
tags:
  - schema
  - core
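Assuming the YAML above parses to a plain object, its expected shape and a minimal structural check might look like this (hypothetical names; the real loader may differ):

```typescript
// Hypothetical shape of a parsed tests/*.yml file.
interface SkillTestCase {
  prompt: string;
  expectations: string[]; // each graded pass/fail independently
  tags?: string[];        // optional, e.g. for filtering
}

// Minimal structural validation before a test case is run.
function isValidTestCase(value: unknown): value is SkillTestCase {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.prompt === "string" &&
    v.prompt.length > 0 &&
    Array.isArray(v.expectations) &&
    v.expectations.length > 0 &&
    v.expectations.every((e) => typeof e === "string") &&
    (v.tags === undefined ||
      (Array.isArray(v.tags) && v.tags.every((t) => typeof t === "string")))
  );
}
```

Rejecting empty `expectations` up front keeps a malformed test file from silently counting as a pass.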

Key decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| LLM provider | Vercel AI SDK (`ai` package) | Multi-provider support — test skills across Claude, GPT-4o, Gemini with one Action |
| Test format | YAML files in repo | Human-readable, reviewable in PRs, no build step, aligns with GH Actions conventions |
| Baseline storage | Sanity dataset (optional) | Queryable via GROQ, supports trend analysis, shared backend with Skills Studio |
| Grading | LLM-as-judge with retry | Binary pass/fail per expectation, retries once on JSON parse failure for CI reliability |
| Cost tracking | Token usage → estimated USD | Shown in job summary, helps teams budget CI costs |
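The "retries once on JSON parse failure" behavior from the grading row can be sketched generically. The real grader.ts calls an LLM; here the judge is abstracted as any function that returns the judge's raw text:

```typescript
// Sketch of LLM-as-judge grading with one retry on JSON parse failure.
// `judge` stands in for the real LLM call in grader.ts.
interface Verdict {
  passed: boolean;
  reason: string;
}

async function gradeWithRetry(
  judge: () => Promise<string>,
  maxAttempts = 2, // initial attempt plus one retry, per the design above
): Promise<Verdict> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await judge();
    try {
      const parsed = JSON.parse(raw);
      if (typeof parsed.passed === "boolean") {
        return { passed: parsed.passed, reason: String(parsed.reason ?? "") };
      }
      lastError = new Error("judge JSON missing boolean 'passed'");
    } catch (err) {
      lastError = err; // malformed JSON: fall through and retry once
    }
  }
  throw new Error(`judge output unparseable after ${maxAttempts} attempts: ${lastError}`);
}
```

Re-asking the judge once covers the common case where the model wraps its JSON in prose, without hiding a judge that never produces valid output.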

Inputs

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| provider | No | anthropic | AI provider (anthropic, openai, google) |
| api-key | Yes | | API key for the chosen provider |
| model | No | claude-sonnet-4-20250514 | Model for skill testing |
| grader-model | No | same as model | Model for grading (can differ) |
| sanity-token | No | | Sanity token for result storage |
| sanity-project-id | No | | Sanity project ID |
| sanity-dataset | No | skill-evals | Sanity dataset name |
| skills-path | No | ./skills | Path to skills directory |
| pass-threshold | No | 0.8 | Minimum pass rate for the build to pass |
| fail-on-regression | No | true | Fail on regression vs baseline |
| max-evals-per-skill | No | 20 | Max evals per skill (cost control) |
| changed-only | No | true | Only eval changed skills in PRs |
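The cost figure shown in the job summary is derived from token counts. A sketch of that estimate follows, with made-up per-million-token prices; the real rate table lives in reporter.ts and will differ:

```typescript
// Sketch of token-usage -> USD cost estimation for the job summary.
// Prices below are illustrative placeholders, not real provider rates.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// USD per 1M tokens, keyed by model id (hypothetical values).
const PRICES: Record<string, { input: number; output: number }> = {
  "example-model": { input: 3.0, output: 15.0 },
};

function estimateCostUsd(model: string, usage: Usage): number {
  const price = PRICES[model];
  if (!price) return 0; // unknown model: report no estimate rather than guess
  return (
    (usage.inputTokens / 1_000_000) * price.input +
    (usage.outputTokens / 1_000_000) * price.output
  );
}
```

Together with max-evals-per-skill, an estimate like this lets a team put a rough upper bound on what a single PR run can cost.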

Architecture

actions/skill-eval/
  action.yml          # Action metadata
  src/
    index.ts          # Entry point — orchestrates the flow
    loader.ts         # Discovers skills, loads SKILL.md + refs + tests
    runner.ts         # Vercel AI SDK — provider-agnostic generateText()
    grader.ts         # LLM-as-judge grading with retry
    reporter.ts       # Job summary generation + cost estimation
    baseline.ts       # Sanity dataset integration for baselines
    types.ts          # Shared TypeScript interfaces
  dist/
    index.js          # Bundled with @vercel/ncc (committed)
  package.json
  tsconfig.json

Open questions for the team

  1. Should we add a skill-eval.yml workflow to this repo? We could dogfood the Action on the existing skills, but it needs an API key secret added to the repo.

  2. Sanity dataset schema — The Action stores skillEvalResult documents. Should we deploy a schema for this, or let it be schemaless?

  3. changed-only scope — Currently detects changed skills via git diff. If a shared reference file changes, only the skill containing it is re-evaluated. Should we expand to eval all skills when any reference changes?

  4. Bundle size — The dist/index.js is ~3.1MB (Vercel AI SDK + provider adapters + Sanity client). This is within normal range for JS Actions but worth noting.
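For open question 3, the current changed-only behavior amounts to mapping changed file paths onto the skill directories they fall under. A sketch over a list of changed paths; the real Action shells out to git for the diff, which is omitted here:

```typescript
// Sketch of changed-only skill selection: map changed file paths
// (e.g. from `git diff --name-only`) to the skill directories they
// fall under. Paths are assumed relative to the repo root.
function changedSkills(
  changedPaths: string[],
  skillDirs: string[], // e.g. ["skills/sanity-best-practices"]
): string[] {
  const hit = new Set<string>();
  for (const path of changedPaths) {
    for (const dir of skillDirs) {
      if (path === dir || path.startsWith(dir + "/")) hit.add(dir);
    }
  }
  return [...hit].sort();
}
```

Under this scheme a reference file shared across skills would select only the skill whose directory contains it, which is exactly the gap open question 3 raises.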

Sample test cases included

  • skills/sanity-best-practices/tests/schema-advice.yml — Tests schema creation advice (5 expectations)
  • skills/sanity-best-practices/tests/groq-query.yml — Tests GROQ query generation (5 expectations)

Relation to Skills Studio

This Action and the Skills Studio eval system are complementary:

  • GitHub Action — CI gate, runs on PR, catches regressions, multi-model testing
  • Studio Skill Lab — Interactive, run on demand, Anthropic model selection, results in Sanity
  • Both can share the same Sanity dataset for unified result tracking

@socket-security

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action: Warn, Severity: High
License policy violation: npm typescript

License: LicenseRef-W3C-Community-Final-Specification-Agreement - the applicable license policy does not allow this license (4) (package/ThirdPartyNoticeText.txt)

From: actions/skill-eval/package-lock.json (npm/typescript@5.9.3)


Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at support@socket.dev.

Suggestion: Find a package that does not violate your license policy or adjust your policy to allow this package's license.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore npm/typescript@5.9.3. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

