
feat: Skill Eval GitHub Action — test skills in CI #29

Draft
kmelve wants to merge 1 commit into main from feat/skill-eval-action

Conversation


kmelve (Member) commented Mar 4, 2026

Skill Eval GitHub Action

A reusable GitHub Action that tests agent skills in CI — run eval test cases against an LLM, grade responses against expectations, and catch regressions before merge.

What it does

  1. Discovers skills in the repo (scans for SKILL.md files)
  2. Loads test cases from tests/*.yml alongside each skill
  3. Runs each test — sends the prompt to an LLM with the skill loaded as system context
  4. Grades responses — binary pass/fail per expectation using LLM-as-judge (with retry on parse failure)
  5. Reports results as a GitHub Actions job summary with pass rates, regressions, and cost estimates
  6. Stores results in a Sanity dataset (optional) for baseline tracking and trend analysis
  7. Fails the build if pass rate drops below threshold or regressions are detected
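Step 7 above reduces to a small pure function over the graded results. A sketch of that gate logic follows; the names and shapes here are illustrative, not the Action's actual internals:

```typescript
// Sketch of the build gate: fail if the pass rate is below the threshold,
// or if any previously-passing test now fails (a regression).
// Names and shapes are illustrative, not the Action's real API.

interface EvalResult {
  testId: string;
  passed: boolean;
}

function shouldFailBuild(
  results: EvalResult[],
  passThreshold: number,           // e.g. 0.8, per the pass-threshold input
  baseline?: Map<string, boolean>, // testId -> passed, from a prior run
): { fail: boolean; passRate: number; regressions: string[] } {
  const passed = results.filter((r) => r.passed).length;
  const passRate = results.length === 0 ? 1 : passed / results.length;

  // A regression is a test that passed in the baseline but fails now.
  const regressions = results
    .filter((r) => !r.passed && baseline?.get(r.testId) === true)
    .map((r) => r.testId);

  return {
    fail: passRate < passThreshold || regressions.length > 0,
    passRate,
    regressions,
  };
}
```

Note that a regression fails the build even when the overall pass rate still clears the threshold, which matches the two independent failure conditions in step 7.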

Usage

Basic (single provider)

name: Skill Eval
on:
  pull_request:
    paths: ['skills/**']

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: sanity-io/agent-toolkit/actions/skill-eval@v1
        with:
          provider: anthropic
          api-key: ${{ secrets.ANTHROPIC_API_KEY }}

Multi-model testing (matrix strategy)

jobs:
  eval:
    strategy:
      matrix:
        include:
          - provider: anthropic
            model: claude-sonnet-4-20250514
            key-secret: ANTHROPIC_API_KEY
          - provider: openai
            model: gpt-4o
            key-secret: OPENAI_API_KEY
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: sanity-io/agent-toolkit/actions/skill-eval@v1
        with:
          provider: ${{ matrix.provider }}
          model: ${{ matrix.model }}
          api-key: ${{ secrets[matrix.key-secret] }}

With Sanity baseline tracking

      - uses: sanity-io/agent-toolkit/actions/skill-eval@v1
        with:
          api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          sanity-token: ${{ secrets.SANITY_TOKEN }}
          sanity-project-id: 'your-project-id'
          fail-on-regression: true

Test case format

Test cases live in tests/ alongside each skill as simple YAML:

skills/
  sanity-best-practices/
    SKILL.md
    references/
      schema.md
      groq.md
    tests/           ← NEW
      schema-advice.yml
      groq-query.yml

# tests/schema-advice.yml
prompt: "I need to create a blog post schema with title, author, and body content"
expectations:
  - "Uses defineType and defineField from sanity"
  - "Includes a slug field with source set to title"
  - "Uses reference type for author relationship"
  - "Uses array of block type for body/content field"
  - "Includes validation rules on required fields"
tags:
  - schema
  - core
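Assuming the YAML above parses to a plain object, its expected shape and a minimal structural check might look like this (hypothetical names; the real loader may differ):

```typescript
// Hypothetical shape of a parsed tests/*.yml file.
interface SkillTestCase {
  prompt: string;
  expectations: string[]; // each graded pass/fail independently
  tags?: string[];        // optional, e.g. for filtering
}

// Minimal structural validation before a test case is run.
function isValidTestCase(value: unknown): value is SkillTestCase {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.prompt === "string" &&
    v.prompt.length > 0 &&
    Array.isArray(v.expectations) &&
    v.expectations.length > 0 &&
    v.expectations.every((e) => typeof e === "string") &&
    (v.tags === undefined ||
      (Array.isArray(v.tags) && v.tags.every((t) => typeof t === "string")))
  );
}
```

Rejecting empty `expectations` up front keeps a malformed test file from silently counting as a pass.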

Key decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| LLM provider | Vercel AI SDK (`ai` package) | Multi-provider support — test skills across Claude, GPT-4o, Gemini with one Action |
| Test format | YAML files in repo | Human-readable, reviewable in PRs, no build step, aligns with GH Actions conventions |
| Baseline storage | Sanity dataset (optional) | Queryable via GROQ, supports trend analysis, shared backend with Skills Studio |
| Grading | LLM-as-judge with retry | Binary pass/fail per expectation, retries once on JSON parse failure for CI reliability |
| Cost tracking | Token usage → estimated USD | Shown in job summary, helps teams budget CI costs |
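The "retries once on JSON parse failure" behavior from the grading row can be sketched generically. The real grader.ts calls an LLM; here the judge is abstracted as any function that returns the judge's raw text:

```typescript
// Sketch of LLM-as-judge grading with one retry on JSON parse failure.
// `judge` stands in for the real LLM call in grader.ts.
interface Verdict {
  passed: boolean;
  reason: string;
}

async function gradeWithRetry(
  judge: () => Promise<string>,
  maxAttempts = 2, // initial attempt plus one retry, per the design above
): Promise<Verdict> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await judge();
    try {
      const parsed = JSON.parse(raw);
      if (typeof parsed.passed === "boolean") {
        return { passed: parsed.passed, reason: String(parsed.reason ?? "") };
      }
      lastError = new Error("judge JSON missing boolean 'passed'");
    } catch (err) {
      lastError = err; // malformed JSON: fall through and retry once
    }
  }
  throw new Error(`judge output unparseable after ${maxAttempts} attempts: ${lastError}`);
}
```

Re-asking the judge once covers the common case where the model wraps its JSON in prose, without hiding a judge that never produces valid output.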

Inputs

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| provider | No | anthropic | AI provider (anthropic, openai, google) |
| api-key | Yes | | API key for the chosen provider |
| model | No | claude-sonnet-4-20250514 | Model for skill testing |
| grader-model | No | same as model | Model for grading (can differ) |
| sanity-token | No | | Sanity token for result storage |
| sanity-project-id | No | | Sanity project ID |
| sanity-dataset | No | skill-evals | Sanity dataset name |
| skills-path | No | ./skills | Path to skills directory |
| pass-threshold | No | 0.8 | Minimum pass rate for the build to pass |
| fail-on-regression | No | true | Fail on regression vs baseline |
| max-evals-per-skill | No | 20 | Max evals per skill (cost control) |
| changed-only | No | true | Only eval changed skills in PRs |
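The cost figure shown in the job summary is derived from token counts. A sketch of that estimate follows, with made-up per-million-token prices; the real rate table lives in reporter.ts and will differ:

```typescript
// Sketch of token-usage -> USD cost estimation for the job summary.
// Prices below are illustrative placeholders, not real provider rates.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// USD per 1M tokens, keyed by model id (hypothetical values).
const PRICES: Record<string, { input: number; output: number }> = {
  "example-model": { input: 3.0, output: 15.0 },
};

function estimateCostUsd(model: string, usage: Usage): number {
  const price = PRICES[model];
  if (!price) return 0; // unknown model: report no estimate rather than guess
  return (
    (usage.inputTokens / 1_000_000) * price.input +
    (usage.outputTokens / 1_000_000) * price.output
  );
}
```

Together with max-evals-per-skill, an estimate like this lets a team put a rough upper bound on what a single PR run can cost.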

Architecture

actions/skill-eval/
  action.yml          # Action metadata
  src/
    index.ts          # Entry point — orchestrates the flow
    loader.ts         # Discovers skills, loads SKILL.md + refs + tests
    runner.ts         # Vercel AI SDK — provider-agnostic generateText()
    grader.ts         # LLM-as-judge grading with retry
    reporter.ts       # Job summary generation + cost estimation
    baseline.ts       # Sanity dataset integration for baselines
    types.ts          # Shared TypeScript interfaces
  dist/
    index.js          # Bundled with @vercel/ncc (committed)
  package.json
  tsconfig.json

Open questions for the team

  1. Should we add a skill-eval.yml workflow to this repo? We could dogfood the Action on the existing skills, but it needs an API key secret added to the repo.

  2. Sanity dataset schema — The Action stores skillEvalResult documents. Should we deploy a schema for this, or let it be schemaless?

  3. changed-only scope — Currently detects changed skills via git diff. If a shared reference file changes, only the skill containing it is re-evaluated. Should we expand to eval all skills when any reference changes?

  4. Bundle size — The dist/index.js is ~3.1MB (Vercel AI SDK + provider adapters + Sanity client). This is within normal range for JS Actions but worth noting.
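For open question 3, the current changed-only behavior amounts to mapping changed file paths onto the skill directories they fall under. A sketch over a list of changed paths; the real Action shells out to git for the diff, which is omitted here:

```typescript
// Sketch of changed-only skill selection: map changed file paths
// (e.g. from `git diff --name-only`) to the skill directories they
// fall under. Paths are assumed relative to the repo root.
function changedSkills(
  changedPaths: string[],
  skillDirs: string[], // e.g. ["skills/sanity-best-practices"]
): string[] {
  const hit = new Set<string>();
  for (const path of changedPaths) {
    for (const dir of skillDirs) {
      if (path === dir || path.startsWith(dir + "/")) hit.add(dir);
    }
  }
  return [...hit].sort();
}
```

Under this scheme a reference file shared across skills would select only the skill whose directory contains it, which is exactly the gap open question 3 raises.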

Sample test cases included

  • skills/sanity-best-practices/tests/schema-advice.yml — Tests schema creation advice (5 expectations)
  • skills/sanity-best-practices/tests/groq-query.yml — Tests GROQ query generation (5 expectations)

Relation to Skills Studio

This Action and the Skills Studio eval system are complementary:

  • GitHub Action — CI gate, runs on PR, catches regressions, multi-model testing
  • Studio Skill Lab — Interactive, run on demand, Anthropic model selection, results in Sanity
  • Both can share the same Sanity dataset for unified result tracking

@socket-security

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action: Warn, Severity: High
License policy violation: npm typescript

License: LicenseRef-W3C-Community-Final-Specification-Agreement - the applicable license policy does not allow this license (4) (package/ThirdPartyNoticeText.txt)

From: actions/skill-eval/package-lock.json (npm/typescript@5.9.3)


Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at support@socket.dev.

Suggestion: Find a package that does not violate your license policy or adjust your policy to allow this package's license.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore npm/typescript@5.9.3. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

