Extracts JSON schemas from Kubernetes CustomResourceDefinitions (CRDs) published by upstream operators and controllers.
The extractor reads source configuration files, fetches CRDs from Helm charts (HTTP and OCI), raw URLs, or GitHub repository archives, extracts the `openAPIV3Schema` from each served CRD version, and writes standalone JSON Schema files. Each run also produces provenance metadata for every schema and a CycloneDX SBOM covering the full run.
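The core extraction step can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the CRD has already been converted from YAML to JSON so that only the standard library is needed, and the `crd` type and `servedSchemas` function are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// crd models only the CustomResourceDefinition fields this sketch needs.
type crd struct {
	Spec struct {
		Versions []struct {
			Name   string `json:"name"`
			Served bool   `json:"served"`
			Schema struct {
				OpenAPIV3Schema json.RawMessage `json:"openAPIV3Schema"`
			} `json:"schema"`
		} `json:"versions"`
	} `json:"spec"`
}

// servedSchemas returns the openAPIV3Schema of every served version,
// keyed by version name. Versions that are not served are skipped.
func servedSchemas(doc []byte) (map[string]json.RawMessage, error) {
	var c crd
	if err := json.Unmarshal(doc, &c); err != nil {
		return nil, fmt.Errorf("parse CRD: %w", err)
	}
	out := make(map[string]json.RawMessage)
	for _, v := range c.Spec.Versions {
		if v.Served && len(v.Schema.OpenAPIV3Schema) > 0 {
			out[v.Name] = v.Schema.OpenAPIV3Schema
		}
	}
	return out, nil
}

// sampleCRD is a made-up, heavily trimmed CRD used only for illustration.
const sampleCRD = `{
  "spec": {
    "versions": [
      {"name": "v1", "served": true, "schema": {"openAPIV3Schema": {"type": "object"}}},
      {"name": "v1alpha1", "served": false, "schema": {"openAPIV3Schema": {"type": "object"}}}
    ]
  }
}`

func main() {
	schemas, err := servedSchemas([]byte(sampleCRD))
	if err != nil {
		panic(err)
	}
	for name, s := range schemas {
		fmt.Printf("%s: %s\n", name, s)
	}
}
```

Keeping the schema as `json.RawMessage` avoids re-encoding it, so the written file carries the upstream content unchanged.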
- Nix (recommended): provides Go and all tooling via `nix develop`
- Or manually: Go 1.25+
No external tools required at runtime. Helm chart fetching (both HTTP repos and OCI registries), GitHub archive downloads, and tarball extraction are all implemented in pure Go.
# Enter the dev shell (provides go, check-jsonschema, yq, goreleaser)
nix develop
# Run all tests
go test ./...
# Build via Nix
nix build
./result/bin/crd-schema-extractor version

The CLI accepts a path argument (directory of YAML files or a single file) and defaults to sources/ when omitted.
# Extract schemas from all configured sources (default: sources/ -> schemas/)
crd-schema-extractor
# Extract from a specific directory or file
crd-schema-extractor extract sources/
crd-schema-extractor extract sources/cert-manager.io.yaml
# Extract a single source with debug logging
crd-schema-extractor extract sources/cert-manager.io.yaml --debug
# Custom output directory
crd-schema-extractor extract sources/ -o /tmp/schemas
# Parallel fetching (default: 4 concurrent sources)
crd-schema-extractor extract sources/ -p 8
# Fetch only (download charts/manifests to output dir, skip extraction)
crd-schema-extractor extract sources/cert-manager.io.yaml --fetch-only
# Validate source config files
crd-schema-extractor validate sources/
crd-schema-extractor validate sources/cert-manager.io.yaml
# Print version
crd-schema-extractor version

When running from source with go run:
go run ./cmd/crd-schema-extractor/ extract sources/cert-manager.io.yaml --debug
go run ./cmd/crd-schema-extractor/ validate sources/

- Source configs in `sources/*.yaml` declare upstream CRD locations (Helm chart, URL, or Git repository)
- The extractor fetches sources in parallel (configurable via `--parallel`), scans for CRD documents, and extracts the `openAPIV3Schema` from every served version
- Include/exclude filters narrow down which CRDs are kept
- Cross-source conflict detection catches the same group/kind/version with different content
- Identical duplicates are deduplicated (first occurrence wins)
- Schemas are written to `schemas/{group}/{apiVersion}/{kind}.json` with SHA-256 change detection
- Provenance metadata (`.provenance.json`) and a CycloneDX SBOM (`sbom.cdx.json`) are generated alongside
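The conflict detection, deduplication, and output layout steps above could look roughly like the sketch below. The `schema` type, `dedupe`, and `outputPath` are hypothetical names, and comparing SHA-256 digests of the serialized content is an assumption about how "different content" is decided; the real code in `conflict.go` may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
)

// schema is a hypothetical record for one extracted CRD version.
type schema struct {
	Group, Version, Kind string
	Source               string // name of the source config entry it came from
	Content              []byte // serialized JSON Schema
}

// key identifies a schema independently of the source that produced it.
func (s schema) key() string { return s.Group + "/" + s.Kind + "/" + s.Version }

// outputPath mirrors the schemas/{group}/{apiVersion}/{kind}.json layout.
func (s schema) outputPath() string {
	return path.Join("schemas", s.Group, s.Version, s.Kind+".json")
}

// digest returns the hex-encoded SHA-256 of b.
func digest(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// dedupe keeps the first occurrence of each group/kind/version, silently
// drops identical repeats, and reports repeats whose content differs.
func dedupe(in []schema) (kept []schema, conflicts []string) {
	seen := make(map[string]schema)
	for _, s := range in {
		first, ok := seen[s.key()]
		if !ok {
			seen[s.key()] = s
			kept = append(kept, s)
			continue
		}
		if digest(first.Content) != digest(s.Content) {
			conflicts = append(conflicts,
				fmt.Sprintf("%s: %s vs %s", s.key(), first.Source, s.Source))
		}
	}
	return kept, conflicts
}

func main() {
	a := schema{"cert-manager.io", "v1", "Certificate", "one", []byte(`{"type":"object"}`)}
	b := a
	b.Source = "two" // identical content from another source: deduplicated
	c := a
	c.Source = "three"
	c.Content = []byte(`{"type":"string"}`) // different content: conflict

	kept, conflicts := dedupe([]schema{a, b, c})
	for _, s := range kept {
		fmt.Println("write", s.outputPath())
	}
	for _, msg := range conflicts {
		fmt.Println("conflict", msg)
	}
}
```

The same digest could also serve the change detection step: comparing it with the digest of the file already on disk would let unchanged schemas be skipped.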
Helm charts are scanned across all directories: `crds/`, `templates/` (with Go template directive stripping), and any non-standard locations. This handles charts that place CRDs in different paths without requiring `helm template` at processing time.
Git repository sources download a tag archive from GitHub and recursively scan for CRD YAML files. An optional path field restricts scanning to a specific subdirectory for performance on large repos.
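The effect of the `path` field can be illustrated with a small filter over tarball entry names. GitHub archives wrap the repository in a single top-level directory, so the sketch removes that first segment before comparing. The `inScanPath` function is hypothetical, and treating both `.yaml` and `.yml` as candidates is an assumption.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// inScanPath reports whether a tarball entry should be scanned for CRDs.
// entry is the name inside the archive; subdir is the optional path field.
func inScanPath(entry, subdir string) bool {
	// Drop the archive's top-level directory (for example "repo-1.2.3/").
	_, rel, ok := strings.Cut(entry, "/")
	if !ok {
		return false
	}
	// Only YAML files are candidates (assumption: both extensions count).
	if ext := path.Ext(rel); ext != ".yaml" && ext != ".yml" {
		return false
	}
	if subdir == "" {
		return true // no restriction: scan the whole repository
	}
	// Compare against "subdir/" so that "cluster/crds" does not
	// accidentally match "cluster/crds-extra".
	prefix := strings.Trim(path.Clean(subdir), "/") + "/"
	return strings.HasPrefix(rel, prefix)
}

func main() {
	entries := []string{
		"crossplane-1.17.2/cluster/crds/pkg.crossplane.io_providers.yaml",
		"crossplane-1.17.2/test/e2e/manifests/example.yaml",
		"crossplane-1.17.2/README.md",
	}
	for _, e := range entries {
		fmt.Println(e, inScanPath(e, "cluster/crds"))
	}
}
```

Skipping entries by name before reading their contents is what makes the restriction a performance win on large repositories.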
Create or edit a YAML file in sources/ named after the primary API group (e.g., cert-manager.io.yaml):
# Helm chart source
sources:
  - name: cert-manager
    type: helm
    repo: https://charts.jetstack.io
    chart: cert-manager
    version: v1.17.2
    license: Apache-2.0
    homepage: https://cert-manager.io
    values:                   # optional: helm --set key=value pairs
      crds.enabled: "true"
    include:                  # optional: allowlist (Kind, group/Kind, group/*)
      - "cert-manager.io/*"
    exclude:                  # optional: denylist (same syntax)
      - SomeKind

# Git repository source (for projects with CRDs committed in non-standard locations)
sources:
  - name: crossplane
    type: git+github
    repo: https://github.com/crossplane/crossplane
    version: v1.17.2
    license: Apache-2.0
    homepage: https://crossplane.io
    path: cluster/crds        # optional: restrict scan to subdirectory
    include:
      - "apiextensions.crossplane.io/*"
      - "pkg.crossplane.io/*"

Source types:
| Type | Required fields | Description |
|---|---|---|
| `helm` | `repo`, `chart` | Helm chart from HTTP or OCI (`oci://` prefix) repository |
| `url` | `url` | Direct HTTP URL to a YAML manifest containing CRDs |
| `git+github` | `repo` | GitHub repository archive at a specific tag. Supports optional `path` to restrict scanning. Uses `GITHUB_TOKEN` env var for authenticated requests |
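The `include` and `exclude` lists in the source examples accept three pattern forms: `Kind`, `group/Kind`, and `group/*`. A minimal matcher for them might look like the sketch below. The function names are hypothetical, and three behaviors are assumptions rather than documented facts: matching is case-sensitive, an empty `include` list allows everything, and `exclude` is applied after `include`.

```go
package main

import (
	"fmt"
	"strings"
)

// matches reports whether pattern selects the CRD identified by group and kind.
// Supported forms: "Kind", "group/Kind", and "group/*".
func matches(pattern, group, kind string) bool {
	g, k, hasGroup := strings.Cut(pattern, "/")
	if !hasGroup {
		return pattern == kind // bare Kind matches in any group
	}
	return g == group && (k == "*" || k == kind)
}

// keep applies include as an allowlist (empty means allow all),
// then exclude as a denylist.
func keep(include, exclude []string, group, kind string) bool {
	allowed := len(include) == 0
	for _, p := range include {
		if matches(p, group, kind) {
			allowed = true
			break
		}
	}
	if !allowed {
		return false
	}
	for _, p := range exclude {
		if matches(p, group, kind) {
			return false
		}
	}
	return true
}

func main() {
	include := []string{"cert-manager.io/*"}
	exclude := []string{"SomeKind"}
	fmt.Println(keep(include, exclude, "cert-manager.io", "Certificate"))
	fmt.Println(keep(include, exclude, "cert-manager.io", "SomeKind"))
	fmt.Println(keep(include, exclude, "acme.cert-manager.io", "Order"))
}
```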
Source configs can be validated with the built-in command or against source.schema.json:
crd-schema-extractor validate sources/
check-jsonschema --schemafile source.schema.json sources/*.yaml

cmd/crd-schema-extractor/
  main.go                     CLI entrypoint (cobra root command, version subcommand)
  extract.go                  extract subcommand (parallel fetch + extract + write)
  validate.go                 validate subcommand (source config validation)
internal/
  source/source.go            Source config parsing (directory or single file)
  fetcher/
    fetcher.go                Fetcher interface, Result type, factory
    helm_http.go              Pure Go HTTP Helm repo fetcher (index.yaml + tarball)
    helm_oci.go               OCI registry fetcher using oras-go
    url.go                    URL fetcher with retry
    git.go                    GitHub archive fetcher (tarball download + auth)
    untar.go                  Tarball extraction utilities
  extractor/
    extractor.go              Extract() pipeline entry point, CRD parser
    process.go                Chart scanning, template stripping dispatch
    strip.go                  Go template directive removal
    filter.go                 Include/exclude filtering
    conflict.go               Cross-source conflict detection, dedup
  provenance/provenance.go    Per-schema provenance metadata
  sbom/sbom.go                CycloneDX 1.5 SBOM generation
sources/                      Source config files ({api-group}.yaml)
schemas/                      Output directory
source.schema.json            JSON Schema for source configs (included in releases)
flake.nix                     Nix flake (build + dev shell)
.goreleaser.yaml              Release configuration

Three GitHub Actions workflows:
- test.yml (on PR): runs `go vet`, `go test -race`, and an E2E test that extracts schemas from cert-manager (helm) and Crossplane (git+github) to verify the full pipeline
- release-drafter.yml (on push to main): auto-maintains a draft GitHub release with changelog from merged PRs
- release.yaml (on tag push): runs goreleaser to build binaries for linux/darwin (amd64/arm64) and attach them to the release along with `source.schema.json` and checksums

Release flow: PRs merge to main, release-drafter updates the draft. When the draft is published, the tag is created, triggering goreleaser to build and attach artifacts.
Renovate is configured to open weekly PRs bumping Helm chart versions and GitHub release URL versions in the source configs.
See individual source license fields for upstream CRD licensing. The extractor tool itself is available under the terms specified in the repository.