
feat: add LRU validation cache with OpenTelemetry metrics #1936

Draft
nissessenap wants to merge 13 commits into sigstore:main from nissessenap:metrics


nissessenap commented Feb 17, 2026

Summary

  • Add an opt-in LRU+TTL validation result cache (--enable-cache, --cache-size, --cache-ttl CLI flags) that caches successful policy validation results per image/CIP/resourceVersion, avoiding redundant signature verification on repeated admission requests
  • Add four OpenTelemetry metrics for cache observability: cache.operations (hit/miss counter), cache.writes (stored/skipped counter), cache.entries (observable gauge via cache.Len()), and cache.evictions (counter)
  • Bump knative.dev/pkg to pick up OTEL v1.39.0 transitive dependencies

Depends on #1933 and #1935

Once the other PRs are merged, I will revisit this one. The metric names should probably change slightly so that other metrics can be added later.

Details

The cache is disabled by default (NoCache no-op implementation). When enabled via --enable-cache=true, an LRUCache is created and injected into the context. Only successful validations (non-nil PolicyResult) are cached; failed validations are never cached to allow retries.
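To make the caching rule concrete, here is a minimal standard-library sketch of an LRU+TTL cache that stores only successful results. It is not the PR's LRUCache (which builds on hashicorp/golang-lru/v2 expirable); the names sketchCache, cacheKey, and fakeClock, the simplified PolicyResult and CacheResult types, and the `|`-separated key layout are all assumptions made for illustration.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

// Simplified stand-ins for the webhook types (assumed shapes).
type PolicyResult struct{ Summary string }

type CacheResult struct {
	PolicyResult *PolicyResult
	Errors       []error
}

type entry struct {
	key     string
	value   *CacheResult
	expires time.Time
}

// fakeClock is a hypothetical injectable clock so TTL expiry can be shown
// without sleeping.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

// sketchCache is an illustrative LRU+TTL cache, not the PR's implementation.
type sketchCache struct {
	mu    sync.Mutex
	size  int
	ttl   time.Duration
	now   func() time.Time
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func newSketchCache(size int, ttl time.Duration, now func() time.Time) *sketchCache {
	return &sketchCache{size: size, ttl: ttl, now: now,
		order: list.New(), items: map[string]*list.Element{}}
}

// cacheKey joins image, CIP name, and resourceVersion (assumed layout).
func cacheKey(image, cipName, resourceVersion string) string {
	return image + "|" + cipName + "|" + resourceVersion
}

// Set stores only successful validations and reports whether it stored.
func (c *sketchCache) Set(key string, r *CacheResult) bool {
	if r == nil || r.PolicyResult == nil {
		return false // failed validation: skip so the next request retries
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		c.order.Remove(el)
	}
	c.items[key] = c.order.PushFront(&entry{key, r, c.now().Add(c.ttl)})
	if c.order.Len() > c.size {
		oldest := c.order.Back() // least recently used
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	return true
}

// Get returns a live entry and marks it as most recently used.
func (c *sketchCache) Get(key string) (*CacheResult, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if !c.now().Before(e.expires) { // TTL elapsed
		c.order.Remove(el)
		delete(c.items, key)
		return nil, false
	}
	c.order.MoveToFront(el)
	return e.value, true
}

func main() {
	clk := &fakeClock{}
	c := newSketchCache(2, time.Minute, clk.Now)
	key := cacheKey("example.com/app@sha256:abc", "my-cip", "41")

	fmt.Println(c.Set(key, &CacheResult{}))                             // false
	fmt.Println(c.Set(key, &CacheResult{PolicyResult: &PolicyResult{}})) // true
	_, hit := c.Get(key)
	fmt.Println(hit) // true
	_, hit = c.Get(cacheKey("example.com/app@sha256:abc", "my-cip", "42"))
	fmt.Println(hit) // false: a new resourceVersion is a different key
	clk.Advance(2 * time.Minute)
	_, hit = c.Get(key)
	fmt.Println(hit) // false: entry expired
}
```

Because resourceVersion is part of the key, an updated policy simply misses the cache instead of requiring explicit invalidation.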

Cache metrics use the global OTEL MeterProvider set up by knative's sharedmain. The cache.entries gauge uses an observable callback reading cache.Len() at collection time, avoiding race conditions from manual increment/decrement bookkeeping. Each LRUCache instance owns its gauge registration lifecycle via a Close() method.
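The callback-gauge idea can be modeled without the OTEL SDK. The sketch below uses a hypothetical in-process registry in place of a real meter: the gauge value is read from Len() only when collect() runs, and Close() removes the callback, which mirrors the role of Unregister on an OTEL metric.Registration. The names registry, registration, trackedCache, and collect are illustrative assumptions, not the PR's code.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for a meter: it only stores callbacks
// and invokes them at collection time.
type registry struct {
	mu        sync.Mutex
	nextID    int
	callbacks map[int]func() int64
}

// registration lets the owner remove its callback again.
type registration struct {
	r  *registry
	id int
}

func newRegistry() *registry {
	return &registry{callbacks: map[int]func() int64{}}
}

func (r *registry) registerGauge(read func() int64) *registration {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nextID++
	r.callbacks[r.nextID] = read
	return &registration{r: r, id: r.nextID}
}

func (reg *registration) Unregister() {
	reg.r.mu.Lock()
	defer reg.r.mu.Unlock()
	delete(reg.r.callbacks, reg.id)
}

// collect snapshots the callbacks, then reads each gauge value.
func (r *registry) collect() []int64 {
	r.mu.Lock()
	cbs := make([]func() int64, 0, len(r.callbacks))
	for _, cb := range r.callbacks {
		cbs = append(cbs, cb)
	}
	r.mu.Unlock()

	values := make([]int64, 0, len(cbs))
	for _, cb := range cbs {
		values = append(values, cb())
	}
	return values
}

// trackedCache owns its gauge registration, so no global state is needed.
type trackedCache struct {
	mu    sync.Mutex
	items map[string]struct{}
	gauge *registration
}

func newTrackedCache(r *registry) *trackedCache {
	c := &trackedCache{items: map[string]struct{}{}}
	// The gauge is never incremented or decremented by hand; it is derived
	// from Len() each time metrics are collected.
	c.gauge = r.registerGauge(func() int64 { return int64(c.Len()) })
	return c
}

func (c *trackedCache) Add(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = struct{}{}
}

func (c *trackedCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.items)
}

// Close removes the callback so it does not outlive the cache.
func (c *trackedCache) Close() { c.gauge.Unregister() }

func main() {
	reg := newRegistry()
	c := newTrackedCache(reg)
	c.Add("a")
	c.Add("b")
	fmt.Println(reg.collect()) // [2]
	c.Close()
	fmt.Println(len(reg.collect())) // 0
}
```

Deriving the value at collection time means there is no separate counter that can drift from the real entry count when evictions, expiries, and writes interleave.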

Test plan

  • Unit tests for cache behavior (hit/miss, TTL expiry, eviction, key isolation, resource version invalidation, skips-errors, partial success)
  • Unit tests for all four OTEL metrics using SDK ManualReader with exact value assertions
  • Integration tests validating cache interaction with ValidatePolicy flow
  • go test ./pkg/webhook/ passes
  • go vet ./pkg/webhook/ clean
  • Manual: deploy with --enable-cache=true and metrics-protocol: prometheus, verify /metrics on :9090 includes cache_operations_total, cache_writes_total, cache_entries, cache_evictions_total
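The manual check in the last item could be partly scripted. The sketch below scans a Prometheus text-format body for the four metric names listed above; missingMetrics is a hypothetical helper, the sample body is made up for the example, and the exact-name match assumes the exporter adds no prefix to these names.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics returns the names in want that have no sample line in a
// Prometheus text-format body. Comment lines (# HELP, # TYPE) are ignored.
func missingMetrics(body string, want []string) []string {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or space (value).
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		seen[name] = true
	}
	var missing []string
	for _, w := range want {
		if !seen[w] {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	// Made-up sample body; in practice this would be the response from
	// the /metrics endpoint on :9090.
	body := `# TYPE cache_operations_total counter
cache_operations_total{result="hit"} 3
cache_operations_total{result="miss"} 1
cache_writes_total{result="stored"} 1
cache_entries 1
`
	want := []string{
		"cache_operations_total",
		"cache_writes_total",
		"cache_entries",
		"cache_evictions_total",
	}
	fmt.Println(missingMetrics(body, want)) // [cache_evictions_total]
}
```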

🤖 Generated with Claude Code

nissessenap and others added 13 commits February 16, 2026 14:12
Add tests for the upcoming LRU+TTL cache implementation (sigstore#647).
Unit tests cover set/get, TTL expiry, eviction, key isolation,
error skipping, and resource version invalidation. Integration
tests verify ValidatePolicy cache hit/miss behavior.

All tests currently fail to compile (NewLRUCache undefined),
confirming the TDD Red phase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Edvin Norling <edvin.norling@kognic.com>
Add LRUCache implementing ResultCache using hashicorp/golang-lru/v2
expirable. Only successful validations (PolicyResult non-nil) are
cached; failed validations are skipped to allow immediate retries.

Fix cache key mismatch bug: ref.Name() in Set vs ref.String() in Get
caused cache to never hit. Both now use ref.String().
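A small sketch of why the mismatch meant the cache never hit, and why routing both paths through one key function avoids it. fakeRef is a hypothetical type whose Name() and String() return different text for the same image; it does not claim to reproduce how any real reference type formats itself, and keyFor with its `|` separator is likewise an assumption.

```go
package main

import "fmt"

// imageRef models a reference with two textual forms.
type imageRef interface {
	Name() string
	String() string
}

// fakeRef is a hypothetical reference whose two forms differ.
type fakeRef struct{ name, original string }

func (r fakeRef) Name() string   { return r.name }
func (r fakeRef) String() string { return r.original }

// keyFor is the single place where the cache key is built, so Get and Set
// cannot disagree about the format.
func keyFor(ref imageRef, cip, resourceVersion string) string {
	return ref.String() + "|" + cip + "|" + resourceVersion
}

func main() {
	ref := fakeRef{name: "registry.example/app:latest", original: "app"}
	cache := map[string]string{}

	// Mismatched: write with Name(), read with String().
	cache[ref.Name()+"|cip|1"] = "pass"
	_, hit := cache[ref.String()+"|cip|1"]
	fmt.Println("mismatched keys hit:", hit) // false

	// Consistent: both sides use keyFor.
	cache[keyFor(ref, "cip", "1")] = "pass"
	_, hit = cache[keyFor(ref, "cip", "1")]
	fmt.Println("shared key function hit:", hit) // true
}
```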

Move cache Set into ValidatePolicy so caching is self-contained
regardless of call path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Edvin Norling <edvin.norling@kognic.com>
Wire the LRU+TTL cache into the validating webhook via --enable-cache,
--cache-size, and --cache-ttl flags. Cache is off by default and only
injected into the validating admission controller context.
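For illustration, the three flags could be parsed with the standard flag package roughly as follows. The value types (int for --cache-size, duration for --cache-ttl), the default values, the validation rule, and the names cacheConfig and parseCacheFlags are assumptions; only the flag names and the off-by-default behavior come from this PR.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"time"
)

// cacheConfig is a hypothetical holder for the parsed flag values.
type cacheConfig struct {
	Enabled bool
	Size    int
	TTL     time.Duration
}

// parseCacheFlags parses the cache flags from args. The defaults for size
// and TTL are placeholders, not the values used by the PR.
func parseCacheFlags(args []string) (cacheConfig, error) {
	var cfg cacheConfig
	fs := flag.NewFlagSet("webhook", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // return errors instead of printing usage
	fs.BoolVar(&cfg.Enabled, "enable-cache", false, "enable the validation result cache")
	fs.IntVar(&cfg.Size, "cache-size", 1024, "maximum number of cached results")
	fs.DurationVar(&cfg.TTL, "cache-ttl", 5*time.Minute, "lifetime of a cached result")
	if err := fs.Parse(args); err != nil {
		return cacheConfig{}, err
	}
	// Assumed sanity check: an enabled cache needs positive limits.
	if cfg.Enabled && (cfg.Size <= 0 || cfg.TTL <= 0) {
		return cacheConfig{}, fmt.Errorf("cache-size and cache-ttl must be positive")
	}
	return cfg, nil
}

func main() {
	cfg, err := parseCacheFlags([]string{"--enable-cache=true", "--cache-size=500", "--cache-ttl=30s"})
	fmt.Println(cfg.Enabled, cfg.Size, cfg.TTL, err) // true 500 30s <nil>

	cfg, err = parseCacheFlags(nil)
	fmt.Println(cfg.Enabled, err) // false <nil>
}
```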

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Edvin Norling <edvin.norling@kognic.com>
- Copy CacheResult.Errors slice in LRUCache.Set to prevent callers from
  mutating cached entries through the shared backing array
- Update copyright year to 2026 on new files (lrucache.go, lrucache_test.go)
- Extract cacheTestFixtures helper to reduce boilerplate in cache
  integration tests
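The slice-copy fix in the first item guards against a general Go pitfall: copying a struct copies the slice header but not the backing array. A minimal sketch, using a simplified CacheResult and hypothetical helpers (simpleErr, shallowCopy, copyResult) rather than the PR's actual code:

```go
package main

import "fmt"

// simpleErr is a tiny error type used only for this example.
type simpleErr string

func (e simpleErr) Error() string { return string(e) }

// CacheResult is a simplified stand-in for the cached value.
type CacheResult struct {
	Errors []error
}

// shallowCopy copies the struct, but its Errors slice still points at the
// caller's backing array.
func shallowCopy(r CacheResult) CacheResult { return r }

// copyResult also copies the slice contents, so later writes by the caller
// cannot reach the stored entry.
func copyResult(r CacheResult) CacheResult {
	out := r
	if r.Errors != nil {
		out.Errors = make([]error, len(r.Errors))
		copy(out.Errors, r.Errors)
	}
	return out
}

func main() {
	errs := []error{simpleErr("warning: key rotated")}
	in := CacheResult{Errors: errs}

	shared := shallowCopy(in)
	safe := copyResult(in)

	errs[0] = simpleErr("mutated by caller")

	fmt.Println(shared.Errors[0]) // mutated by caller
	fmt.Println(safe.Errors[0])   // warning: key rotated
}
```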

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Edvin Norling <edvin.norling@kognic.com>
Move cache observability into LRUCache.Get() so the implementation
owns its own logging
Bump knative.dev/pkg from v0.0.0-20230612155445 to
v0.0.0-20260213150858 to enable OTEL metrics support.

Remove stale replace directives for k8s.io/code-generator and
k8s.io/kube-openapi that were pinning old incompatible versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add four OTEL metrics to track cache behavior: cache.operations
(hit/miss counter), cache.writes (stored/skipped counter),
cache.entries (observable gauge via cache.Len()), and cache.evictions
(counter). Uses the global OTEL MeterProvider set up by knative's
sharedmain, with race-free entries tracking via callback gauge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lback accumulation

Each NewLRUCache call was registering a new observable gauge callback on
the global meter with no way to unregister, causing callbacks to
accumulate across test runs. Move gauge ownership into the LRUCache
struct with a Registration field and Close() method, eliminating the
global cacheEntriesLenFunc state entirely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All metric instrument definitions now live together in metrics.go.
registerEntriesGauge is a plain function that returns metric.Registration,
and LRUCache still owns the lifecycle via its gaugeRegistration field
and Close() method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>