
Conversation

@jhrozek (Contributor) commented Oct 28, 2025

Problem

Many backend services use static authentication (API keys, tokens) instead of OAuth/OIDC. Our token exchange middleware works great for OAuth-compatible backends, but we can't handle scenarios where users need their own individual credentials.

Right now, pkg/secrets resolves secrets once at workload startup and injects them as environment variables. This means secrets are shared across all users - there's no way to isolate credentials per user or track which user accessed what.

Use Cases

Per-User Algolia Admin Keys

When multiple developers manage Algolia search indices through an MCP server, all modifications currently appear to come from one shared account. There's no way to audit who made which changes or enforce individual permissions.

With secret injection, each user's admin key is stored in Vault under their identity. When Alice modifies an index, the middleware authenticates to Vault as Alice and retrieves her specific key. Now we get proper attribution and a complete audit trail.

Multi-Tenant SaaS

Running a single MCP server for multiple customers breaks down when you need tenant-specific credentials. One shared API key for Jira or Salesforce means all tenant data gets accessed through a single account - no isolation.

The solution is storing each tenant's API key in Vault (like secret/tenants/acme-corp/jira-api-key). The user's JWT contains their tenant claim, and the middleware fetches the right credential for that tenant.

Centralized Secret Management

Static secrets in environment variables don't meet enterprise security requirements. Security teams want centralized policies, rotation, and audit logs.

This middleware lets us store API keys only in Vault with proper access controls. The proxy fetches them dynamically, and every access is logged with the user's identity.

How It Works

The middleware sits in the proxy chain after token exchange:

  1. Uses the JWT token to authenticate to Vault
  2. Fetches the user-specific secret from the configured path
  3. Injects it into request headers before forwarding to the backend
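A rough sketch of what such a middleware could look like in Go is below; the `fetchSecret` signature, the header name, and the way the JWT is pulled off the inbound request are illustrative assumptions, not the proposal's final API:

```go
package secretinject

import (
	"context"
	"net/http"
	"strings"
)

// fetchSecret stands in for the Vault-backed fetcher: it authenticates with
// the caller's JWT and returns the secret stored at the given path.
type fetchSecret func(ctx context.Context, jwt, path string) (string, error)

// InjectSecret wires the three steps above into a standard http.Handler middleware.
func InjectSecret(fetch fetchSecret, path, header string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// 1. Reuse the JWT that earlier auth middleware already validated.
		jwt := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")

		// 2. Fetch the user-specific secret from the configured path.
		secret, err := fetch(r.Context(), jwt, path)
		if err != nil {
			http.Error(w, "failed to resolve backend credential", http.StatusBadGateway)
			return
		}

		// 3. Inject it into the request headers before forwarding to the backend.
		r.Header.Set(header, secret)
		next.ServeHTTP(w, r)
	})
}
```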

In Phase 1, we support static paths. Phase 2 adds Go templating so paths can include user claims like secret/users/{{.email}}/api-key.
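For Phase 2, the path rendering could be as simple as running the configured path through Go's `text/template` with the user's verified claims. This is only a sketch (the claim names and the `missingkey=error` policy are assumptions), and the rendered value would still need validation before being used as a Vault path, since claims are user-influenced input:

```go
package secretinject

import (
	"bytes"
	"text/template"
)

// renderSecretPath expands a path template such as
// "secret/users/{{.email}}/api-key" against the user's verified JWT claims.
func renderSecretPath(pathTemplate string, claims map[string]any) (string, error) {
	// missingkey=error fails the request instead of silently producing a
	// path containing "<no value>" when a claim is absent.
	tmpl, err := template.New("secret-path").Option("missingkey=error").Parse(pathTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, claims); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```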

Design Decisions

We're using a generic SecretFetcher interface with Vault as the first implementation. This keeps the door open for AWS Secrets Manager or Azure Key Vault later.
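A minimal sketch of what that interface might look like (the exact method set is an assumption, not the settled design):

```go
package secretinject

import "context"

// SecretFetcher abstracts the backing secret manager so Vault can be the
// first implementation and AWS Secrets Manager or Azure Key Vault can be
// added later without changing the middleware.
type SecretFetcher interface {
	// Fetch authenticates with the caller's identity token and returns the
	// secret stored at the given provider-specific path.
	Fetch(ctx context.Context, identityToken, path string) (string, error)
}
```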

The Vault integration uses JWT authentication so each user's identity flows through to secret access. We're implementing the Vault client with direct HTTP calls instead of importing their SDK to avoid BSL 1.1 licensing issues.
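To illustrate the HTTP-only approach, here is a sketch against Vault's standard endpoints (`POST /v1/auth/jwt/login`, then a KV read under `/v1/<path>`). The role handling, the assumption that the secret payload lives under a `value` key, and the minimal error handling are all simplifications, not the final implementation:

```go
package secretinject

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// vaultFetcher talks to Vault's HTTP API directly so the BSL-licensed SDK is
// never imported.
type vaultFetcher struct {
	addr   string // e.g. https://vault.example.com
	role   string // Vault JWT auth role
	client *http.Client
}

// Fetch exchanges the user's JWT for a Vault token, then reads the secret at
// path (the full API path, e.g. "secret/data/users/alice/pat" for KV v2).
func (v *vaultFetcher) Fetch(ctx context.Context, jwt, path string) (string, error) {
	// 1. JWT login: POST /v1/auth/jwt/login returns a short-lived client token.
	body, _ := json.Marshal(map[string]string{"jwt": jwt, "role": v.role})
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		v.addr+"/v1/auth/jwt/login", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	resp, err := v.client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("vault login failed: %s", resp.Status)
	}
	var login struct {
		Auth struct {
			ClientToken string `json:"client_token"`
		} `json:"auth"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&login); err != nil {
		return "", err
	}

	// 2. Read the secret with the per-user token; KV v2 nests the payload
	// under data.data.
	req, err = http.NewRequestWithContext(ctx, http.MethodGet, v.addr+"/v1/"+path, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("X-Vault-Token", login.Auth.ClientToken)
	resp, err = v.client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("vault read failed: %s", resp.Status)
	}
	var secret struct {
		Data struct {
			Data map[string]string `json:"data"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&secret); err != nil {
		return "", err
	}
	val, ok := secret.Data.Data["value"]
	if !ok {
		return "", fmt.Errorf("secret at %s has no \"value\" key", path)
	}
	return val, nil
}
```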

Two-phase delivery: static paths first to prove the pattern, then templating for per-user isolation. This gets us something useful quickly while reducing risk.

This is complementary to pkg/secrets, not a replacement. Use pkg/secrets for shared startup credentials like database URLs. Use secret injection for per-user request-time credentials like personal API keys.

What's In The Proposal

  • Detailed Vault setup with security controls (bound_issuer, bound_audiences, proper TTLs)
  • CLI flag design and Kubernetes CRD examples
  • Comparison table showing how this differs from pkg/secrets
  • Testing strategy for unit, integration, and operator tests
  • Sequence diagrams showing the request flow

Refs: THV-2063 (token exchange middleware)

Introduces a design proposal for dynamic secret fetching and injection
into MCP proxy requests using HashiCorp Vault or other secret providers.

This proposal addresses the need for per-user credential isolation when
backend services use static authentication (API keys, tokens) rather than
OAuth/OIDC. While ToolHive's existing token exchange middleware handles
OAuth-compatible backends, many legacy APIs and SaaS tools require static
credentials that should be:

- Stored securely in centralized secret managers (Vault, AWS Secrets Manager)
- Isolated per user or tenant for proper attribution and audit trails
- Fetched dynamically at request time based on user identity

Key design elements:
- Generic SecretFetcher interface with Vault as primary implementation
- Uses JWT authentication to Vault for per-user secret access
- Integrates with existing middleware chain after token exchange
- HTTP-only Vault client to avoid BSL 1.1 licensing concerns
- Phased delivery: static paths (Phase 1), Go templating (Phase 2)
- Supports both CLI flags and Kubernetes CRD configuration
- Complementary to existing pkg/secrets (startup-time workload secrets)

The proposal includes:
- Three concrete use cases (Algolia admin keys, multi-tenant SaaS, centralized secret management)
- Detailed Vault setup instructions with security controls
- Comparison with existing pkg/secrets system
- Testing strategy for unit, integration, and operator tests

Refs: THV-2063 (token exchange middleware)
@jhrozek marked this pull request as draft October 28, 2025 21:43
codecov bot commented Oct 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.25%. Comparing base (5805898) to head (6b450c5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2357      +/-   ##
==========================================
- Coverage   54.27%   54.25%   -0.02%     
==========================================
  Files         242      242              
  Lines       23446    23446              
==========================================
- Hits        12725    12721       -4     
- Misses       9506     9516      +10     
+ Partials     1215     1209       -6     

☔ View full report in Codecov by Sentry.

### 4. Path Handling for KV v2

Users must provide the full API path, including `/data/`, for KV v2 secrets. For example, to access a secret at logical path `users/alice/pat` in a KV v2 engine mounted at `secret/`, users specify `secret/data/users/alice/pat` in the configuration. This matches the HTTP API endpoint structure (`GET /v1/secret/data/users/alice/pat`) and aligns with how other Vault integrations (CSI driver, Vault Agent) handle paths. The middleware does not auto-inject `/data/` because:

- it uses the HTTP API directly, where `/data/` is part of the endpoint,
- this prevents ambiguity and double-injection bugs, and
- it provides explicit clarity for Phase 2 templating, where paths like `secret/data/users/{{.email}}/api-key` show the exact API structure.
Collaborator:

While templating makes sense in this scenario, how will validation be handled? Do we fully trust what comes from the claims or what's put in the templates?


**Environment variable fallback strategy**: When a Vault option is not explicitly provided via `--secret-provider-opts`, the code falls back to standard Vault environment variables (`VAULT_ADDR`, `VAULT_NAMESPACE`, `VAULT_CACERT`). This maintains compatibility with existing Vault workflows and tooling.

**Option parsing**: The `--secret-provider-opts` flag accepts both comma-separated key=value pairs and multiple flag invocations. A custom `pflag.Value` implementation parses the options into a map, which is then validated against the provider's schema to ensure all required options are present (either from flags or environment variables).
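A sketch of how that `pflag.Value` and the environment-variable fallback could fit together; the option key names (e.g. `vault-addr`) are illustrative, only the flag name and the environment variables come from the proposal. The value would be registered on the command with something like `cmd.Flags().Var(opts, "secret-provider-opts", ...)`.

```go
package secretinject

import (
	"fmt"
	"os"
	"strings"
)

// providerOpts collects --secret-provider-opts values into a map. It accepts
// both comma-separated key=value pairs and repeated flag invocations, and it
// satisfies the pflag.Value interface (String, Set, Type).
type providerOpts map[string]string

func (o providerOpts) String() string { return fmt.Sprintf("%v", map[string]string(o)) }
func (o providerOpts) Type() string   { return "key=value" }

func (o providerOpts) Set(val string) error {
	for _, pair := range strings.Split(val, ",") {
		k, v, ok := strings.Cut(pair, "=")
		if !ok {
			return fmt.Errorf("invalid option %q, expected key=value", pair)
		}
		o[strings.TrimSpace(k)] = strings.TrimSpace(v)
	}
	return nil
}

// resolveVaultAddr illustrates the fallback order: an explicit option wins,
// otherwise the standard VAULT_ADDR environment variable is used.
func resolveVaultAddr(opts providerOpts) (string, error) {
	if addr, ok := opts["vault-addr"]; ok && addr != "" {
		return addr, nil
	}
	if addr := os.Getenv("VAULT_ADDR"); addr != "" {
		return addr, nil
	}
	return "", fmt.Errorf("vault address not set via --secret-provider-opts or VAULT_ADDR")
}
```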
Collaborator:

would we want to do something like a gRPC interface so we could potentially plug in any sort of provider?


**Performance Considerations**: Phase 1 adds approximately 150ms latency per request (100ms for JWT auth + 50ms for secret fetch) according to typical Vault performance characteristics; Phase 2's optional caching can reduce this to <1ms for cache hits, making it suitable for high-throughput production deployments.

**Open Questions**: We need to decide whether to auto-detect KV v1 vs v2 and inject `/data/` automatically (with logging) or require users to provide full paths; additionally, we should determine if Phase 2 caching should cache Vault tokens (reducing Vault load) or secrets (reducing overall latency).
Collaborator:

should we just support one of these?
