Security: in-cluster communication audit findings (alpha hardening) #110

@Agent-Hellboy

Internal-communication security review of the platform as it would run in a cluster. Tracks hardening work; exploit specifics intentionally omitted from this public issue. If any item warrants coordinated disclosure, move that item to a private GitHub Security Advisory.

Static analysis only; items marked "runtime check" need verification against a running cluster.

Critical

  • C1. Gateway can be bypassed; MCP servers trust any cluster pod. Upstream calls in services/mcp-proxy/main.go:166-167 use plain http.DefaultTransport; identity is forwarded as plain headers (services/mcp-proxy/main.go:765-784). Combined with C2, any pod can reach an MCP server directly with arbitrary identity headers. Fix: mTLS or HMAC-signed identity headers between gateway and upstream, plus NetworkPolicy.
  • C2. Zero NetworkPolicies in repo. find . -name '*.yaml' | xargs grep -l 'kind: NetworkPolicy' → 0 matches. Add default-deny + explicit allow per namespace (mcp-sentinel, mcp-runtime, mcp-servers, registry).
  • C3. Default header-mode auth has no verification. services/mcp-proxy/main.go:546-553 reads humanID/agentID/sessionID from headers without a signature, MAC, or JWT check. OAuth mode (policy.Auth.Mode == "oauth", lines 657-688) does verify the JWT. Recommend defaulting to OAuth in production policies and gating header mode behind an explicit dev flag.
  • C4. Only `tools/call` is policy-gated. services/mcp-proxy/main.go:827-829 calls policypkg.IsToolCallMethod, which matches only `tools/call` and `call_tool` (pkg/policy/helpers.go:9-16). `tools/list`, `prompts/`, `resources/`, `completion/complete` bypass grants entirely. At minimum extend gating to `tools/list` and the resources/prompts methods.
  • C5. Ingest fails open when both `API_KEYS` and `OIDC_JWKS_URL` are unset. services/ingest/main.go:247-250 returns next handler with no auth in that case. Flip to fail-closed and refuse to start when no auth source is configured.
  • C6. Gateway ClusterRole grants `get/list/watch` on `secrets` cluster-wide. k8s/10-gateway.yaml:402. Traefik does not need this. Drop `secrets` from the resources list.

High

  • H1. API ServiceAccount has cluster-wide Deployment CRUD + user/group impersonate. k8s/08-api-rbac.yaml:23,38-39. Reduce deployment verbs to `["patch"]`; remove the `impersonate` rule unless it is actively used (and if so, scope it).
  • H2. Plain HTTP everywhere; Kafka `PLAINTEXT`. No mTLS / Istio / Linkerd / SPIFFE config in repo. Adopt a service mesh or, at minimum, TLS for Kafka and the analytics path.
  • H3. Bundled Docker registry has no authentication. config/registry/base/ingress.yaml. Any pod can push images. Add htpasswd/OAuth in front of the registry, or restrict via NetworkPolicy + auth proxy. Runtime check for any prod overlay that already adds auth.
  • H4. Prometheus and Grafana exposed via gateway with no auth. k8s/10-gateway.yaml:525-538. k8s/02-secrets.yaml.example:29 ships `changeme` placeholder for Grafana admin. Gate both paths behind auth middleware and reject placeholder credentials at setup time.
  • H5. Shared API key model — no per-user isolation. API_KEYS is reused across UI/API/ingest/mcp-proxy; UI proxies upstream as a single shared key (services/ui/main.go:238-239). When `ADMIN_API_KEYS` is unset, all keys are admin (services/api/main.go:614). Move toward per-user/per-service credentials with attribution in audit logs.
  • H6. No replay/idempotency on tool calls. Add an `Idempotency-Key` (or JSON-RPC `id` dedup) for state-changing tool calls.

Medium

  • M1. API can `delete`/`deletecollection` namespaces and NetworkPolicies cluster-wide — k8s/08-api-rbac.yaml:30-35. Make namespaces read-only for the API SA.
  • M2. Operator has cluster-wide `secrets:get` with no `resourceNames` filter — config/rbac/role.yaml:29-31. Restrict to the specific secrets it manages.
  • M3. `PLATFORM_ADMIN_PASSWORD` may persist in API env after bootstrap. Setup renders it (around internal/cli/setup.go:2137) but no code clears it post-bootstrap; CLAUDE.md documents a manual `kubectl patch`. Automate the cleanup.
  • M4. Policy reload is on a 5s timer (services/mcp-proxy/main.go:451), so revocations take up to 5s to apply. Consider a watch-based reload or a shorter interval for incident response.
  • M5. Traefik PII redactor preserves `Authorization`, `X-Internal-Auth`, `X-Api-Key` (services/traefik-plugins/pii-redactor/redactor.go:29). Per CLAUDE.md the redactor is dev-overlay only — confirm prod registry ingress does not reference `pii-redactor@file` (it would break the Docker API).
  • M6. Verify JWT `exp` is enforced in mcp-proxy OAuth path (services/mcp-proxy/main.go:657-683). Runtime / parser-options check.
  • M7. All services bind `0.0.0.0`. Acceptable with NetworkPolicy (C2); for sidecar-only services, prefer `127.0.0.1`.
  • M8. `/health` on API leaks `runtime_initialized` and `runtime_error` strings — services/api/main.go:207-220. Reduce to a static OK.

Low

  • L1. `/metrics` endpoints are unauthenticated on dedicated ports (acceptable only with NetworkPolicy).
  • L2. `x-forwarded-for` used as rate-limit key without trusted-proxy validation — services/ui/main.go:569-579.
  • L3. Example secrets file ships `change-me-now` / `changeme` placeholders; no setup-time guard rejects them.
  • L4. `automountServiceAccountToken: true` left default on Sentinel deployments.

Suggested fix order

  1. Default-deny NetworkPolicy per namespace + explicit allow rules (removes most of the exposure that amplifies C1, H2, M7, L1).
  2. Make ingest auth fail-closed (C5).
  3. Extend the policy gate beyond `tools/call` (C4).
  4. Tighten gateway and API RBAC (C6, H1, M1, M2).
  5. Sign the identity headers or use mTLS on the gateway→upstream hop (C1).
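Step 2 (fail-closed ingest auth, C5) could look roughly like the following startup check. The `authConfig` type and `loadAuthConfig` function are hypothetical, and treating `API_KEYS` as a comma-separated list is an assumption; the point is only that an empty configuration yields an error instead of an unauthenticated handler.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// authConfig holds the auth sources the ingest service could use.
type authConfig struct {
	apiKeys []string
	jwksURL string
}

var errNoAuthSource = errors.New("no auth source configured: set API_KEYS or OIDC_JWKS_URL")

// loadAuthConfig reads auth settings through getenv (os.Getenv in a real
// service) and fails closed when neither source is set.
func loadAuthConfig(getenv func(string) string) (authConfig, error) {
	var cfg authConfig
	// Assumes API_KEYS is comma-separated; blank entries are ignored.
	for _, k := range strings.Split(getenv("API_KEYS"), ",") {
		if k = strings.TrimSpace(k); k != "" {
			cfg.apiKeys = append(cfg.apiKeys, k)
		}
	}
	cfg.jwksURL = strings.TrimSpace(getenv("OIDC_JWKS_URL"))
	if len(cfg.apiKeys) == 0 && cfg.jwksURL == "" {
		return authConfig{}, errNoAuthSource
	}
	return cfg, nil
}

func main() {
	// Simulated environments so the example runs without real env vars.
	envs := []map[string]string{
		{},                        // nothing set: startup must be refused
		{"API_KEYS": "k1, k2"},    // static keys only
		{"API_KEYS": " , ", "OIDC_JWKS_URL": ""}, // effectively empty
	}
	for _, env := range envs {
		cfg, err := loadAuthConfig(func(k string) string { return env[k] })
		if err != nil {
			fmt.Println("refusing to start:", err)
			continue
		}
		fmt.Printf("ok: %d API key(s), JWKS set: %t\n", len(cfg.apiKeys), cfg.jwksURL != "")
	}
}
```

In a real service, `main` would pass `os.Getenv` and exit non-zero on error, so a misconfigured Deployment crash-loops visibly rather than serving unauthenticated traffic.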

Notes

  • Static analysis on `main`. Findings reference exact file:line at time of audit; please re-verify against current HEAD when implementing.
  • Repo is marked alpha (per CLAUDE.md); this issue is sanitized for public tracking. If any item should be treated as a coordinated-disclosure vulnerability, open a private GitHub Security Advisory and link it here without exploit detail.

Labels: security (security vulnerabilities, hardening, threat-model concerns)
