S3 archival: support EKS Pod Identity alongside IRSA #985

@talsuk5

Problem

The S3 archival feature currently supports only IRSA (IAM Roles for Service Accounts) on EKS. When spec.archival.provider.s3.roleName is set, the operator annotates all Temporal service accounts with eks.amazonaws.com/role-arn. The IRSA webhook then injects web identity credentials into the pods, and the AWS SDK calls AssumeRoleWithWebIdentity.

This doesn't work on clusters using EKS Pod Identity (the newer, recommended mechanism), because:

  1. The validating webhook requires roleName or credentials — there's no way to opt out and let the default credential chain handle it (which is how Pod Identity works).
  2. Even if the IAM role exists with a Pod Identity trust policy (pods.eks.amazonaws.com), the operator forces IRSA by adding the service account annotation. The Temporal server then calls AssumeRoleWithWebIdentity, which fails with AccessDenied because that trust policy does not permit web identity federation.
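
To check which mechanism a given pod will end up using, it can help to look at the environment variables injected into it. The sketch below is a hypothetical diagnostic helper, not part of the operator. It assumes that the IRSA webhook injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, that the Pod Identity agent injects AWS_CONTAINER_CREDENTIALS_FULL_URI, and that the SDK's default chain tries web identity before container credentials. The exact ordering may differ between SDK versions.

```python
def detect_credential_source(env):
    """Guess which AWS credential source a pod's environment selects.

    `env` is a mapping of variable names to values, for example os.environ
    or the env section of a pod spec flattened into a dict.
    The precedence below is an assumption about the SDK default chain.
    """
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "static-credentials"
    # Variables injected by the IRSA webhook when the service account
    # carries the eks.amazonaws.com/role-arn annotation.
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return "irsa"
    # Variable injected by the EKS Pod Identity agent.
    if env.get("AWS_CONTAINER_CREDENTIALS_FULL_URI"):
        return "pod-identity"
    return "other"
```

Under these assumptions, a pod that has both the IRSA variables and the Pod Identity variable is classified as "irsa", which matches the failure described above: the annotation wins and Pod Identity is never consulted.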

Observed behavior

With Pod Identity configured and roleName set:

operation: RegisterNamespace
error: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity

The Temporal frontend returns this error during RegisterNamespace (it validates the archival configuration at that point), but the operator surfaces it only as context deadline exceeded, which makes the root cause very hard to diagnose.
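
As a rough illustration of what better error surfacing could look like, here is a minimal sketch of a retry loop that keeps the last underlying failure attached to the deadline error. It assumes the namespace controller retries RegisterNamespace until a deadline expires, which this issue does not confirm. The names DeadlineExceeded and call_until_deadline are hypothetical and do not come from the operator.

```python
import time


class DeadlineExceeded(Exception):
    """Deadline error that carries the last underlying failure, if any."""

    def __init__(self, operation, last_error=None):
        self.operation = operation
        self.last_error = last_error
        message = f"{operation}: context deadline exceeded"
        if last_error is not None:
            message += f" (last error: {last_error})"
        super().__init__(message)


def call_until_deadline(operation, fn, timeout_s, interval_s=0.01):
    """Retry fn until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    last_error = None
    while time.monotonic() < deadline:
        try:
            return fn()
        except Exception as exc:
            # Remember the real failure instead of discarding it.
            last_error = exc
            time.sleep(interval_s)
    raise DeadlineExceeded(operation, last_error) from last_error
```

With something along these lines, the reported message would include the AccessDenied text from the frontend rather than the bare timeout.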

Expected behavior

The operator should support EKS Pod Identity for S3 archival. Possible approaches:

  1. Make roleName optional in the webhook validation for S3 on EKS — allow users to rely on the default AWS credential chain (Pod Identity, instance profile, etc.) without requiring roleName or credentials.
  2. Add a useDefaultCredentials: true option (or similar) to explicitly opt into the default credential chain.
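
The second approach could be validated roughly as follows. This is a sketch of the rule only, written in Python for brevity; it is not the operator's webhook code. The useDefaultCredentials field is the hypothetical option proposed above, and the spec is modelled as a plain dict.

```python
def validate_s3_archival(s3):
    """Return a list of validation errors for a hypothetical S3 archival spec.

    `s3` stands in for spec.archival.provider.s3 as a dict.
    """
    has_role = bool(s3.get("roleName"))
    has_credentials = bool(s3.get("credentials"))
    use_default = bool(s3.get("useDefaultCredentials"))  # proposed field

    errors = []
    if not (has_role or has_credentials or use_default):
        errors.append(
            "one of roleName, credentials, or useDefaultCredentials is required"
        )
    if use_default and (has_role or has_credentials):
        errors.append(
            "useDefaultCredentials cannot be combined with roleName or credentials"
        )
    return errors
```

An explicit flag keeps the existing "roleName or credentials" check intact for current users while giving Pod Identity clusters a way to opt out. The first approach would instead drop the "required" check entirely.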

Additional notes

  • The operator adds the eks.amazonaws.com/role-arn annotation to all service accounts (frontend, history, matching, worker), not just those that need S3 access. As a result, all pods attempt IRSA auth, even though only history and frontend need S3 access for archival.
  • The context deadline exceeded error from the namespace controller obscures the real failure (AccessDenied on AssumeRoleWithWebIdentity). Better error propagation from the frontend's RegisterNamespace response would help debugging.
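
The first note suggests scoping the annotation to the services that need it. A minimal sketch of that selection logic, assuming (per the note above) that only frontend and history need S3 access, and using a hypothetical use_default_credentials switch to skip the annotation entirely:

```python
IRSA_ANNOTATION = "eks.amazonaws.com/role-arn"

# Assumption taken from the note above: only these services touch S3.
ARCHIVAL_SERVICES = frozenset({"frontend", "history"})


def service_account_annotations(service, role_arn, use_default_credentials=False):
    """Return the IRSA annotations a service account should carry.

    Returns an empty dict when the default credential chain is requested,
    when no role is configured, or when the service does not need S3.
    """
    if use_default_credentials or not role_arn:
        return {}
    if service not in ARCHIVAL_SERVICES:
        return {}
    return {IRSA_ANNOTATION: role_arn}
```

With this rule, matching and worker pods would carry no IRSA annotation, and a cluster that relies on Pod Identity would get no annotation on any service account.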

Environment

  • Operator version: v0.22.0
  • Temporal server: 1.28.0
  • EKS Auto Mode with Pod Identity
  • Kubernetes: 1.35
