
Upgrading to any version beyond 1.12 we get an expired token error when backing up data using the data mover after 1 hr with IRSA #8173


Open
dharanui opened this issue Sep 1, 2024 · 25 comments · May be fixed by #8970

Comments

@dharanui

dharanui commented Sep 1, 2024

velero version: 1.14.1
error: async write error: "unable to write content chunk 96 of FILE:000002: mutable parameters: unable to read format blob: error getting kopia.repository blob: The provided token has expired: mutable parameters: unable to read format blob: error getting kopia.repository blob: The provided token has expired"

The DataUploads are failing after almost one hour of running.
I also tried increasing the repo maintenance frequency, but no luck.

@Lyndon-Li
Contributor

Looks like the token used to access the object store has expired.

@dharanui
Author

dharanui commented Sep 2, 2024

Does it expire every hour? DataUploads that take less than an hour run and complete; the ones that take longer get cancelled. In the node-agent logs we see this error at that time.

@Lyndon-Li
Contributor

The expiration time of the token is not set by Velero, so you need to check how the token was created.

@dharanui
Author

dharanui commented Sep 2, 2024

But we were not getting this issue in 1.12.

@dharanui
Author

dharanui commented Sep 2, 2024

We use IRSA and I see the IAM token is valid for 24h.

volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token

This commit seems to be relevant: https://github.com/vmware-tanzu/velero/pull/7374/files ??

@Lyndon-Li
Contributor

Lyndon-Li commented Sep 4, 2024

> We use IRSA and I see the IAM token is valid for 24h.
>
> volumes:
>   - name: aws-iam-token
>     projected:
>       defaultMode: 420
>       sources:
>       - serviceAccountToken:
>           audience: sts.amazonaws.com
>           expirationSeconds: 86400
>           path: token
>
> This commit seems to be relevant: https://github.com/vmware-tanzu/velero/pull/7374/files ??

Why is that commit related? Have you specified BSL->credentialFile?

@dharanui
Author

dharanui commented Sep 4, 2024

Oops, sorry, no, we don't use credentialFile.
It is not giving that error now that we rolled back to 1.12.
Could it be that the repository maintenance job is recreating the token, or something along those lines?

Or maybe it's the Kopia version change that comes with the Velero upgrade?

@Lyndon-Li
Contributor

Neither Velero nor Kopia could have changed the token being used; I guess there might be another token specified. We also have test cases for IRSA, but we haven't seen this problem there.

@SCLogo

SCLogo commented Sep 25, 2024

The issue also happens with Velero 1.13.2 with the data mover.

@dharanui
Author

dharanui commented Sep 25, 2024

@Lyndon-Li this was working fine until 1.12 and started happening after upgrading to 1.13 and also 1.14. Do we know what has changed since 1.12? This is currently blocking us from upgrading to 1.14.

@catalinpan

As mentioned above, I'm getting the same error for restores that take longer than 1h. The restore fails based on fsBackupTimeout, so the error is not detected by the restore process.

I'm using the below images with an IAM role and IRSA:

On the restore-wait init container, this message shows up in a loop:

The filesystem restore done file /restores/data/.velero/file123 is not found yet. Retry later.

In the node-agent, this message shows up:

time="2024-10-02T23:20:34Z" level=error msg="Async fs restore data path failed" controller=PodVolumeRestore error="Failed to run kopia restore: Failed to copy snapshot data to the target: restore error: copy file: error creating file: cannot write data to file %q /host_pods/a2e48cae-8c75-4971-abb0-cbadb80674c8/volumes/kubernetes.io~csi/pvc-d38b075b-f1f3-4c59-8384-15f9d25fa782/mount/export/2024-Jul-12--0100.zip: unexpected content error: error getting cached content from blob \"pb3f655d8f0c66aa9377a3d660c143a45-s83fa0c09e23487b612d\": failed to get blob with ID pb3f655d8f0c66aa9377a3d660c143a45-s83fa0c09e23487b612d: The provided token has expired" logSource="pkg/controller/pod_volume_restore_controller.go:332" pvr=pvc-20241002183056-20241002221832khkt

The restore worked without any issues when downgraded to the versions below:

  • velero:v1.12.4
  • velero/velero-plugin-for-aws:v1.8.0
  • velero/velero-restore-helper:v1.10.2

Hope this will help a bit.

@dharanui
Author

dharanui commented Oct 4, 2024

Thanks @catalinpan.
We are using CSI snapshot data movement (https://velero.io/docs/main/csi-snapshot-data-movement/) instead of FSB.
For us, the backup itself fails if it runs beyond one hour on Velero v1.14.1. Downgrading to 1.12 made the backups work.

Is there any workaround to make this work in 1.14?

@SCLogo

SCLogo commented Oct 8, 2024

We use Velero 1.14.1,
AWS plugin 1.10.1,
with kube2iam 3600s tokens,
and CSI backup with the data mover (Kopia).
What I probably see in the logs is that Velero or Kopia does not request a new token when the current one expires; it just fails (cancelled).
Kopia requests an AWS token via kube2iam at 12:03:46. It starts the upload and finishes. One hour later (we use hourly backups), another DataUpload request is created (2024-10-08T14:02:49Z) for the same resources; it exits with a token-has-expired error (2024-10-08T14:02:55Z) and a new token is requested (14:03:25).
Could it be made so that, before it fails with the expired-token error, it just tries to request a new token?

@Lyndon-Li
Contributor

This may be the expected behavior for now: multiple DUs may be created at the same time but are processed one by one. If the first DU takes more than 1 hour, the second one's token will time out.
The data mover pod doesn't support IRSA; this may be the cause.

@SCLogo

SCLogo commented Oct 9, 2024

Those are two different backups. The first finishes without issue. The second starts and its DU was created earlier than the last run, so it gets the old key that expires soon, and the DU gets cancelled. If Velero would try to get a new key before exiting with an error, this problem could not come up. If I reduce the duration of the key I can just hide the issue, but once a DU needs more time than I set, I need to set a higher duration.

@SCLogo

SCLogo commented Oct 9, 2024

The default duration for an IAM role is 1 hour; we use that one.

@dharanui
Author

Does increasing the default duration help in this case? @SCLogo

@SCLogo

SCLogo commented Nov 19, 2024 via email

@dharanui
Author

dharanui commented Dec 18, 2024

Hi @SCLogo / @Lyndon-Li, can you help me with how to override DurationSeconds while Velero is performing AssumeRole? I am using IRSA. Updating maxSessionDuration on the role is not helping, because the default duration when assuming a role is 1 hr.
https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html

According to aws/aws-cli#9021 there is no environment variable for that currently.
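
For reference, the session length is something the caller sets on the STS request itself via DurationSeconds, not something the role or an environment variable controls on its own. Below is a minimal sketch with the AWS SDK for Go v2, not Velero's code; the role ARN, session name, and token path are placeholders, and the longer duration is only granted if the role's maxSessionDuration allows it.

package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Read the projected IRSA web identity token (default EKS mount path).
	token, err := os.ReadFile("/var/run/secrets/eks.amazonaws.com/serviceaccount/token")
	if err != nil {
		log.Fatal(err)
	}

	// Explicitly ask STS for a longer session than the 1 hr default.
	out, err := sts.NewFromConfig(cfg).AssumeRoleWithWebIdentity(ctx, &sts.AssumeRoleWithWebIdentityInput{
		RoleArn:          aws.String("arn:aws:iam::<account-id>:role/<role-name>"), // placeholder
		RoleSessionName:  aws.String("velero-datamover"),                           // placeholder
		WebIdentityToken: aws.String(string(token)),
		DurationSeconds:  aws.Int32(14400), // 4 hours, capped by the role's maxSessionDuration
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("session expires at:", aws.ToTime(out.Credentials.Expiration))
}

So the 24h expirationSeconds on the projected service account token only bounds the web identity token itself; the temporary AWS credentials come from the AssumeRoleWithWebIdentity call and default to 1 hr unless DurationSeconds is raised there.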

@SCLogo

SCLogo commented Dec 21, 2024

@dharanui, I am using kube2iam. The default max duration is 1 hour, but kube2iam asks for 30-minute temporary roles.
If you pass iam-role-session-ttl: 1600s, then kube2iam will ask for ~53 minutes because of a bug/feature (jtblin/kube2iam#240), see https://www.bluematador.com/blog/iam-access-in-kubernetes-kube2iam-vs-kiam
If you need more time, you need to set the max session duration higher.

@dharanui
Author

Hi @Lyndon-Li / @SCLogo, any idea when this will be fixed so that we can make it work with IRSA?

@lestich

lestich commented Feb 24, 2025

Hello everyone,

We recently encountered the same issue. Velero had been working correctly on AWS, but it suddenly stopped. After investigating, we discovered that Velero does not refresh its token, which causes the maintenance jobs to stop running. Unfortunately, there are no error messages or warnings to indicate this problem.

We run hourly backups with a 24-hour TTL, so when the cleanup jobs don’t run, those backups accumulate and are never deleted. This eventually overwhelms the reconcilers, leading to failures in subsequent backups.

We updated to version 1.15.2 along with the AWS plugin 1.11.1, but the issue persists. Has there been any progress or additional information regarding this problem?

Thank you in advance for any insights you can provide.

@filipe-silva-magalhaes-alb

+1

DZDomi added commits to DZDomi/velero that referenced this issue on May 22, 2025.
@DZDomi

DZDomi commented May 22, 2025

Hey guys!

I was running into the same issue and spent some time debugging. After looking at the git commit history for the AWS credentials provider, I found the culprit. Specifically, the following commit caused all credentials to expire: 30728c2 (from version 1.12 -> 1.13)

In this commit the following check was removed:

if os.Getenv(awsRoleEnvVar) != "" {
    return nil, nil
}

By removing this check, all code that calls this function always receives AWS credentials, even ones that can expire. This applies to all code that uses IAM roles or any other form of temporary credentials. The current implementation of the function creates one-time credentials, either from static AWS access key/secret pairs or by assuming a role, and returns them to the caller. Because the caller is handed static credentials that expire after x amount of time, every data uploader/downloader can fail after x amount of time with the error in this ticket.

You can see that the one-time credentials are passed here (and then never refreshed):
https://github.com/vmware-tanzu/velero/blob/main/pkg/repository/provider/unified_repo.go#L492

which in turn reaches the Kopia lib, for example here:

c.options.AccessKeyID = optionalHaveString(udmrepo.StoreOptionS3KeyID, flags)

Kopia therefore receives credentials that will expire after x amount of time, instead of the code just relying on the AWS SDK, which handles expiring tokens.
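
To make the difference concrete, here is a minimal sketch (not Velero's actual code) of the two approaches with the AWS SDK for Go v2: keeping the refreshing provider around versus resolving the credentials once into plain strings, which is effectively what reaches Kopia today.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
)

func main() {
	ctx := context.Background()

	// LoadDefaultConfig wires up a refreshing credentials provider; with IRSA this is
	// typically a web-identity provider behind a credentials cache, so anything that
	// keeps using cfg.Credentials transparently picks up renewed credentials.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	refreshing := cfg.Credentials

	// The problematic path instead resolves the credentials a single time. The returned
	// access key / secret / session token are plain strings; once the session expires
	// (1 hr by default) nothing refreshes them, which matches the "provided token has
	// expired" failures after ~1 hour seen in this issue.
	snapshot, err := refreshing.Retrieve(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("one-time credentials expire at:", snapshot.Expires, "canExpire:", snapshot.CanExpire)
}

The gist: whichever component ends up holding only the resolved strings, rather than a provider, is stuck with credentials that go stale.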

I already tried a modified version of my PR in our environment and it now works perfectly. I will make sure the current PR also works as expected and will update it by tomorrow at the latest if any change is needed.

Ref: #8970

@DZDomi

DZDomi commented May 27, 2025

Hey guys, quick update: I tested the changes from my PR on our cluster and I can confirm it does not time out anymore after 1 hour (see screenshot). We are using IRSA; container config:

env:
  - name: AWS_STS_REGIONAL_ENDPOINTS
    value: regional
  - name: AWS_DEFAULT_REGION
    value: us-west-2
  - name: AWS_REGION
    value: us-west-2
  - name: AWS_ROLE_ARN
    value: arn:aws:iam::<account-id>:role/<role-name>
  - name: AWS_WEB_IDENTITY_TOKEN_FILE
    value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token

A screenshot from a job that was running longer than 1 hour:

[screenshot]

@Lyndon-Li can we get this merged in, so other people can get the fix in the latest patch release? Thanks!
