
Upgrading to any version beyond 1.12 we get an expired token error when backing up data using the data mover after 1 hr with IRSA #8173


Open
dharanui opened this issue Sep 1, 2024 · 25 comments · May be fixed by #8970

Comments

@dharanui

dharanui commented Sep 1, 2024

velero version: 1.14.1
error: async write error: "unable to write content chunk 96 of FILE:000002: mutable parameters: unable to read format blob: error getting kopia.repository blob: The provided token has expired: mutable parameters: unable to read format blob: error getting kopia.repository blob: The provided token has expired"

The DataUploads are failing after almost one hour of running.
I also tried increasing the repo maintenance frequency, but no luck.

@Lyndon-Li
Contributor

Looks like the token used to access the object store has expired.

@dharanui
Author

dharanui commented Sep 2, 2024

Does it expire every hour? DataUploads that take less than an hour run and complete; the ones that take longer get cancelled. In the node-agent logs we see this error at that time.

@Lyndon-Li
Contributor

The expiration time of the token is not set by Velero, so you need to check how the token was created.

@dharanui
Author

dharanui commented Sep 2, 2024

But we were not getting this issue in 1.12.

@dharanui
Author

dharanui commented Sep 2, 2024

We use IRSA and I see the IAM token is valid for 24h.

volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token

This commit seems to be relevant: https://github.com/vmware-tanzu/velero/pull/7374/files ??

@Lyndon-Li
Contributor

Lyndon-Li commented Sep 4, 2024

> We use IRSA and I see the IAM token is valid for 24h.
>
> volumes:
>   - name: aws-iam-token
>     projected:
>       defaultMode: 420
>       sources:
>       - serviceAccountToken:
>           audience: sts.amazonaws.com
>           expirationSeconds: 86400
>           path: token
>
> This commit seems to be relevant: https://github.com/vmware-tanzu/velero/pull/7374/files ??

Why is that commit related? Have you specified BSL->credentialFile?

@dharanui
Author

dharanui commented Sep 4, 2024

Oops, sorry, no, we don't use credentialFile.
It is not giving that error now that we rolled back to 1.12.
Could it be that the repository maintenance job is recreating the token, or something along those lines?

Or maybe it's the Kopia version change that comes with the Velero upgrade?

@Lyndon-Li
Contributor

Neither Velero nor Kopia could have changed the token being used; I guess there might be another token specified. We also have test cases for IRSA, but we haven't seen this problem there.

@SCLogo

SCLogo commented Sep 25, 2024

The issue also happens with Velero 1.13.2 with the data mover.

@dharanui
Author

dharanui commented Sep 25, 2024

@Lyndon-Li this was working fine until 1.12 and started happening after upgrading to 1.13 and also 1.14. Do we know what has changed since 1.12? This is currently blocking us from upgrading to 1.14.

@catalinpan

As mentioned above, I'm getting the same error for restores that take longer than 1h. The restore fails based on fsBackupTimeout, so the error is not detected by the restore process.

I'm using the below images with an IAM role and IRSA:

On the restore-wait init container, this message shows up in a loop:

The filesystem restore done file /restores/data/.velero/file123 is not found yet. Retry later.

In the node-agent, this message shows up:

time="2024-10-02T23:20:34Z" level=error msg="Async fs restore data path failed" controller=PodVolumeRestore error="Failed to run kopia restore: Failed to copy snapshot data to the target: restore error: copy file: error creating file: cannot write data to file %q /host_pods/a2e48cae-8c75-4971-abb0-cbadb80674c8/volumes/kubernetes.io~csi/pvc-d38b075b-f1f3-4c59-8384-15f9d25fa782/mount/export/2024-Jul-12--0100.zip: unexpected content error: error getting cached content from blob \"pb3f655d8f0c66aa9377a3d660c143a45-s83fa0c09e23487b612d\": failed to get blob with ID pb3f655d8f0c66aa9377a3d660c143a45-s83fa0c09e23487b612d: The provided token has expired" logSource="pkg/controller/pod_volume_restore_controller.go:332" pvr=pvc-20241002183056-20241002221832khkt

The restore worked without any issues when downgraded to the versions below:

  • velero:v1.12.4
  • velero/velero-plugin-for-aws:v1.8.0
  • velero/velero-restore-helper:v1.10.2

Hope this will help a bit.

@dharanui
Author

dharanui commented Oct 4, 2024

Thanks @catalinpan.
We are using CSI snapshot data movement (https://velero.io/docs/main/csi-snapshot-data-movement/) instead of FSB.
For us, the backup itself fails if it runs beyond one hour on Velero v1.14.1. Downgrading to 1.12 made the backups work.

Is there any workaround to make this work in 1.14?

@SCLogo

SCLogo commented Oct 8, 2024

We use Velero 1.14.1,
AWS plugin 1.10.1,
with kube2iam 3600s tokens,
and CSI backup with the data mover (Kopia).
What I probably see in the logs is that Velero or Kopia does not request a new token when the current one expires; it just fails (cancelled).
Kopia requests an AWS token via kube2iam at 12:03:46. It starts the upload and finishes. One hour later (we use hourly backups), another DataUpload request is created (2024-10-08T14:02:49Z) for the same resources; it exits with a token-has-expired error (2024-10-08T14:02:55Z) and a new token is requested (14:03:25).
Could it be made so that, before it fails with the expired-token error, it just tries to request a new token?

@Lyndon-Li
Contributor

This may be the expected behavior for now: multiple DUs may be created at the same time but are processed one by one. If the first DU takes more than 1 hour, the second one's token will time out.
The data mover pod doesn't support IRSA; this may be the cause.

@SCLogo

SCLogo commented Oct 9, 2024

Those are two different backups. The first finishes without issue. The second starts and its DU was created earlier than the last run, so it gets the old key that expires soon, and the DU gets cancelled. If Velero would try to get a new key before exiting with an error, this problem could not come up. If I reduce the duration of the key I can just hide the issue, but once a DU needs more time than I set, I need to set a higher duration.

@SCLogo

SCLogo commented Oct 9, 2024

The default duration for an IAM role is 1 hour; we use that one.

@dharanui
Author

Does increasing the default duration help in this case? @SCLogo

@SCLogo

SCLogo commented Nov 19, 2024 via email

@dharanui
Author

dharanui commented Dec 18, 2024

Hi @SCLogo / @Lyndon-Li, can you help me with how to override DurationSeconds while Velero is performing AssumeRole? I am using IRSA. Updating maxSessionDuration on the role is not helping, because the default duration when assuming a role is 1 hr.
https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html

According to aws/aws-cli#9021 there is no environment variable for that currently.
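
For reference, the session length is something the caller sets on the STS request itself via DurationSeconds, not something the role or an environment variable controls on its own. Below is a minimal sketch with the AWS SDK for Go v2, not Velero's code; the role ARN, session name, and token path are placeholders, and the longer duration is only granted if the role's maxSessionDuration allows it.

package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Read the projected IRSA web identity token (default EKS mount path).
	token, err := os.ReadFile("/var/run/secrets/eks.amazonaws.com/serviceaccount/token")
	if err != nil {
		log.Fatal(err)
	}

	// Explicitly ask STS for a longer session than the 1 hr default.
	out, err := sts.NewFromConfig(cfg).AssumeRoleWithWebIdentity(ctx, &sts.AssumeRoleWithWebIdentityInput{
		RoleArn:          aws.String("arn:aws:iam::<account-id>:role/<role-name>"), // placeholder
		RoleSessionName:  aws.String("velero-datamover"),                           // placeholder
		WebIdentityToken: aws.String(string(token)),
		DurationSeconds:  aws.Int32(14400), // 4 hours, capped by the role's maxSessionDuration
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("session expires at:", aws.ToTime(out.Credentials.Expiration))
}

So the 24h expirationSeconds on the projected service account token only bounds the web identity token itself; the temporary AWS credentials come from the AssumeRoleWithWebIdentity call and default to 1 hr unless DurationSeconds is raised there.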

@SCLogo

SCLogo commented Dec 21, 2024

@dharanui, I am using kube2iam. The default max duration is 1 hour, but kube2iam asks for 30-minute temporary roles.
If you pass iam-role-session-ttl: 1600s, then kube2iam will ask for ~53 minutes because of a bug/feature (jtblin/kube2iam#240), see https://www.bluematador.com/blog/iam-access-in-kubernetes-kube2iam-vs-kiam
If you need more time, you need to set the max session duration higher.

@dharanui
Author

Hi @Lyndon-Li / @SCLogo, any idea when this will be fixed so that we can make it work with IRSA?

@lestich

lestich commented Feb 24, 2025

Hello everyone,

We recently encountered the same issue. Velero had been working correctly on AWS, but it suddenly stopped. After investigating, we discovered that Velero does not refresh its token, which causes the maintenance jobs to stop running. Unfortunately, there are no error messages or warnings to indicate this problem.

We run hourly backups with a 24-hour TTL, so when the cleanup jobs don’t run, those backups accumulate and are never deleted. This eventually overwhelms the reconcilers, leading to failures in subsequent backups.

We updated to version 1.15.2 along with the AWS plugin 1.11.1, but the issue persists. Has there been any progress or additional information regarding this problem?

Thank you in advance for any insights you can provide.

@filipe-silva-magalhaes-alb

+1

DZDomi added commits to DZDomi/velero that referenced this issue on May 22, 2025.
@DZDomi

DZDomi commented May 22, 2025

Hey guys!

I was running into the same issue and spent some time debugging. After looking at the git commit history for the AWS credentials provider, I found the culprit. Specifically, the following commit caused all credentials to expire: 30728c2 (from version 1.12 -> 1.13)

In this commit the following check was removed:

if os.Getenv(awsRoleEnvVar) != "" {
    return nil, nil
}

By removing this check, all code that calls this function always receives AWS credentials, even ones that can expire. This applies to all code that uses IAM roles or any other form of temporary credentials. The current implementation of the function creates one-time credentials, either from static AWS access key/secret pairs or by assuming a role, and returns them to the caller. Because the caller is handed static credentials that expire after x amount of time, every data uploader/downloader can fail after x amount of time with the error in this ticket.

You can see that the one-time credentials are passed here (and then never refreshed):
https://github.com/vmware-tanzu/velero/blob/main/pkg/repository/provider/unified_repo.go#L492

which in turn reaches the Kopia lib, for example here:

c.options.AccessKeyID = optionalHaveString(udmrepo.StoreOptionS3KeyID, flags)

Kopia therefore receives credentials that will expire after x amount of time, instead of the code just relying on the AWS SDK, which handles expiring tokens.
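
To make the difference concrete, here is a minimal sketch (not Velero's actual code) of the two approaches with the AWS SDK for Go v2: keeping the refreshing provider around versus resolving the credentials once into plain strings, which is effectively what reaches Kopia today.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
)

func main() {
	ctx := context.Background()

	// LoadDefaultConfig wires up a refreshing credentials provider; with IRSA this is
	// typically a web-identity provider behind a credentials cache, so anything that
	// keeps using cfg.Credentials transparently picks up renewed credentials.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	refreshing := cfg.Credentials

	// The problematic path instead resolves the credentials a single time. The returned
	// access key / secret / session token are plain strings; once the session expires
	// (1 hr by default) nothing refreshes them, which matches the "provided token has
	// expired" failures after ~1 hour seen in this issue.
	snapshot, err := refreshing.Retrieve(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("one-time credentials expire at:", snapshot.Expires, "canExpire:", snapshot.CanExpire)
}

The gist: whichever component ends up holding only the resolved strings, rather than a provider, is stuck with credentials that go stale.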

I already tried a modified version of my PR in our environment and it now works perfectly. I will make sure the current PR also works as expected and will update it by tomorrow at the latest if any change is needed.

Ref: #8970

@DZDomi

DZDomi commented May 27, 2025

Hey guys, quick update: I tested the changes from my PR on our cluster and I can confirm it does not time out anymore after 1 hour (see screenshot). We are using IRSA; container config:

env:
  - name: AWS_STS_REGIONAL_ENDPOINTS
    value: regional
  - name: AWS_DEFAULT_REGION
    value: us-west-2
  - name: AWS_REGION
    value: us-west-2
  - name: AWS_ROLE_ARN
    value: arn:aws:iam::<account-id>:role/<role-name>
  - name: AWS_WEB_IDENTITY_TOKEN_FILE
    value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token

A screenshot from a job that was running longer than 1 hour:

[screenshot]

@Lyndon-Li can we get this merged in, so other people can get the fix in the latest patch release? Thanks!
