
parallel catchup v2 sometimes fails immediately on S3 failure #354


Description

@jayz22

In a recent mission, parallel catchup failed with the following message:

2025-12-17T09:12:18.582 GAJSL [default INFO] Performing maintenance
2025-12-17T09:12:18.582 GAJSL [History INFO] Trimming history <= ledger 28109119
2025-12-17T09:12:34.801 GAJSL [Process WARNING] process 7596 exited 1: aws s3 cp --region us-east-1 s3://ssc-history-archive/prd/core-live/core_live_001/results/01/ac/e9/results-01ace9bf.xdr.gz /data/buckets/tmp/catchup-c0109315de5346b9/results/01/ac/e9/results-01ace9bf.xdr.gz.tmp
2025-12-17T09:12:34.801 GAJSL [History WARNING] Could not download file: archive core_live_001 maybe missing file results/01/ac/e9/results-01ace9bf.xdr.gz
2025-12-17T09:12:34.806 GAJSL [History INFO] Catching up to ledger 28113151: Download & apply checkpoints: num checkpoints left to apply:255 (0% done)
2025-12-17T09:12:34.806 GAJSL [History INFO] Catching up to ledger 28113151: Failed: catchup-seq
2025-12-17T09:12:34.806 GAJSL [History WARNING] Catchup failed

Full log is attached here (the mission artifacts will be gone soon):
stellar-core-2025-12-17_08-48-07.log
There is no other failure message, and the process fails immediately after the S3 failure, without retrying.
The simplest command equivalent to the mission is:

dotnet run --project src/App/App.fsproj -- mission  HistoryPubnetParallelCatchupV2 --destination ./logs --image=docker-registry.services.stellar-ops.com/dev/stellar-core:25.0.1-2925.0d5731bae.noble-vnext-buildtests --old-image=docker-registry.services.stellar-ops.com/dev/stellar-core:24.1.0-2861.5a7035d49.focal-buildtests --netdelay-image=docker-registry.services.stellar-ops.com/dev/sdf-netdelay:latest --postgres-image=docker-registry.services.stellar-ops.com/dev/postgres:9.5.22 --nginx-image=docker-registry.services.stellar-ops.com/dev/nginx:latest --prometheus-exporter-image=docker-registry.services.stellar-ops.com/dev/stellar-core-prometheus-exporter:latest --ingress-internal-domain=stellar-supercluster.kube001-ssc-eks.services.stellar-ops.com --job-monitor-external-host=ssc-job-monitor-eks.services.stellar-ops.com --pubnet-parallel-catchup-starting-ledger=0 --require-node-labels-pc-v2=purpose:catchup --tolerate-node-taints-pc-v2=catchup:NoSchedule --require-node-labels=purpose:largetests --tolerate-node-taints=largetests --asan-options quarantine_size_mb=1:malloc_context_size=5:alloc_dealloc_mismatch=0 --catchup-skip-known-results-for-testing=true --pubnet-parallel-catchup-ledgers-per-job=1280 --service-account-annotations-pc-v2=eks.amazonaws.com/role-arn:arn:aws:iam::746476062914:role/kube001-ssc-eks-supercluster --s3-history-mirror-override-pc-v2=ssc-history-archive/prd/core-live --pubnet-parallel-catchup-num-workers=2

Stellar Core should retry failed downloads with exponential backoff. When I try to reproduce this, I can observe the retries working; here are the logs I observed in my local run (with an auth error, but with retries).
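
For context, here is a minimal sketch of the retry-with-exponential-backoff behavior I would expect around the download step. This is illustrative only, not stellar-core's actual implementation; runS3Copy, downloadWithRetry, and the attempt/backoff values are hypothetical names and numbers chosen for the example.

// Illustrative sketch only -- not stellar-core's real retry code.
// runS3Copy, downloadWithRetry, and the retry parameters are hypothetical.
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical helper: shells out to `aws s3 cp` and returns its exit code.
int runS3Copy(std::string const& src, std::string const& dst)
{
    std::string cmd = "aws s3 cp " + src + " " + dst;
    return std::system(cmd.c_str());
}

// Expected behavior: retry a failed download several times, doubling the
// wait between attempts, instead of failing the whole catchup on the first
// non-zero exit code.
bool downloadWithRetry(std::string const& src, std::string const& dst,
                       int maxAttempts = 5)
{
    auto delay = std::chrono::seconds(1);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt)
    {
        if (runS3Copy(src, dst) == 0)
        {
            return true; // download succeeded
        }
        std::cerr << "download attempt " << attempt << " failed\n";
        if (attempt < maxAttempts)
        {
            std::this_thread::sleep_for(delay);
            delay *= 2; // exponential backoff: 1s, 2s, 4s, ...
        }
    }
    return false; // give up only after exhausting all attempts
}

int main()
{
    // Hypothetical paths for illustration.
    bool ok = downloadWithRetry("s3://example-bucket/results-01ace9bf.xdr.gz",
                                "/tmp/results-01ace9bf.xdr.gz.tmp");
    return ok ? 0 : 1;
}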

We need to investigate why this behavior happens.


Labels: bug
