
parallel catchup v2 sometimes fails immediately on S3 failure #354


Description

@jayz22

In a recent mission, parallel catchup failed with the following message:

2025-12-17T09:12:18.582 GAJSL [default INFO] Performing maintenance
2025-12-17T09:12:18.582 GAJSL [History INFO] Trimming history <= ledger 28109119
2025-12-17T09:12:34.801 GAJSL [Process WARNING] process 7596 exited 1: aws s3 cp --region us-east-1 s3://ssc-history-archive/prd/core-live/core_live_001/results/01/ac/e9/results-01ace9bf.xdr.gz /data/buckets/tmp/catchup-c0109315de5346b9/results/01/ac/e9/results-01ace9bf.xdr.gz.tmp
2025-12-17T09:12:34.801 GAJSL [History WARNING] Could not download file: archive core_live_001 maybe missing file results/01/ac/e9/results-01ace9bf.xdr.gz
2025-12-17T09:12:34.806 GAJSL [History INFO] Catching up to ledger 28113151: Download & apply checkpoints: num checkpoints left to apply:255 (0% done)
2025-12-17T09:12:34.806 GAJSL [History INFO] Catching up to ledger 28113151: Failed: catchup-seq
2025-12-17T09:12:34.806 GAJSL [History WARNING] Catchup failed

Full log is attached here (the mission artifacts will be gone soon):
stellar-core-2025-12-17_08-48-07.log
There is no other failure message, and the process fails immediately after the S3 failure, without retrying.
The simplest command equivalent to the mission is:

dotnet run --project src/App/App.fsproj -- mission  HistoryPubnetParallelCatchupV2 --destination ./logs --image=docker-registry.services.stellar-ops.com/dev/stellar-core:25.0.1-2925.0d5731bae.noble-vnext-buildtests --old-image=docker-registry.services.stellar-ops.com/dev/stellar-core:24.1.0-2861.5a7035d49.focal-buildtests --netdelay-image=docker-registry.services.stellar-ops.com/dev/sdf-netdelay:latest --postgres-image=docker-registry.services.stellar-ops.com/dev/postgres:9.5.22 --nginx-image=docker-registry.services.stellar-ops.com/dev/nginx:latest --prometheus-exporter-image=docker-registry.services.stellar-ops.com/dev/stellar-core-prometheus-exporter:latest --ingress-internal-domain=stellar-supercluster.kube001-ssc-eks.services.stellar-ops.com --job-monitor-external-host=ssc-job-monitor-eks.services.stellar-ops.com --pubnet-parallel-catchup-starting-ledger=0 --require-node-labels-pc-v2=purpose:catchup --tolerate-node-taints-pc-v2=catchup:NoSchedule --require-node-labels=purpose:largetests --tolerate-node-taints=largetests --asan-options quarantine_size_mb=1:malloc_context_size=5:alloc_dealloc_mismatch=0 --catchup-skip-known-results-for-testing=true --pubnet-parallel-catchup-ledgers-per-job=1280 --service-account-annotations-pc-v2=eks.amazonaws.com/role-arn:arn:aws:iam::746476062914:role/kube001-ssc-eks-supercluster --s3-history-mirror-override-pc-v2=ssc-history-archive/prd/core-live --pubnet-parallel-catchup-num-workers=2

Stellar Core should retry failed downloads with exponential backoff. When I try to reproduce this, I can observe the retries working; here are the logs I observed in my local run (with an auth error, but with retries).
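
For context, here is a minimal sketch of the retry-with-exponential-backoff behavior I would expect around the download step. This is illustrative only, not stellar-core's actual implementation; runS3Copy, downloadWithRetry, and the attempt/backoff values are hypothetical names and numbers chosen for the example.

// Illustrative sketch only -- not stellar-core's real retry code.
// runS3Copy, downloadWithRetry, and the retry parameters are hypothetical.
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical helper: shells out to `aws s3 cp` and returns its exit code.
int runS3Copy(std::string const& src, std::string const& dst)
{
    std::string cmd = "aws s3 cp " + src + " " + dst;
    return std::system(cmd.c_str());
}

// Expected behavior: retry a failed download several times, doubling the
// wait between attempts, instead of failing the whole catchup on the first
// non-zero exit code.
bool downloadWithRetry(std::string const& src, std::string const& dst,
                       int maxAttempts = 5)
{
    auto delay = std::chrono::seconds(1);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt)
    {
        if (runS3Copy(src, dst) == 0)
        {
            return true; // download succeeded
        }
        std::cerr << "download attempt " << attempt << " failed\n";
        if (attempt < maxAttempts)
        {
            std::this_thread::sleep_for(delay);
            delay *= 2; // exponential backoff: 1s, 2s, 4s, ...
        }
    }
    return false; // give up only after exhausting all attempts
}

int main()
{
    // Hypothetical paths for illustration.
    bool ok = downloadWithRetry("s3://example-bucket/results-01ace9bf.xdr.gz",
                                "/tmp/results-01ace9bf.xdr.gz.tmp");
    return ok ? 0 : 1;
}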

We need to investigate why this behavior happens.


Labels: bug
