Description
In a recent mission, parallel catchup failed with the following message:
2025-12-17T09:12:18.582 GAJSL [default INFO] Performing maintenance
2025-12-17T09:12:18.582 GAJSL [History INFO] Trimming history <= ledger 28109119
2025-12-17T09:12:34.801 GAJSL [Process WARNING] process 7596 exited 1: aws s3 cp --region us-east-1 s3://ssc-history-archive/prd/core-live/core_live_001/results/01/ac/e9/results-01ace9bf.xdr.gz /data/buckets/tmp/catchup-c0109315de5346b9/results/01/ac/e9/results-01ace9bf.xdr.gz.tmp
2025-12-17T09:12:34.801 GAJSL [History WARNING] Could not download file: archive core_live_001 maybe missing file results/01/ac/e9/results-01ace9bf.xdr.gz
2025-12-17T09:12:34.806 GAJSL [History INFO] Catching up to ledger 28113151: Download & apply checkpoints: num checkpoints left to apply:255 (0% done)
2025-12-17T09:12:34.806 GAJSL [History INFO] Catching up to ledger 28113151: Failed: catchup-seq
2025-12-17T09:12:34.806 GAJSL [History WARNING] Catchup failed
Full log attached here (the mission artifacts will be gone soon):
stellar-core-2025-12-17_08-48-07.log
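As a sanity check, you can confirm whether the object is really absent from the archive with a command like the following (path copied from the log above; this assumes read access to the bucket and is only a diagnostic suggestion, not part of the mission):
aws s3 ls --region us-east-1 s3://ssc-history-archive/prd/core-live/core_live_001/results/01/ac/e9/results-01ace9bf.xdr.gz
The command lists the object if it is present, so an empty result would point at a genuinely missing file rather than a transient download error.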
There is no other failure message, and the process fails immediately after the S3 failure, without retrying.
The simplest command equivalent to the mission:
dotnet run --project src/App/App.fsproj -- mission HistoryPubnetParallelCatchupV2 --destination ./logs --image=docker-registry.services.stellar-ops.com/dev/stellar-core:25.0.1-2925.0d5731bae.noble-vnext-buildtests --old-image=docker-registry.services.stellar-ops.com/dev/stellar-core:24.1.0-2861.5a7035d49.focal-buildtests --netdelay-image=docker-registry.services.stellar-ops.com/dev/sdf-netdelay:latest --postgres-image=docker-registry.services.stellar-ops.com/dev/postgres:9.5.22 --nginx-image=docker-registry.services.stellar-ops.com/dev/nginx:latest --prometheus-exporter-image=docker-registry.services.stellar-ops.com/dev/stellar-core-prometheus-exporter:latest --ingress-internal-domain=stellar-supercluster.kube001-ssc-eks.services.stellar-ops.com --job-monitor-external-host=ssc-job-monitor-eks.services.stellar-ops.com --pubnet-parallel-catchup-starting-ledger=0 --require-node-labels-pc-v2=purpose:catchup --tolerate-node-taints-pc-v2=catchup:NoSchedule --require-node-labels=purpose:largetests --tolerate-node-taints=largetests --asan-options quarantine_size_mb=1:malloc_context_size=5:alloc_dealloc_mismatch=0 --catchup-skip-known-results-for-testing=true --pubnet-parallel-catchup-ledgers-per-job=1280 --service-account-annotations-pc-v2=eks.amazonaws.com/role-arn:arn:aws:iam::746476062914:role/kube001-ssc-eks-supercluster --s3-history-mirror-override-pc-v2=ssc-history-archive/prd/core-live --pubnet-parallel-catchup-num-workers=2
Stellar Core should retry failed downloads with exponential backoff. When I try to reproduce this, I can observe the retries working; here are the logs I observed in my local run (with an auth error, but the retries do happen).
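For reference, the expected behavior is roughly what the following shell sketch does: retry the failing download with exponentially growing delays before giving up. This is only an illustration of the expectation, not stellar-core's actual retry code; the attempt count, delays, and destination path are made up.
# Hypothetical sketch: retry the download with exponential backoff.
max_attempts=5
delay=1
attempt=1
until aws s3 cp --region us-east-1 s3://ssc-history-archive/prd/core-live/core_live_001/results/01/ac/e9/results-01ace9bf.xdr.gz /tmp/results-01ace9bf.xdr.gz
do
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "download failed after $max_attempts attempts" >&2
    break
  fi
  echo "attempt $attempt failed; retrying in ${delay}s" >&2
  sleep "$delay"
  attempt=$((attempt + 1))
  delay=$((delay * 2))   # double the delay on each failure
done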
We need to investigate why this behavior happens.