fix: break corrupt mirror snapshot cycle #190
Merged
Conversation
jrobotham-square approved these changes on Mar 13, 2026
When a corrupt or empty mirror snapshot exists in S3, pods restore it, the post-restore fetch fails, and then snapshot jobs immediately re-upload the corrupt mirror — perpetuating the cycle even after manual S3 cleanup. Three changes break this cycle:

1. FetchLenient: post-restore and startup fetches now omit the lowSpeedLimit check (same as executeClone), since large deltas after snapshot restore trigger server-side pack computation that stalls at near-zero transfer for minutes, tripping the 1KB/s threshold.
2. ResetToEmpty + fallback to clone: when the post-restore fetch fails, the corrupt mirror is removed and the repo state is reset to Empty so the code falls through to a fresh git clone --mirror instead of serving and re-uploading stale data.
3. Skip snapshot scheduling on failed fetch: snapshot and repack jobs are only scheduled after a successful fetch, both in the startup path and the post-restore path.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019ce5d1-9385-7026-8e69-78903ce99c47
Problem
When a corrupt or empty mirror snapshot exists in S3, pods enter a poison cycle:
1. A pod restores the corrupt snapshot from S3.
2. The post-restore `git fetch` fails — `lowSpeedLimit` (1KB/s for 60s) trips during server-side pack computation for the large delta (see the sketch below).
3. Snapshot jobs immediately re-upload the corrupt mirror, so the cycle persists even after manual S3 cleanup.

This cycle survived the fixes in #188 and #189 because those PRs prevented the creation of corruption (concurrent restores and concurrent fetch-during-tar) but not the propagation of already-corrupt snapshots.
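For reference, git exposes this kind of stall guard through the `http.lowSpeedLimit` and `http.lowSpeedTime` options (minimum bytes per second, and seconds below that rate before the transfer is aborted). The PR does not show how the service actually applies its 1KB/s-for-60s check, so the Go sketch below is only an illustration of the guard as described; `fetchGuarded` and `mirrorDir` are hypothetical names, not code from this repository.

```go
// Illustrative only: a fetch guarded the way the PR describes (abort if the
// transfer stays under 1 KB/s for 60 seconds), expressed via git's built-in
// http.lowSpeedLimit / http.lowSpeedTime options. How the real service
// implements the check is not shown in the PR.
package mirror

import (
	"context"
	"fmt"
	"os/exec"
)

// fetchGuarded (hypothetical) runs a fetch that aborts on a prolonged stall.
// After a snapshot restore, the server spends minutes computing a pack for
// the large delta at near-zero transfer rate, which is exactly what trips
// this guard.
func fetchGuarded(ctx context.Context, mirrorDir string) error {
	cmd := exec.CommandContext(ctx, "git",
		"-C", mirrorDir,
		"-c", "http.lowSpeedLimit=1000", // minimum bytes per second
		"-c", "http.lowSpeedTime=60",    // seconds below the limit before aborting
		"fetch", "--prune", "origin")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("guarded fetch failed: %w: %s", err, out)
	}
	return nil
}
```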
Fix
Three changes break the cycle:
1. FetchLenient: Post-restore and startup fetches omit the `lowSpeedLimit` check, matching `executeClone`'s behavior. Large deltas after snapshot restore trigger GitHub's server-side pack computation, which stalls at near-zero transfer rate for minutes, tripping the 1KB/s threshold.
2. ResetToEmpty + fallback to clone: When the post-restore fetch fails, the corrupt mirror directory is removed and the repo state is reset to Empty. The code then falls through to a fresh `git clone --mirror` instead of serving and re-uploading stale data.
3. Skip snapshot scheduling on failed fetch: Snapshot and repack jobs are only scheduled after a successful fetch, both in the startup path (`DiscoverExisting`) and the post-restore path (`startClone`).

A sketch of how these pieces fit together follows the list.
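The following is a minimal Go sketch of how the three changes could compose in the post-restore path, assuming a Go service. Only the identifiers FetchLenient, ResetToEmpty, DiscoverExisting, startClone, and executeClone come from the PR description; the Repo type, the function signatures, and the scheduleSnapshot/scheduleRepack helpers are assumptions made for illustration, not the repository's actual code.

```go
// Hypothetical sketch of the post-restore path after this PR. Names marked
// as placeholders below are not taken from the real codebase.
package mirror

import (
	"context"
	"fmt"
	"os"
	"os/exec"
)

type RepoState int

const (
	StateEmpty    RepoState = iota // no usable mirror on disk
	StateMirrored                  // mirror present and known-good
)

// Repo is a placeholder for the service's per-repository state.
type Repo struct {
	mirrorDir string
	remoteURL string
	state     RepoState
}

// FetchLenient fetches without the lowSpeedLimit guard (matching
// executeClone), so server-side pack computation can stall for minutes
// without aborting the transfer.
func FetchLenient(ctx context.Context, mirrorDir string) error {
	cmd := exec.CommandContext(ctx, "git", "-C", mirrorDir, "fetch", "--prune", "origin")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("lenient fetch failed: %w: %s", err, out)
	}
	return nil
}

// ResetToEmpty discards the (possibly corrupt) mirror directory and resets
// the state so the caller falls through to a fresh clone.
func (r *Repo) ResetToEmpty() error {
	if err := os.RemoveAll(r.mirrorDir); err != nil {
		return err
	}
	r.state = StateEmpty
	return nil
}

// afterRestore (placeholder for the post-restore path in startClone): try a
// lenient fetch; on failure, drop the restored mirror and fall back to
// git clone --mirror. Snapshot and repack jobs are scheduled only once the
// mirror is known-good, so a corrupt mirror is never re-uploaded.
func (r *Repo) afterRestore(ctx context.Context) error {
	if err := FetchLenient(ctx, r.mirrorDir); err != nil {
		if rerr := r.ResetToEmpty(); rerr != nil {
			return rerr
		}
		cmd := exec.CommandContext(ctx, "git", "clone", "--mirror", r.remoteURL, r.mirrorDir)
		if out, cerr := cmd.CombinedOutput(); cerr != nil {
			return fmt.Errorf("fallback clone failed: %w: %s", cerr, out)
		}
	}
	r.state = StateMirrored
	r.scheduleSnapshot() // reached only after a successful fetch or fresh clone
	r.scheduleRepack()
	return nil
}

func (r *Repo) scheduleSnapshot() { /* enqueue snapshot job (placeholder) */ }
func (r *Repo) scheduleRepack()   { /* enqueue repack job (placeholder) */ }
```

The ordering is the point of the sketch: snapshot and repack scheduling happen only after the fetch (or fallback clone) succeeds, and the same gating would apply on the startup path in DiscoverExisting.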
Testing

All existing tests pass. The fix was validated against staging logs showing the poison cycle in action.