feat(git): serve fresh snapshots via delta bundles#197

Merged
worstell merged 3 commits into main from
worstell/sync-snapshot-freshness
Mar 19, 2026

Conversation

Contributor

@worstell worstell commented Mar 17, 2026

Summary

Serves fresh git snapshots by supplementing cached S3 snapshots with small delta bundles, avoiding expensive full snapshot regeneration on every request.

Problem

For busy repos, the cached snapshot becomes stale quickly. Previously, detecting staleness meant regenerating the entire snapshot from scratch — effectively invalidating the cache on every request. The regenerated snapshot would itself be stale by the next request.

Solution

When the cached snapshot HEAD differs from the local mirror HEAD, cachew:

  1. Streams the cached snapshot as-is (application/zstd)
  2. Sets an X-Cachew-Bundle-Url response header pointing to a separate /snapshot.bundle endpoint
  3. The client fetches the bundle in parallel with snapshot extraction
  4. The bundle is a small git bundle covering commits between the snapshot's HEAD and the mirror's current HEAD
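
The client side of the steps above could look roughly like the following sketch. It is not taken from blox or cachew: `fetchSnapshot`, `fetchAll`, `newFakeCachew`, and the `/git/example/repo/snapshot.tar.zst` path are hypothetical names used for illustration, and reading the body into memory stands in for streaming extraction. The one detail taken from this PR is the `X-Cachew-Bundle-Url` header carrying a relative `/snapshot.bundle?base=...` path.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// fetchResult holds the snapshot bytes and, when one was advertised and
// retrievable, the delta bundle bytes.
type fetchResult struct {
	Snapshot []byte
	Bundle   []byte // nil when no bundle was advertised or it could not be fetched
}

// fetchAll GETs url and returns the whole body, failing on non-200 responses.
func fetchAll(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// fetchSnapshot downloads the snapshot and, if the response carries an
// X-Cachew-Bundle-Url header, downloads the bundle concurrently.
func fetchSnapshot(client *http.Client, baseURL, snapshotPath string) (fetchResult, error) {
	var res fetchResult
	resp, err := client.Get(baseURL + snapshotPath)
	if err != nil {
		return res, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return res, fmt.Errorf("snapshot: unexpected status %s", resp.Status)
	}

	var (
		wg        sync.WaitGroup
		bundleErr error
	)
	// Old clients never look at this header, which is what keeps the
	// snapshot endpoint backward compatible.
	if bundlePath := resp.Header.Get("X-Cachew-Bundle-Url"); bundlePath != "" {
		wg.Add(1)
		go func() {
			defer wg.Done()
			res.Bundle, bundleErr = fetchAll(client, baseURL+bundlePath)
		}()
	}

	// Stand-in for streaming extraction of the tar.zst body.
	res.Snapshot, err = io.ReadAll(resp.Body)
	wg.Wait()
	if err != nil {
		return res, err
	}
	if bundleErr != nil {
		// Assumed policy: a missing bundle only costs freshness.
		res.Bundle = nil
	}
	return res, nil
}

// newFakeCachew is a test double that advertises a bundle for every snapshot.
func newFakeCachew() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/git/example/repo/snapshot.tar.zst", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/zstd")
		w.Header().Set("X-Cachew-Bundle-Url", "/git/example/repo/snapshot.bundle?base=abc123")
		io.WriteString(w, "snapshot-bytes")
	})
	mux.HandleFunc("/git/example/repo/snapshot.bundle", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/x-git-bundle")
		io.WriteString(w, "bundle-bytes")
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeCachew()
	defer srv.Close()
	res, err := fetchSnapshot(srv.Client(), srv.URL, "/git/example/repo/snapshot.tar.zst")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("snapshot=%d bytes, bundle=%d bytes\n", len(res.Snapshot), len(res.Bundle))
}
```

Treating a failed bundle fetch as non-fatal is an assumption of this sketch; it mirrors the behavior described in the review thread, where a missed bundle leaves the client with an older but still usable snapshot.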

Delta bundles are cached in S3 (2h TTL) so any cachew pod can serve them, eliminating cross-pod 404s.

Key details

  • Wire format: Snapshot response is always plain application/zstd. Bundle is served at a separate URL as application/x-git-bundle. Fully backward compatible — old clients ignore the header.
  • Freshness: Bounded by the mirror fetch interval (15m), but in practice much fresher since each snapshot request triggers an async mirror refresh.
  • Bundle size: Bounded by the S3 snapshot regeneration interval (1h), keeping bundles small.
  • S3 caching: Bundles are proactively generated and cached in S3 during snapshot serving, and also cached on-demand at the bundle endpoint.
  • No read locks on git bundle create or git rev-parse — git handles its own file-level locking, consistent with serveFromBackend.

Deploy order

Cachew first (backward compatible), then blox.

@worstell worstell requested a review from a team as a code owner March 17, 2026 20:14
@worstell worstell requested review from js-murph and removed request for a team March 17, 2026 20:14
@worstell worstell marked this pull request as draft March 17, 2026 20:35
@worstell worstell force-pushed the worstell/sync-snapshot-freshness branch 3 times, most recently from 90c0530 to 01883f6 Compare March 18, 2026 20:13
@worstell worstell changed the title from "git: serve fresh snapshots by fetching synchronously when cached snapshot is stale" to "git: serve fresh snapshots via delta bundles" Mar 18, 2026
@worstell worstell changed the title from "git: serve fresh snapshots via delta bundles" to "feat(git): serve fresh snapshots via delta bundles" Mar 18, 2026
@worstell worstell force-pushed the worstell/sync-snapshot-freshness branch from 01883f6 to 5f94ce2 Compare March 18, 2026 20:37
@worstell worstell marked this pull request as ready for review March 18, 2026 20:39
@worstell worstell force-pushed the worstell/sync-snapshot-freshness branch 2 times, most recently from 90b6993 to 185f4ee Compare March 18, 2026 21:36
Comment thread internal/strategy/git/snapshot.go Outdated
@worstell worstell force-pushed the worstell/sync-snapshot-freshness branch from 185f4ee to 73a7d43 Compare March 18, 2026 22:48
Comment thread internal/strategy/git/snapshot.go Outdated
// delta bundle covering commits between the snapshot's HEAD and the mirror's
// current HEAD. When no bundle is needed (common case), the snapshot is
// streamed directly without buffering.
func (s *Strategy) serveSnapshotWithBundle(ctx context.Context, w http.ResponseWriter, reader io.ReadCloser, headers http.Header, repo *gitclone.Repository, upstreamURL string) error {
Collaborator

Another approach could be to return a header like X-Cachew-Bundle-URL that tells the client where to retrieve the bundle. Only reason I say that is that it's a bit weird to request snapshot.tar.zst and get a weird combined response.

Contributor Author

yeah that is fair- i like that approach

Contributor Author

with the new API, i'm refactoring a bit to also store bundles in s3 with an aggressive TTL. without that i'm seeing occasional errors when the client tries to retrieve the bundle and hits a different cachew node that doesn't have it. it's not the worst but when that happens no bundle gets applied and the snapshot can be up to an hour old

@worstell worstell marked this pull request as draft March 19, 2026 00:45
Comment thread internal/strategy/git/snapshot.go Outdated
}

func (s *Strategy) createBundle(ctx context.Context, repo *gitclone.Repository, baseCommit string) ([]byte, error) {
return gitclone.WithReadLockReturn(repo, func() ([]byte, error) { //nolint:wrapcheck // error is already wrapped inside the closure
Collaborator

Did we figure out if we actually need read locks when doing operations like this?

Contributor Author

no we do not- git handles it. removed

worstell and others added 3 commits March 19, 2026 12:25
Workstations now receive a snapshot at HEAD instead of a potentially
stale cached snapshot. On each snapshot request, cachew:

1. Fetches the mirror synchronously (O(delta), cheap)
2. Serves the cached snapshot tar.zst (bulk of the repo, fast)
3. Appends a git bundle of commits between the snapshot's HEAD and
   the mirror's current HEAD (O(delta), cheap)

The response uses a header-framed format:
- Content-Type: application/x-cachew-snapshot
- X-Cachew-Snapshot-Size: byte length of the snapshot portion
- Body: [snapshot.tar.zst][delta.bundle]

The periodic snapshot job still regenerates the full snapshot on
interval, bounding the bundle size. This avoids the expensive
O(repo-size) tar+zstd regeneration on the hot path while ensuring
workstations always start near HEAD.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019d02ea-462f-733a-bef4-03ee9e2d23c8

Instead of concatenating the bundle into the snapshot response body,
return an X-Cachew-Bundle-URL header pointing to a new /snapshot.bundle
endpoint. This gives cleaner HTTP semantics and lets clients fetch the
bundle in parallel with snapshot extraction.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019d0383-345f-7659-8538-30bc7c9a7e6d

When a bundle is generated (either proactively during snapshot serving
or on-demand at the bundle endpoint), cache it in S3 with a 2h TTL.
The bundle endpoint checks cache first, so any pod can serve bundles
without needing the local mirror. Eliminates 404s when the bundle
request is load-balanced to a different pod.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019d0383-345f-7659-8538-30bc7c9a7e6d
@worstell worstell force-pushed the worstell/sync-snapshot-freshness branch from 73a7d43 to 1da3e98 Compare March 19, 2026 19:34
@worstell worstell changed the title from "feat(git): serve fresh snapshots via delta bundles" to "feat: serve fresh snapshots via delta bundles" Mar 19, 2026
@worstell worstell changed the title from "feat: serve fresh snapshots via delta bundles" to "feat(git): serve fresh snapshots via delta bundles" Mar 19, 2026
@worstell worstell marked this pull request as ready for review March 19, 2026 19:37
@worstell worstell requested a review from alecthomas March 19, 2026 19:37
@worstell worstell merged commit c3b4733 into main Mar 19, 2026
9 checks passed
@worstell worstell deleted the worstell/sync-snapshot-freshness branch March 19, 2026 23:53
repoPath, err := gitclone.RepoPathFromURL(upstreamURL)
if err == nil {
bundleURL := fmt.Sprintf("/git/%s/snapshot.bundle?base=%s", repoPath, snapshotCommit)
w.Header().Set("X-Cachew-Bundle-Url", bundleURL)
Collaborator

Nice

if _, err = io.Copy(w, reader); err != nil {
logger.ErrorContext(ctx, "Failed to stream snapshot", "upstream", upstreamURL, "error", err)

bundleData, err := s.createBundle(ctx, repo, base)
Collaborator

I don't quite understand why this wouldn't just return an io.Reader rather than buffering?

return nil, errors.Wrapf(err, "git bundle create: %s", string(output))
}

data, err := os.ReadFile(bundlePath) //nolint:gosec // bundlePath is a temp file we created
Collaborator

There's no need to buffer this. The file can be deleted while open, and the reader will continue.

3 participants