2 changes: 1 addition & 1 deletion .gitignore
@@ -1,2 +1,2 @@
 /dist/
-/cache/
+/state/
7 changes: 6 additions & 1 deletion cachew.hcl
@@ -5,6 +5,11 @@
# mitm = ["artifactory.square.com"]
# }


git {
  mirror-root = "./state/git-mirrors"
}

host "https://w3.org" {}

github-releases {
@@ -15,5 +20,5 @@ github-releases {
memory {}

disk {
-  root = "./cache"
+  root = "./state/cache"
}
230 changes: 230 additions & 0 deletions docs/git-strategy-research.md
@@ -0,0 +1,230 @@
# Git Caching Strategy Research

## Goals

1. Minimize impact on upstream Git servers
2. Make git clones as fast as possible
3. Efficiently handle incremental fetches

## Three-Layer Approach

### Layer 1: Snapshot Tarballs (Fastest Initial Clones)

**Observation**: `tar` is significantly faster than Git at populating a repository because:
- No pack negotiation overhead
- No delta resolution computation
- Single sequential read/write operation
- Can use fast compression (zstd)

**Approach**:
1. Cache server maintains full clones of upstream repositories
2. Generate daily tarballs of the full clone
3. Client downloads and extracts tarball, then runs `git fetch` to catch up

**Client-side workflow**:
```
# Instead of: git clone https://github.com/org/repo
cachew-git clone https://github.com/org/repo
```

Under the hood:
1. Check whether a snapshot tarball exists for the repo
2. Download and extract: `curl ... | zstd -d | tar -xf -`
3. Set the remote URL to upstream (or to the cache proxy)
4. Run `git fetch` to pick up any updates since the snapshot
5. Run `git checkout` as normal
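
The steps above can be sketched as a small planning helper. This is only an illustration: `snapshotURL` and `postExtractCommands` are hypothetical names, `https://cache.example.com` is a placeholder cache address, and the sketch assumes the extracted snapshot already has a remote named `origin`. The URL layout follows the `/git/{host}/{path}/snapshot.tar.zst` endpoint described in the Architecture section.

```go
package main

import (
	"fmt"
	"strings"
)

// snapshotURL builds the snapshot endpoint for a repository.
// cacheBase is a placeholder for wherever the cache server is reachable.
func snapshotURL(cacheBase, host, repoPath string) string {
	return fmt.Sprintf("%s/git/%s/%s/snapshot.tar.zst",
		strings.TrimRight(cacheBase, "/"), host, strings.Trim(repoPath, "/"))
}

// postExtractCommands returns the git invocations for steps 3 and 4:
// point the remote at remoteURL, then fetch whatever changed since the snapshot.
// Assumption: the snapshot contains a remote called "origin".
func postExtractCommands(dest, remoteURL string) [][]string {
	return [][]string{
		{"git", "-C", dest, "remote", "set-url", "origin", remoteURL},
		{"git", "-C", dest, "fetch", "origin"},
	}
}

func main() {
	fmt.Println(snapshotURL("https://cache.example.com/", "github.com", "org/repo"))
	for _, argv := range postExtractCommands("repo", "https://github.com/org/repo") {
		fmt.Println(strings.Join(argv, " "))
	}
}
```

Returning argument vectors rather than shell strings keeps the wrapper free of quoting problems if it later runs these through `os/exec`.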

### Layer 2: Daily Bundles (Fallback for Non-Tarball Clients)

For clients that don't use the tarball option, daily bundles provide a simpler optimisation.

**Approach**:
- Generate one daily bundle containing all refs
- Cache server advertises bundle URI via protocol v2 `bundle-uri` capability
- Client cloning through cache proxy automatically fetches bundle first
- Git then negotiates remaining objects via normal protocol
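
A minimal sketch of the daily bundle step, assuming the server shells out to `git bundle create <file> --all` against its mirror. The helper names `bundleArgs` and `createBundle` and the example paths are hypothetical, not part of this PR.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
)

// bundleArgs returns the git arguments that write every ref of the mirror
// into a single bundle file. outFile should be absolute, because -C changes
// git's working directory to mirrorPath first.
func bundleArgs(mirrorPath, outFile string) []string {
	return []string{"-C", mirrorPath, "bundle", "create", outFile, "--all"}
}

// createBundle runs git with bundleArgs and includes git's output in the error.
func createBundle(ctx context.Context, mirrorPath, outFile string) error {
	out, err := exec.CommandContext(ctx, "git", bundleArgs(mirrorPath, outFile)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("git bundle create: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Only print the command here; createBundle would execute it.
	args := bundleArgs("/srv/mirrors/github.com/org/repo.git", "/tmp/bundle.bundle")
	fmt.Println("git " + strings.Join(args, " "))
}
```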

### Layer 3: Git Protocol Proxy (Normal Fetches)

Proxy `git-upload-pack` requests, always serving from the local clone.

**Approach**:
- Cache server intercepts git protocol requests
- Always serves objects from local clone (never proxies to upstream)
- Local clone is kept fresh via periodic background fetches

**Cache Key Strategy**:

To cache packfile responses, normalize and hash the request:
```
cache_key = hash(repo_url, sorted(want_oids), sorted(have_oids))
```

**Normalization**:
- Sort want/have OIDs lexicographically
- Include repo identifier
- Include the filter spec when one is present (partial clones), since it changes which objects the response contains

**Example**:
```
wants: [abc123, def456, 789xyz]
haves: [111aaa, 222bbb]

normalized = "{host}/{path}:wants=789xyz,abc123,def456:haves=111aaa,222bbb"
cache_key = sha256(normalized)
```
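
The normalization above could look roughly like this in Go. `packCacheKey` is a hypothetical helper name, and the exact layout of the normalized string simply mirrors the example; it is a sketch, not the strategy's actual implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// packCacheKey normalizes a fetch request into a stable cache key.
// The want/have slices are copied before sorting so callers keep their order.
func packCacheKey(host, path string, wants, haves []string) string {
	w := append([]string(nil), wants...)
	h := append([]string(nil), haves...)
	sort.Strings(w)
	sort.Strings(h)
	normalized := fmt.Sprintf("%s/%s:wants=%s:haves=%s",
		host, path, strings.Join(w, ","), strings.Join(h, ","))
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := packCacheKey("github.com", "org/repo",
		[]string{"abc123", "def456", "789xyz"}, []string{"111aaa", "222bbb"})
	b := packCacheKey("github.com", "org/repo",
		[]string{"def456", "789xyz", "abc123"}, []string{"222bbb", "111aaa"})
	fmt.Println(a == b) // prints true: ordering does not affect the key
}
```

Two clients that send the same wants and haves in a different order therefore map to the same cached packfile.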

**Benefits**:
- Zero load on upstream for git protocol operations
- Multiple clients with same repo state get cache hits
- CI builds cloning same commit hit cache
- Works transparently with standard git

**Considerations**:
- Local clone freshness depends on background fetch interval
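- Shallow clones may need separate handling: `shallow` and `deepen` arguments change the response, so they would have to be part of the cache key or bypass the cache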

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        Cache Server                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────┐    ┌─────────────────────────────────┐ │
│  │  Full Clone     │    │  Daily Generators               │ │
│  │  Storage        │───▶│  - Tarball snapshots (.tar.zst) │ │
│  │                 │    │  - Bundle files (.bundle)       │ │
│  │  /repos/        │    └─────────────────────────────────┘ │
│  │  {host}/{path}  │                     │                  │
│  │                 │                     ▼                  │
│  └────────┬────────┘    ┌─────────────────────────────────┐ │
│           │             │  Object Cache                   │ │
│           │             │  - Snapshots                    │ │
│           │             │  - Bundles                      │ │
│           └────────────▶│  - Packfile responses           │ │
│                         └─────────────────────────────────┘ │
│                                          │                  │
│                                          ▼                  │
│  ┌─────────────────────────────────────────────────────────┐│
│  │                     HTTP Endpoints                      ││
│  │                                                         ││
│  │  GET  /git/{host}/{path}/snapshot.tar.zst               ││
│  │  GET  /git/{host}/{path}/bundle.bundle                  ││
│  │  POST /git/{host}/{path}/git-upload-pack                ││
│  │                                                         ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
```

### Client Options

**Option A: Wrapper Script** (`cachew-git`) - Recommended
- Intercepts `clone` command
- Downloads snapshot tarball, extracts, fetches updates
- Falls back to bundle-uri or cached git protocol

**Option B: Git Config Redirect**
- Configure `url.<base>.insteadOf` to redirect through cache
- Works with standard git commands
- Only benefits from protocol caching and bundles (no tarball support)
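
As a rough illustration of Option B, the redirect is a single `git config` entry per upstream host. The sketch below builds that command; `insteadOfArgs` is a hypothetical helper, `https://cache.example.com` is a placeholder, and the `/git/{host}/` prefix is assumed from the endpoint layout shown in the diagram above.

```go
package main

import (
	"fmt"
	"strings"
)

// insteadOfArgs returns the arguments for `git config` that rewrite
// https://<upstreamHost>/... to <cacheBase>/git/<upstreamHost>/...
func insteadOfArgs(cacheBase, upstreamHost string) []string {
	key := fmt.Sprintf("url.%s/git/%s/.insteadOf",
		strings.TrimRight(cacheBase, "/"), upstreamHost)
	return []string{"config", "--global", key, "https://" + upstreamHost + "/"}
}

func main() {
	fmt.Println("git " + strings.Join(insteadOfArgs("https://cache.example.com", "github.com"), " "))
}
```

With that entry in place, `git clone https://github.com/org/repo` would be sent to `https://cache.example.com/git/github.com/org/repo` without any change to the command the user types.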

### Data Flow: Initial Clone (Tarball Client)

```
Client                    Cache Server                     Upstream
  │                             │                              │
  │  GET /snapshot.tar.zst      │                              │
  │────────────────────────────▶│                              │
  │◀────────────────────────────│  (serve from cache)          │
  │  tar -xf                    │                              │
  │                             │                              │
  │  git fetch (via cache)      │                              │
  │────────────────────────────▶│                              │
  │                             │  (cache lookup by            │
  │                             │   hashed refs)               │
  │◀────────────────────────────│                              │
```

### Data Flow: Normal Git Clone (Protocol Proxy)

```
Client                    Cache Server                     Upstream
  │                             │                              │
  │  git-upload-pack            │                              │
  │  wants=[...] haves=[...]    │                              │
  │────────────────────────────▶│                              │
  │                             │  hash(wants, haves)          │
  │                             │  cache lookup                │
  │                             │                              │
  │                             │  MISS: serve from local      │
  │                             │  clone, cache response       │
  │◀────────────────────────────│                              │
  │                             │                              │
  │                             │  HIT: serve from cache       │
  │◀────────────────────────────│                              │
```
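
The HIT/MISS branches in this flow amount to a cache-aside lookup. Below is a minimal in-memory sketch; `packCache` and `getOrGenerate` are hypothetical names, and a real implementation would presumably use the tiered cache backend and stream responses rather than hold whole packfiles in a map.

```go
package main

import (
	"fmt"
	"sync"
)

// packCache is an in-memory stand-in for the packfile response cache.
type packCache struct {
	mu      sync.Mutex
	entries map[string][]byte
}

func newPackCache() *packCache {
	return &packCache{entries: make(map[string][]byte)}
}

// getOrGenerate returns the cached response for key, or calls generate
// (for example, serving from the local clone) and stores the result.
// The lock is not held while generating, so two concurrent misses for the
// same key may both generate; the later store simply overwrites the earlier.
func (c *packCache) getOrGenerate(key string, generate func() ([]byte, error)) ([]byte, bool, error) {
	c.mu.Lock()
	data, ok := c.entries[key]
	c.mu.Unlock()
	if ok {
		return data, true, nil // HIT
	}
	data, err := generate() // MISS
	if err != nil {
		return nil, false, err
	}
	c.mu.Lock()
	c.entries[key] = data
	c.mu.Unlock()
	return data, false, nil
}

func main() {
	cache := newPackCache()
	calls := 0
	gen := func() ([]byte, error) { calls++; return []byte("PACK"), nil }
	_, hit1, _ := cache.getOrGenerate("k", gen)
	_, hit2, _ := cache.getOrGenerate("k", gen)
	fmt.Println(hit1, hit2, calls) // prints: false true 1
}
```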

## Implementation Plan

### Phase 1: Clone Management
1. Storage for full clones on cache server
2. Background job to `git fetch` from upstream periodically
3. Track last-fetched time per repository
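
One possible shape for the last-fetched bookkeeping in step 3 is sketched below. `fetchTracker`, `fetchInterval`, and `t0` are illustrative names and values, not part of this PR; the 10-minute interval is an arbitrary pick.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchInterval is an example refresh interval for the background job.
const fetchInterval = 10 * time.Minute

// t0 is a fixed reference time used only by the example in main.
var t0 = time.Date(2025, time.January, 2, 12, 0, 0, 0, time.UTC)

// fetchTracker records when each repository was last fetched from upstream.
type fetchTracker struct {
	mu   sync.Mutex
	last map[string]time.Time
}

func newFetchTracker() *fetchTracker {
	return &fetchTracker{last: make(map[string]time.Time)}
}

// markFetched records a successful fetch of repo at the given time.
func (t *fetchTracker) markFetched(repo string, at time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.last[repo] = at
}

// due reports whether repo has never been fetched or was last fetched
// at least interval ago.
func (t *fetchTracker) due(repo string, now time.Time, interval time.Duration) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	last, ok := t.last[repo]
	return !ok || now.Sub(last) >= interval
}

func main() {
	tr := newFetchTracker()
	repo := "github.com/org/repo"
	fmt.Println(tr.due(repo, t0, fetchInterval)) // true: never fetched
	tr.markFetched(repo, t0)
	fmt.Println(tr.due(repo, t0.Add(fetchInterval/2), fetchInterval)) // false
	fmt.Println(tr.due(repo, t0.Add(fetchInterval), fetchInterval))   // true
}
```

Passing `now` explicitly instead of calling `time.Now()` inside `due` keeps the check deterministic in tests.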

### Phase 2: Snapshot Tarballs
1. Daily tarball generation from full clones
2. HTTP endpoint to serve snapshots
3. Client wrapper script (`cachew-git clone`)

### Phase 3: Git Protocol Proxy
1. Implement `git-upload-pack` endpoint
2. Parse wants/haves from request
3. Normalize and hash for cache key
4. Serve from local clone, cache packfile responses
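
Step 2 comes down to walking the pkt-line framing of the request body: each packet starts with a four-digit hex length that includes those four bytes, and lengths below 4 (`0000` flush, `0001` delimiter) carry no payload. The sketch below assumes the body has already been decompressed if the client sent it gzip-encoded; `parseWantsHaves` and `pktLine` are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// pktLine frames s as a single pkt-line (length prefix includes itself).
func pktLine(s string) string {
	return fmt.Sprintf("%04x%s", len(s)+4, s)
}

// parseWantsHaves collects the object IDs from "want" and "have" lines.
// Anything after the OID on a line (such as capabilities) is ignored.
func parseWantsHaves(body []byte) (wants, haves []string, err error) {
	for len(body) > 0 {
		if len(body) < 4 {
			return nil, nil, errors.New("truncated pkt-line header")
		}
		n, perr := strconv.ParseUint(string(body[:4]), 16, 16)
		if perr != nil {
			return nil, nil, fmt.Errorf("bad pkt-line length %q: %w", body[:4], perr)
		}
		if n < 4 { // flush, delimiter, and other special packets
			body = body[4:]
			continue
		}
		if int(n) > len(body) {
			return nil, nil, errors.New("truncated pkt-line payload")
		}
		fields := strings.Fields(string(body[4:n]))
		body = body[n:]
		if len(fields) < 2 {
			continue
		}
		switch fields[0] {
		case "want":
			wants = append(wants, fields[1])
		case "have":
			haves = append(haves, fields[1])
		}
	}
	return wants, haves, nil
}

func main() {
	req := pktLine("command=fetch\n") + "0001" +
		pktLine("want abc123\n") + pktLine("have 111aaa\n") +
		pktLine("done\n") + "0000"
	wants, haves, err := parseWantsHaves([]byte(req))
	fmt.Println(wants, haves, err)
}
```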

### Phase 4: Bundle Support
1. Daily bundle generation from full clones
2. HTTP endpoint to serve bundle file
3. Advertise bundle-uri in protocol v2 capability during git-upload-pack

## Key Decisions

### Git Version Requirement
- Git 2.38+ for bundle-uri support
- Client wrapper works with any Git version

### Compression
- Tarballs: zstd (fast decompression, good ratio)
- Bundles: Git's native pack compression

### Cache Keys
- Snapshots: `git/{host}/{path}/snapshot-{date}.tar.zst`
- Bundles: `git/{host}/{path}/bundle-{date}.bundle`
- Packfiles: `git/{host}/{path}/pack-{hash(wants,haves)}.pack`
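
These key templates could be rendered with helpers along the following lines. The helper names are hypothetical, and the `YYYY-MM-DD` UTC date format is an assumption, since the document only says `{date}`.

```go
package main

import (
	"fmt"
	"time"
)

// exampleDay is a fixed date used only by the example in main.
var exampleDay = time.Date(2025, time.January, 2, 15, 4, 5, 0, time.UTC)

// dateStamp formats the day component of a key (assumed format: YYYY-MM-DD, UTC).
func dateStamp(day time.Time) string {
	return day.UTC().Format("2006-01-02")
}

// snapshotKey renders git/{host}/{path}/snapshot-{date}.tar.zst.
func snapshotKey(host, path string, day time.Time) string {
	return fmt.Sprintf("git/%s/%s/snapshot-%s.tar.zst", host, path, dateStamp(day))
}

// bundleKey renders git/{host}/{path}/bundle-{date}.bundle.
func bundleKey(host, path string, day time.Time) string {
	return fmt.Sprintf("git/%s/%s/bundle-%s.bundle", host, path, dateStamp(day))
}

// packKey renders git/{host}/{path}/pack-{hash}.pack, where hash is the
// digest of the normalized wants and haves.
func packKey(host, path, hash string) string {
	return fmt.Sprintf("git/%s/%s/pack-%s.pack", host, path, hash)
}

func main() {
	fmt.Println(snapshotKey("github.com", "org/repo", exampleDay))
	fmt.Println(bundleKey("github.com", "org/repo", exampleDay))
	fmt.Println(packKey("github.com", "org/repo", "0123abcd"))
}
```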

### Freshness
- Bare clone fetch: every 5-15 minutes (configurable)
- Snapshots: generated daily
- Bundles: generated daily
- Packfiles: long TTL (immutable for given inputs)

### Storage
- Full clones: local filesystem (fast access needed)
- Everything else: cache backend (tiered)

## Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Stale snapshots | Always `git fetch` after snapshot extract |
| Large repositories | Consider blobless partial clone support later |
| Upstream auth | Pass through credentials or use deployment keys |
| Storage growth | Retention policies, single clone per repo |
| Packfile cache misses | Most CI builds request identical state, so the hit rate should be high |

## References

- [Git Bundle-URI Documentation](https://git-scm.com/docs/bundle-uri)
- [Git Protocol v2](https://git-scm.com/docs/protocol-v2)
- [Git Pack Protocol](https://git-scm.com/docs/pack-protocol)
1 change: 1 addition & 0 deletions internal/config/config.go
@@ -14,6 +14,7 @@ import (
	"github.com/block/cachew/internal/cache"
	"github.com/block/cachew/internal/logging"
	"github.com/block/cachew/internal/strategy"
	_ "github.com/block/cachew/internal/strategy/git" // Register git strategy
)

type loggingMux struct {
101 changes: 101 additions & 0 deletions internal/strategy/git/backend.go
@@ -0,0 +1,101 @@
package git

import (
	"context"
	"log/slog"
	"net/http"
	"net/http/cgi" //nolint:gosec // CVE-2016-5386 only affects Go < 1.6.3
	"os"
	"os/exec"
	"path/filepath"

	"github.com/alecthomas/errors"

	"github.com/block/cachew/internal/httputil"
	"github.com/block/cachew/internal/logging"
)

// serveFromBackend serves a Git request using git http-backend.
func (s *Strategy) serveFromBackend(w http.ResponseWriter, r *http.Request, c *clone) {
	logger := logging.FromContext(r.Context())

	gitPath, err := exec.LookPath("git")
	if err != nil {
		httputil.ErrorResponse(w, r, http.StatusInternalServerError, "git not found in PATH")
		return
	}

	absRoot, err := filepath.Abs(s.config.MirrorRoot)
	if err != nil {
		httputil.ErrorResponse(w, r, http.StatusInternalServerError, "failed to get absolute path")
		return
	}

	// Build the path that git http-backend expects
	host := r.PathValue("host")
	pathValue := r.PathValue("path")

	// git http-backend expects the path as-is: /host/repo.git/info/refs
	backendPath := "/" + host + "/" + pathValue

	logger.DebugContext(r.Context(), "Serving with git http-backend",
		slog.String("original_path", r.URL.Path),
		slog.String("backend_path", backendPath),
		slog.String("clone_path", c.path))

	handler := &cgi.Handler{
		Path: gitPath,
		Args: []string{"http-backend"},
		Env: []string{
			"GIT_PROJECT_ROOT=" + absRoot,
			"GIT_HTTP_EXPORT_ALL=1",
			"PATH=" + os.Getenv("PATH"),
		},
	}

	// Modify request for http-backend
	r2 := r.Clone(r.Context())
	r2.URL.Path = backendPath

	handler.ServeHTTP(w, r2)
}

// executeClone performs a git clone --bare --mirror operation.
func (s *Strategy) executeClone(ctx context.Context, c *clone) error {
	logger := logging.FromContext(ctx)

	if err := os.MkdirAll(filepath.Dir(c.path), 0o750); err != nil {
		return errors.Wrap(err, "create clone directory")
	}

	// #nosec G204 - c.upstreamURL and c.path are controlled by us
	cmd := exec.CommandContext(ctx, "git", "clone", "--bare", "--mirror", c.upstreamURL, c.path)
	output, err := cmd.CombinedOutput()
	if err != nil {
		logger.ErrorContext(ctx, "git clone failed",
			slog.String("error", err.Error()),
			slog.String("output", string(output)))
		return errors.Wrap(err, "git clone")
	}

	logger.DebugContext(ctx, "git clone succeeded", slog.String("output", string(output)))
	return nil
}

// executeFetch performs a git fetch --all operation.
func (s *Strategy) executeFetch(ctx context.Context, c *clone) error {
	logger := logging.FromContext(ctx)

	// #nosec G204 - c.path is controlled by us
	cmd := exec.CommandContext(ctx, "git", "-C", c.path, "fetch", "--all")
	output, err := cmd.CombinedOutput()
	if err != nil {
		logger.ErrorContext(ctx, "git fetch failed",
			slog.String("error", err.Error()),
			slog.String("output", string(output)))
		return errors.Wrap(err, "git fetch")
	}

	logger.DebugContext(ctx, "git fetch succeeded", slog.String("output", string(output)))
	return nil
}