Skip to content

Latest commit

 

History

History
88 lines (65 loc) · 3.76 KB

File metadata and controls

88 lines (65 loc) · 3.76 KB

DiffKeeper: The Kubernetes Time Machine

CI License: Apache 2.0

DiffKeeper is a Black Box Flight Recorder for your containers.

It watches your application's filesystem in real-time and records every change. When a container crashes—or a CI test flakes—you can rewind the state to any exact moment and see exactly what happened.

Note: The earlier "stateful containers" design is archived. See Genesis & Pivot for the story.


The Problem: "Why did that test fail?"

You have a flaky test in CI. It fails 1 out of 50 times. You re-run the job, and it passes. You have no idea why.

  • Logs only show you what the application printed.
  • They don't show you that a config file was corrupted, a temp file was locked, or a binary was overwritten.

The Solution: Instant Replay

DiffKeeper uses eBPF to capture filesystem writes at line-rate and stores them in Pebble. Then it gives you a timeline so you never guess timestamps again.

1) Record a Session

Wrap your flaky test (or any command). Minimal overhead.

diffkeeper record --state-dir=/tmp/trace -- go test ./...

2) See the Timeline (no blindfolds)

List every write in order to pick the exact second to rewind:

diffkeeper timeline --state-dir=/tmp/trace
[00m:01s] WRITE    status.log (13B)
[00m:05s] WRITE    db.lock (6B)
[02m:14s] WRITE    status.log (22B)   <-- the failure

3) Export the Crash Site

Restore the filesystem to the moment of failure:

diffkeeper export --state-dir=/tmp/trace --out=./debug_fs --time="2m14s"

cd ./debug_fs and inspect files exactly as they existed at that moment.

Drop-in GitHub Action

No curl | sh snippets needed—use the composite action directly:

steps:
  - uses: actions/checkout@v4
  - name: Record flaky test
    uses: saworbit/diffkeeper@v1
    with:
      command: go test ./...
      state-dir: diffkeeper-trace

On failure the trace uploads as an artifact; you can run diffkeeper timeline to find the culprit write, then diffkeeper export to reconstruct it locally.

The "Flaky CI" Demo

Run the built-in demo to see the loop end-to-end:

diffkeeper record --state-dir=./trace -- go run ./demo/flaky-ci-test
diffkeeper timeline --state-dir=./trace
diffkeeper export --state-dir=./trace --out=./restored --time="2s"
cat ./restored/status.log  # ERROR: Connection Lost

Architecture

  • Engine: Pure Go + eBPF (CO-RE)
  • Storage: Pebble (LSM) for high-speed ingestion.
  • Diffing: bsdiff (binary patches) for efficient storage.

CI / Dogfooding

  • GitHub Actions (.github/workflows/ci.yml) runs unit/race tests, cross-platform builds, and a functional time-machine test that records a flaky script and verifies exports.
  • BoltDB-era workflows remain archived under docs/archive/v1-legacy/workflows/.

Requirements & Compatibility

  • Build Process: The recommended way to build the project is with Docker, which requires no local dependencies. Simply run make build-dockerized. For local builds, you will need Go, clang, and bpftool.
  • Runtime Privileges: The core recording feature requires sudo privileges on Linux to attach the eBPF probes to the kernel. The application will provide a clear error if run without them.
  • Cross-Platform Support: The high-performance eBPF monitoring is Linux-specific. The tool provides a fallback for macOS and Windows, but its behavior and performance will differ.

Getting Started

See the Quickstart to record, view the timeline, and export your first trace.