2 changes: 1 addition & 1 deletion .pre-commit/golangci-lint-hook
@@ -54,7 +54,7 @@ hook() {
pushd "${root_dir}" || exit

echo "Running golangci-lint..."
- golangci-lint run ./... || exit 1
+ golangci-lint run --build-tags test ./... || exit 1

popd >/dev/null || exit
}
2 changes: 1 addition & 1 deletion .pre-commit/unit-test-hook
@@ -23,7 +23,7 @@ hook() {

# run the pre-commit hook
pushd "${root_dir}" || exit
- go test -v -cover ./... || exit 1
+ go test -tags test -v -timeout 5m -cover ./... || exit 1
popd >/dev/null || exit
}

22 changes: 11 additions & 11 deletions Makefile
@@ -8,7 +8,7 @@ GITVERSION_NOT_INSTALLED = "gitversion is not installed: https://github.com/GitT


test:
- go test -v -timeout 5m -cover ./...
+ go test -tags test -v -timeout 5m -cover ./...

# bench runs the benchmark tests in the benchmark subpackage of the tests package.
bench:
@@ -44,14 +44,14 @@ prepare-toolchain:
$(call check_command_exists,staticcheck) || go install honnef.co/go/tools/cmd/staticcheck@latest

@echo "Checking if pre-commit is installed..."
- pre-commit --version || (echo "pre-commit is not installed, install it with 'pip install pre-commit'" && exit 1)
-
- @echo "Initializing pre-commit..."
- pre-commit validate-config || pre-commit install && pre-commit install-hooks
-
- @echo "Installing pre-commit hooks..."
- pre-commit install
- pre-commit install-hooks
+ pre-commit --version >/dev/null 2>&1 || echo "pre-commit not found; skipping hook installation (optional)"
+ @if command -v pre-commit >/dev/null 2>&1; then \
+ echo "Initializing pre-commit..."; \
+ pre-commit validate-config || pre-commit install && pre-commit install-hooks; \
+ echo "Installing pre-commit hooks..."; \
+ pre-commit install; \
+ pre-commit install-hooks; \
+ fi


lint: prepare-toolchain
@@ -64,10 +64,10 @@ lint: prepare-toolchain
gofumpt -l -w ${GOFILES_NOVENDOR}

@echo "\nRunning staticcheck..."
- staticcheck ./...
+ staticcheck -tags test ./...

@echo "\nRunning golangci-lint $(GOLANGCI_LINT_VERSION)..."
- golangci-lint run --fix -v ./......
+ golangci-lint run --fix -v --build-tags test ./...

# check_command_exists is a helper function that checks if a command exists.
define check_command_exists
51 changes: 48 additions & 3 deletions README.md
@@ -1,4 +1,4 @@
# HyperCache

[![Go](https://github.com/hyp3rd/hypercache/actions/workflows/go.yml/badge.svg)][build-link] [![CodeQL](https://github.com/hyp3rd/hypercache/actions/workflows/codeql.yml/badge.svg)][codeql-link] [![golangci-lint](https://github.com/hyp3rd/hypercache/actions/workflows/golangci-lint.yml/badge.svg)][golangci-lint-link]

@@ -234,9 +234,10 @@
| Merkle anti-entropy | Implemented (pull-based) |
| Merkle performance metrics | Implemented (fetch/build/diff nanos) |
| Remote-only key enumeration fallback | Implemented with optional cap (`WithDistListKeysCap`) |
- | Delete semantics (tombstones) | Implemented (no compaction yet) |
- | Tombstone compaction / TTL | Planned |
- | Quorum read/write consistency | Partially scaffolded (consistency levels enum) |
+ | Delete semantics (tombstones) | Implemented |
+ | Tombstone compaction / TTL | Implemented |
+ | Quorum read consistency | Implemented |
+ | Quorum write consistency | Implemented (acks enforced) |
| Failure detection / heartbeat | Experimental heartbeat present |
| Membership changes / dynamic rebalancing | Not yet |
| Network transport (HTTP partial) | Basic HTTP management + fetch merkle/keys; full RPC TBD |
@@ -265,6 +266,50 @@

Note: DistMemory is not a production distributed cache; it is a stepping stone towards a networked, failure‑aware implementation.

#### Consistency & Quorum Semantics

DistMemory currently supports three consistency levels configurable independently for reads and writes:

- ONE: Return after the primary (or first reachable owner) succeeds.
- QUORUM: A majority of owners (floor(R/2)+1, where R is the replication factor) must acknowledge.
- ALL: Every owner must acknowledge; any unreachable replica causes failure.

Required acknowledgements are computed at runtime from the ring's current replication factor. For writes, the primary applies the change locally, then synchronously fans out to the remaining owners. For reads, it queries owners until the required number of successful responses is reached, promoting the next owner if the primary is unreachable. Read‑repair occurs when a later owner returns a newer version than the primary's local copy.
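
The acknowledgement arithmetic described above can be illustrated with a small sketch. The `ConsistencyLevel` type, its constants, and `requiredAcks` are hypothetical names chosen for this example; they are not confirmed to match the library's API.

```go
package main

import "fmt"

// ConsistencyLevel mirrors the three levels described above.
// The type and constant names are illustrative assumptions.
type ConsistencyLevel int

const (
	One ConsistencyLevel = iota
	Quorum
	All
)

// requiredAcks returns how many owners must acknowledge an operation
// for the given replication factor.
func requiredAcks(level ConsistencyLevel, replication int) int {
	if replication < 1 {
		replication = 1 // treat a missing replication factor as a single owner
	}

	switch level {
	case Quorum:
		return replication/2 + 1 // integer division gives floor(R/2)
	case All:
		return replication
	default:
		return 1
	}
}

func main() {
	for _, r := range []int{1, 3, 5} {
		fmt.Printf("R=%d ONE=%d QUORUM=%d ALL=%d\n",
			r, requiredAcks(One, r), requiredAcks(Quorum, r), requiredAcks(All, r))
	}
}
```

With a replication factor of 3, QUORUM tolerates one unreachable owner while ALL tolerates none.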

#### Hinted Handoff

When a replica is unreachable during a write, a hint (deferred write) is enqueued locally keyed by the target node ID. Hints have a TTL (`WithDistHintTTL`) and are replayed on an interval (`WithDistHintReplayInterval`). Limits can be applied per node (`WithDistHintMaxPerNode`). Expired hints are dropped; delivered hints increment replay counters. Metrics exposed via the management endpoint allow monitoring queued, replayed, expired, and dropped hints.
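
A rough sketch of the lifecycle described above (enqueue, per-node cap, TTL expiry, replay) might look like the following. All type and function names here are hypothetical, and the sketch omits locking, which a real implementation would presumably need for concurrent access.

```go
package main

import (
	"fmt"
	"time"
)

// hint is a deferred write for a replica that was unreachable.
type hint struct {
	key     string
	value   []byte
	expires time.Time
}

// hintQueue keeps pending hints keyed by target node ID.
// Hypothetical structure; the real DistMemory internals may differ.
type hintQueue struct {
	ttl        time.Duration
	maxPerNode int // 0 means unlimited
	byNode     map[string][]hint
	now        func() time.Time

	replayed, expired, dropped int
}

func newHintQueue(ttl time.Duration, maxPerNode int) *hintQueue {
	return &hintQueue{
		ttl:        ttl,
		maxPerNode: maxPerNode,
		byNode:     make(map[string][]hint),
		now:        time.Now,
	}
}

// enqueue records a hint for nodeID and reports whether it was accepted.
// The hint is dropped when the per-node cap is already reached.
func (q *hintQueue) enqueue(nodeID, key string, value []byte) bool {
	if q.maxPerNode > 0 && len(q.byNode[nodeID]) >= q.maxPerNode {
		q.dropped++
		return false
	}

	h := hint{key: key, value: value, expires: q.now().Add(q.ttl)}
	q.byNode[nodeID] = append(q.byNode[nodeID], h)

	return true
}

// replay tries to deliver every queued hint for nodeID. Expired hints are
// discarded; hints whose delivery fails stay queued for the next cycle.
func (q *hintQueue) replay(nodeID string, deliver func(key string, value []byte) error) {
	now := q.now()

	var remaining []hint
	for _, h := range q.byNode[nodeID] {
		switch {
		case now.After(h.expires):
			q.expired++
		case deliver(h.key, h.value) != nil:
			remaining = append(remaining, h)
		default:
			q.replayed++
		}
	}

	if len(remaining) == 0 {
		delete(q.byNode, nodeID)
		return
	}
	q.byNode[nodeID] = remaining
}

func main() {
	q := newHintQueue(time.Minute, 2)
	q.enqueue("node-b", "k1", []byte("v1"))
	q.enqueue("node-b", "k2", []byte("v2"))
	q.enqueue("node-b", "k3", []byte("v3")) // exceeds the cap and is dropped

	q.replay("node-b", func(key string, _ []byte) error {
		fmt.Println("replayed", key)
		return nil
	})
	fmt.Printf("replayed=%d expired=%d dropped=%d\n", q.replayed, q.expired, q.dropped)
}
```

The three counters correspond loosely to the replayed, expired, and dropped metrics mentioned above.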

Test helper methods for forcing a replay cycle (`StartHintReplayForTest`, `ReplayHintsForTest`, `HintedQueueSize`) are compiled only under the `test` build tag to keep production binaries clean.

To run tests that rely on these helpers:

```bash
go test -tags test ./...
```

#### Build Tags

The repository uses a `//go:build test` tag to include auxiliary instrumentation and helpers exclusively in test builds (e.g. hinted handoff queue inspection). Production builds omit these symbols automatically.

#### Metrics Snapshot

The `/dist/metrics` endpoint (and `DistMemory.Metrics()` API) expose counters for forwarding operations, replica fan‑out, read‑repair, hinted handoff lifecycle, quorum write attempts/acks/failures, Merkle sync timings, tombstone activity, and heartbeat probes. These are reset only on process restart.

#### Future Evolution

Planned enhancements toward a production‑grade distributed backend include:

- Real network transport (HTTP/JSON → gRPC) for data plane operations.
- Gossip‑based membership & failure detection (alive/suspect/dead) with automatic ring rebuild.
- Rebalancing & key range handoff on join/leave events.
- Incremental & adaptive anti‑entropy (Merkle diff scheduling, deletions reconciliation).
- Advanced versioning (hybrid logical clocks or vector clocks) and conflict resolution strategies.
- Client library for direct owner routing (avoiding extra network hops).
- Optional compression, TLS/mTLS security, auth middleware.

Until these land, DistMemory should be treated as an experimental playground rather than a fault‑tolerant cluster.

Examples can be too broad for a README; refer to the [examples](./__examples/README.md) directory for a more comprehensive overview.

## License
169 changes: 169 additions & 0 deletions ROADMAP.md
@@ -0,0 +1,169 @@
# Distributed Backend Roadmap

This document tracks the evolution of the experimental `DistMemory` backend into a production‑grade multi‑node cluster in incremental, reviewable phases.

## Guiding Principles

- **Incremental**: Ship thin vertical slices; keep feature flags for rollback.
- **Deterministic**: Prefer explicit ownership calculations & version ordering.
- **Observable**: Every subsystem emits metrics/logs before being relied upon.
- **Fail Safe**: Degraded components (one node down) should not cascade failures.
- **Pluggable**: Transport, membership, serialization, and security are replaceable.

## Current State (Baseline)

Implemented:

- Consistent hashing ring (virtual nodes) + static membership.
- Replication factor & read/write consistency (ONE / QUORUM / ALL) with quorum enforcement.
- Versioning (Lamport-like counter) and read‑repair.
- Hinted handoff (queue TTL, replay interval, metrics, test-only helpers behind `//go:build test`).
- Tombstones with TTL + compaction; anti-resurrection semantics.
- Merkle tree anti‑entropy (build + diff + pull) with metrics.
- Management endpoints (`/cluster/*`, `/dist/*`, `/internal/merkle`, `/internal/keys`).
- Metrics: quorum attempts/failures, replication fan‑out, hinted handoff lifecycle, merkle timings, tombstone counts.
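
For readers unfamiliar with the first item, a consistent hashing ring with virtual nodes can be sketched roughly as below. This is an illustration only: it assumes FNV-1a hashing and a `name#index` labelling scheme for virtual nodes, neither of which is confirmed by this document, and `ring`, `newRing`, and `owners` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// vnode is one virtual position on the ring owned by a physical node.
type vnode struct {
	hash uint64
	node string
}

// ring holds all virtual nodes sorted by hash.
type ring struct {
	vnodes []vnode
}

// hashKey maps a string onto the ring (FNV-1a is an assumption of this sketch).
func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s)) // hash.Hash writes never return an error

	return h.Sum64()
}

// newRing places virtualNodes positions on the ring for every node.
func newRing(nodes []string, virtualNodes int) *ring {
	r := &ring{}
	for _, n := range nodes {
		for i := 0; i < virtualNodes; i++ {
			r.vnodes = append(r.vnodes, vnode{hash: hashKey(fmt.Sprintf("%s#%d", n, i)), node: n})
		}
	}
	sort.Slice(r.vnodes, func(i, j int) bool { return r.vnodes[i].hash < r.vnodes[j].hash })

	return r
}

// owners walks clockwise from the key's position and collects the first
// `replication` distinct nodes; the first entry acts as the primary.
func (r *ring) owners(key string, replication int) []string {
	if len(r.vnodes) == 0 {
		return nil
	}

	target := hashKey(key)
	start := sort.Search(len(r.vnodes), func(i int) bool { return r.vnodes[i].hash >= target })

	seen := make(map[string]bool)
	var out []string
	for i := 0; i < len(r.vnodes) && len(out) < replication; i++ {
		v := r.vnodes[(start+i)%len(r.vnodes)] // wrap around the end of the ring
		if !seen[v.node] {
			seen[v.node] = true
			out = append(out, v.node)
		}
	}

	return out
}

func main() {
	r := newRing([]string{"node-a", "node-b", "node-c"}, 64)
	fmt.Println(r.owners("user:42", 2))
}
```

Because membership is static in the current baseline, the ring in this sketch is built once and never mutated.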

Gaps:

- No real network RPC for data path (only in-process transport).
- Static membership (no gossip / dynamic join-leave / failure states).
- No key rebalancing / ownership transfer on membership change.
- Anti-entropy incremental scheduling & delete reconciliation tests incomplete.
- No client SDK for direct routing.
- Limited chaos/failure injection; no latency/fault simulation.
- Security (TLS/auth) absent.
- Persistence & durability out of scope (future consideration).

## Phase Overview

### Phase 1: Data Plane & DistConfig (Weeks 1–2)

Deliverables:

- `DistConfig` (NodeID, BindAddr, AdvertiseAddr, Seeds, ReplicationFactor, VirtualNodes, Hint settings, Consistency levels).
- HTTP JSON RPC endpoints: `POST /internal/set`, `GET /internal/get`, `DELETE /internal/del`.
- HTTP implementation of `DistTransport` (keep current in-process implementation for tests).
- Refactor DistMemory forwarding to use transport abstraction seamlessly.
- Multi-process integration test (3 nodes) verifying quorum & hint replay.

Metrics:

- Add latency histograms for set/get/del.

Success Criteria:

- Cross-process quorum & hinted handoff tests pass without code changes except wiring config.

### Phase 2: Failure Detection & Dynamic Membership (Weeks 3–4)

Deliverables:

- Gossip/heartbeat loop (k random peers, interval configurable).
- Node state transitions: alive → suspect → dead (timeouts & confirmations).
- Ring rebuild on state change (exclude dead nodes, retain for hint replay until TTL expiry).
- Global hint queue caps (count + bytes) with drop metrics.

Metrics:

- Heartbeat successes/failures, suspect/dead counters, membership version.

Success Criteria:

- Simulated node failure triggers quorum degradation & hinting; recovery drains hints.
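
The alive → suspect → dead progression planned for this phase could be driven by elapsed time since the last successful heartbeat, as in the sketch below. It covers only the timeout part (not peer confirmations), and `nodeState`, `nextState`, and both thresholds are hypothetical names and parameters introduced for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// nodeState is the failure-detection state of a peer.
type nodeState int

const (
	alive nodeState = iota
	suspect
	dead
)

// String returns a human-readable state name.
func (s nodeState) String() string {
	return [...]string{"alive", "suspect", "dead"}[s]
}

// nextState derives a peer's state from the time since its last successful
// heartbeat. suspectAfter and deadAfter are assumed tuning knobs.
func nextState(sinceLastSeen, suspectAfter, deadAfter time.Duration) nodeState {
	switch {
	case sinceLastSeen >= deadAfter:
		return dead
	case sinceLastSeen >= suspectAfter:
		return suspect
	default:
		return alive
	}
}

func main() {
	for _, d := range []time.Duration{time.Second, 4 * time.Second, 12 * time.Second} {
		fmt.Printf("%v since last heartbeat -> %v\n", d, nextState(d, 3*time.Second, 10*time.Second))
	}
}
```

A ring rebuild would then exclude peers reported as `dead`, while their queued hints could be retained until TTL expiry as the deliverables state.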

### Phase 3: Rebalancing & Key Transfer (Weeks 5–6)

Deliverables:

- Ownership diff algorithm (old vs new ring).
- Batched key transfer (scan source owners; preserve versions & tombstones).
- Rate limiting & concurrent batch cap.
- Join/leave integration tests (distribution variance <10% of ideal after settle).

Metrics:

- Keys transferred, transfer duration, throttle events.

Success Criteria:

- Newly joined node receives expected shard of data; leaves do not resurrect deleted keys.
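
One possible shape for the ownership diff is sketched below: for each key, compare the owners under the old ring with those under the new ring and schedule a transfer to every node that newly gained the key. The function names and the callback-based ring lookups are assumptions of this sketch, not the planned design.

```go
package main

import (
	"fmt"
	"sort"
)

// gained returns the nodes present in next but absent from prev.
func gained(prev, next []string) []string {
	old := make(map[string]bool, len(prev))
	for _, n := range prev {
		old[n] = true
	}

	var out []string
	for _, n := range next {
		if !old[n] {
			out = append(out, n)
		}
	}

	return out
}

// planTransfers maps each target node to the sorted list of keys it must
// receive after the ring changes. oldOwners and newOwners stand in for
// owner lookups against the old and new ring.
func planTransfers(keys []string, oldOwners, newOwners func(key string) []string) map[string][]string {
	plan := make(map[string][]string)
	for _, k := range keys {
		for _, target := range gained(oldOwners(k), newOwners(k)) {
			plan[target] = append(plan[target], k)
		}
	}
	for _, ks := range plan {
		sort.Strings(ks) // stable order makes batching and testing easier
	}

	return plan
}

func main() {
	oldOwners := func(string) []string { return []string{"node-a", "node-b"} }
	newOwners := func(key string) []string {
		if key == "k2" {
			return []string{"node-a", "node-c"} // node-c joined and took a replica of k2
		}
		return []string{"node-a", "node-b"}
	}

	fmt.Println(planTransfers([]string{"k1", "k2"}, oldOwners, newOwners))
}
```

To satisfy the success criterion about deleted keys, the transfer itself would need to carry versions and tombstones along with values; the sketch only decides which keys move where.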

### Phase 4: Anti-Entropy Hardening (Weeks 7–8)

Deliverables:

- Incremental / windowed Merkle scheduling with adaptive backoff.
- Tombstone & delete reconciliation test matrix.
- Read-repair batching + metric for repairs applied.
- Optional fast-path hash (rolling / bloom) for clean shard skip.

Success Criteria:

- Injected divergences converge within configured interval (< target).

### Phase 5: Client SDK & Performance (Weeks 9–10)

Deliverables:

- Go client: seed discovery, ring bootstrap, direct owner hashing, parallel fan-out for QUORUM/ALL.
- Benchmarks: proxy path vs client-direct (latency reduction target >15%).
- Optional message serialization toggle (JSON/msgpack).

Success Criteria:

- QUORUM Get/Set p95 latency improved vs proxy path.

### Phase 6: Security & Observability (Weeks 11–12)

Deliverables:

- TLS enablement (cert config); optional mTLS.
- Pluggable auth (HMAC/Bearer) middleware for data RPC.
- OpenTelemetry spans: Set, Get, ReplicaFanout, HintReplay, MerkleSync, Rebalance.
- Structured logging (node id, trace id, op fields).

Success Criteria:

- End-to-end trace present for a Set with replication fan-out.

### Phase 7: Resilience & Chaos (Weeks 13–14)

Deliverables:

- Fault injection hooks (drop %, delay, partition simulation inside transport).
- Chaos tests (latency spikes, packet loss, partial partitions).
- Long-running stability test (memory growth bounded; no unbounded queues).

Success Criteria:

- Under 10% injected packet loss, quorum failure rate within acceptable SLO (<2% for QUORUM writes).

## Cross-Cutting Items

- Documentation updates per phase (`README`, `docs/distributed.md`).
- CI enhancements: integration cluster spin-up, race detector, benchmarks.
- Metric name stability & versioning (prefix `hypercache_dist_`).
- Feature flags / env toggles for new subsystems (gossip, rebalancing, anti-entropy scheduling).

## KPIs

| KPI | Target |
|-----|--------|
| QUORUM Set p95 (3-node HTTP) | < 3x in-process baseline |
| QUORUM Get p95 | < 2x in-process baseline |
| Hint Drain Time (single node outage 5m) | < 2m after recovery |
| Data Imbalance Post-Join | < 10% variance from ideal |
| Divergence Convergence Time | < configured sync interval |
| Quorum Failure Rate (1 node down, QUORUM) | < 2% |

## Immediate Next Actions (Phase 1 Kickoff)

1. Create `distconfig.go` with DistConfig struct + option to load into DistMemory.
2. Define HTTP transport interface & request/response schemas.
3. Implement server handlers (reuse existing serialization & version logic).
4. Add integration test harness launching 3 HTTP nodes (ephemeral ports) and exercising Set/Get with QUORUM & hinted handoff.
5. Introduce latency histograms (atomic moving buckets or exposable summary) for RPC.
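
For step 5, a fixed-bucket histogram backed by atomic counters is one lightweight option. The sketch below is an assumption about how such a histogram could look; `latencyHistogram` and the bucket bounds are hypothetical and not taken from the codebase.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// latencyHistogram counts observations in fixed buckets.
// bounds holds inclusive upper bounds in ascending order; the final
// counter is an overflow bucket for anything larger.
type latencyHistogram struct {
	bounds []time.Duration
	counts []int64
}

func newLatencyHistogram(bounds ...time.Duration) *latencyHistogram {
	return &latencyHistogram{bounds: bounds, counts: make([]int64, len(bounds)+1)}
}

// observe records one latency sample; it is safe for concurrent use
// because each bucket is updated with an atomic add.
func (h *latencyHistogram) observe(d time.Duration) {
	i := 0
	for i < len(h.bounds) && d > h.bounds[i] {
		i++
	}
	atomic.AddInt64(&h.counts[i], 1)
}

// snapshot returns a copy of the per-bucket counts.
func (h *latencyHistogram) snapshot() []int64 {
	out := make([]int64, len(h.counts))
	for i := range h.counts {
		out[i] = atomic.LoadInt64(&h.counts[i])
	}

	return out
}

func main() {
	h := newLatencyHistogram(time.Millisecond, 10*time.Millisecond, 100*time.Millisecond)
	for _, d := range []time.Duration{500 * time.Microsecond, 3 * time.Millisecond, 250 * time.Millisecond} {
		h.observe(d)
	}
	fmt.Println(h.snapshot())
}
```

Percentiles such as the p95 targets in the KPI table can only be approximated from bucket counts, so bucket bounds would need to be chosen around the expected latency range.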

---

This roadmap will evolve; adjustments are captured via PR edits to this file.
1 change: 1 addition & 0 deletions cspell.config.yaml
@@ -36,6 +36,7 @@ words:
- gerr
- gitversion
- GITVERSION
- goarch
- goccy
- gochecknoglobals
- gofiber
35 changes: 35 additions & 0 deletions internal/dist/config.go
@@ -0,0 +1,35 @@
package dist

import "time"

const (
	defaultVirtualNodes = 64
)

// Config holds cluster node + distributed settings for DistMemory (and future networked backends).
type Config struct {
	NodeID           string
	BindAddr         string // address the node listens on for RPC
	AdvertiseAddr    string // address shared with peers (may differ from BindAddr)
	Seeds            []string
	Replication      int
	VirtualNodes     int
	ReadConsistency  int // maps to backend.ConsistencyLevel
	WriteConsistency int
	HintTTL          time.Duration
	HintReplay       time.Duration
	HintMaxPerNode   int
}

// Defaults returns a Config with safe initial values.
func Defaults() Config { //nolint:ireturn
	return Config{
		Replication:      1,
		VirtualNodes:     defaultVirtualNodes,
		ReadConsistency:  0, // ONE
		WriteConsistency: 1, // QUORUM (match backend default)
		HintTTL:          0,
		HintReplay:       0,
		HintMaxPerNode:   0,
	}
}
2 changes: 2 additions & 0 deletions internal/dist/transport.go
@@ -0,0 +1,2 @@
// Package dist currently holds configuration primitives. Transport was moved to pkg/backend.
package dist