feat: implement Phase 2 failure detection with heartbeat sampling and hint caps #52

hyp3rd · 2025-08-24T15:33:32Z

- Add heartbeat peer sampling with configurable size (WithDistHeartbeatSample)
- Implement node state transition metrics (suspect/dead counters)
- Add global hint queue caps by count (WithDistHintMaxTotal) and bytes (WithDistHintMaxBytes)
- Track membership version for cluster state changes
- Expose membership snapshot API with state distribution
- Add comprehensive test coverage for failure recovery, hint caps, and sampling
- Update documentation to reflect Phase 2 completion status
- Refactor hint replay logic for better maintainability
- Add approximate byte accounting for queued hints with new metrics

This completes the experimental failure detection system outlined in the roadmap Phase 2, providing better scalability through sampling and resource protection via global hint limits.
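
For readers who want a concrete picture of the global hint caps, here is a minimal sketch of admission control by count and by approximate bytes. It is not the code in this PR: `hintQueue`, `tryEnqueue`, and `release` are hypothetical names, the "zero means unlimited" convention is an assumption, and the sketch is not safe for concurrent use.

```go
package main

// hintQueue is a hypothetical, simplified stand-in for global hint accounting.
// Assumption: a cap of 0 means "no limit".
type hintQueue struct {
	maxTotal int   // cap on the number of queued hints
	maxBytes int64 // cap on the approximate bytes of queued hints
	total    int   // hints currently queued
	bytes    int64 // approximate bytes currently queued
	dropped  int64 // hints rejected because a cap was reached
}

// tryEnqueue admits a hint of the given approximate size,
// or counts it as dropped when either global cap would be exceeded.
func (q *hintQueue) tryEnqueue(size int64) bool {
	if q.maxTotal > 0 && q.total >= q.maxTotal {
		q.dropped++
		return false
	}
	if q.maxBytes > 0 && q.bytes+size > q.maxBytes {
		q.dropped++
		return false
	}
	q.total++
	q.bytes += size
	return true
}

// release returns capacity after a hint has been replayed or expired.
func (q *hintQueue) release(size int64) {
	q.total--
	q.bytes -= size
}
```

With `maxTotal: 2` and `maxBytes: 100`, two 40-byte hints are admitted and a third is dropped on the count cap; after one `release(40)`, a 70-byte hint is still dropped because it would exceed the byte cap.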


trunk-io · 2025-08-24T15:33:59Z

Running Code Quality on PRs by uploading data to Trunk will soon be removed. You can still run checks on your PRs using trunk-action - see the migration guide for more information.

Copilot

Pull Request Overview

This PR implements Phase 2 of the experimental failure detection system, adding heartbeat peer sampling, node state transition tracking, and global hint queue limits for better scalability and resource protection.

- Implements configurable heartbeat sampling to probe random peers instead of all peers each tick
- Adds global hint queue caps by count and bytes with metrics for dropped hints due to limits
- Tracks membership version changes and exposes node state transitions (suspect/dead counters)
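
To illustrate the sampling idea in the first bullet, the sketch below picks up to `k` peers at random on each tick instead of probing every peer. The names `samplePeers` and `newSampler` are hypothetical, and treating `k <= 0` as "probe all peers" is an assumption, not something this PR states.

```go
package main

import "math/rand"

// newSampler returns a seeded random source (hypothetical helper).
func newSampler(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// samplePeers returns up to k peers chosen uniformly at random without
// replacement. Assumption: k <= 0 or k >= len(peers) means "probe everyone".
func samplePeers(peers []string, k int, rng *rand.Rand) []string {
	out := append([]string(nil), peers...) // copy so the caller's slice is not reordered
	if k <= 0 || k >= len(out) {
		return out
	}
	rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out[:k]
}
```

A heartbeat loop could call `samplePeers(allPeers, sampleSize, rng)` once per tick, so the per-tick probing cost depends on the sample size rather than on the cluster size.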

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
pkg/backend/dist_memory.go	Core implementation of heartbeat sampling, global hint caps, and state transition metrics
internal/cluster/membership.go	Add membership versioning for cluster state change tracking
internal/cluster/version.go	New atomic version tracker for membership changes
tests/	Comprehensive test coverage for failure recovery, hint caps, and heartbeat sampling
ROADMAP.md	Update Phase 2 status to completed
README.md	Document new features and configuration options
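
As a rough illustration of what an atomic membership version tracker can look like, the sketch below uses `sync/atomic` on a plain `uint64`. The type and method names are hypothetical and are not taken from `internal/cluster/version.go`.

```go
package main

import "sync/atomic"

// membershipVersion is a hypothetical monotonically increasing counter
// that is bumped on every membership change (upsert, state transition, removal).
type membershipVersion struct {
	v uint64
}

// Next increments the version and returns the new value.
func (m *membershipVersion) Next() uint64 {
	return atomic.AddUint64(&m.v, 1)
}

// Get returns the current version without modifying it.
func (m *membershipVersion) Get() uint64 {
	return atomic.LoadUint64(&m.v)
}
```

Readers can compare two `Get` results to detect that the membership view changed in between, without taking a lock.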


Copilot · 2025-08-24T15:34:43Z

pkg/backend/dist_memory.go

+func (dm *DistMemory) approxHintSize(item *cache.Item) int64 { // receiver retained for symmetry; may use config later
+	_ = dm // acknowledge receiver intentionally (satisfy lint under current rule set)


[nitpick] The comment and the unused-receiver acknowledgment are unnecessary. Consider making this a standalone function, or removing the receiver if it is not needed for symmetry.

Suggested change

-func (dm *DistMemory) approxHintSize(item *cache.Item) int64 { // receiver retained for symmetry; may use config later
-	_ = dm // acknowledge receiver intentionally (satisfy lint under current rule set)
+func approxHintSize(item *cache.Item) int64 {
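
A standalone version along the lines of this suggestion might look like the sketch below. The `item` struct is a hypothetical stand-in for `cache.Item` (its real fields are not shown in this thread), and the fixed `hintOverhead` constant is an assumed value used only for illustration.

```go
package main

// item is a hypothetical stand-in for cache.Item; the real type may differ.
type item struct {
	Key   string
	Value []byte
}

// hintOverhead is an assumed fixed per-hint bookkeeping cost in bytes.
const hintOverhead = 64

// approxHintSize estimates the memory held by a queued hint.
// It needs no receiver because it reads nothing from the backend.
func approxHintSize(it *item) int64 {
	if it == nil {
		return 0
	}
	return int64(len(it.Key)) + int64(len(it.Value)) + hintOverhead
}
```

Dropping the receiver also removes the need for the `_ = dm` line, since there is no longer an unused parameter for the linter to flag.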

Copilot · 2025-08-24T15:34:44Z

tests/testhelpers/key_owner_helper.go

+// Package tests provides shared test helpers (duplicate directory retained to appease earlier imports if any).
+package tests
+
+// (File intentionally left empty after consolidation of helpers.)


This empty file with a comment about consolidation suggests incomplete cleanup. Consider removing this file entirely if it's no longer needed.

Suggested change

-// Package tests provides shared test helpers (duplicate directory retained to appease earlier imports if any).
-package tests
-
-// (File intentionally left empty after consolidation of helpers.)

Copilot · 2025-08-24T15:34:44Z

tests/hypercache_distmemory_heartbeat_sampling_test.go

+	if ver < 3 { // initial upserts already increment version; tolerate timing variance
+		t.Fatalf("expected membership version >=4, got %v", verAny)


[nitpick] The magic number 3 in the version check lacks clear justification. Consider calculating the expected minimum version based on the number of operations or using a named constant.

Suggested change

-	if ver < 3 { // initial upserts already increment version; tolerate timing variance
-		t.Fatalf("expected membership version >=4, got %v", verAny)
+	if ver < initialUpserts { // initial upserts already increment version; tolerate timing variance
+		t.Fatalf("expected membership version >=%d, got %v", initialUpserts, verAny)
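
One way to keep the threshold and the failure message from drifting apart (the original check compares against 3 while its message says ">=4") is to derive both from a single named value. The sketch below assumes `initialUpserts` would be defined as a constant in the test; the value 3 and the helper `checkMinVersion` are hypothetical.

```go
package main

import "fmt"

// initialUpserts is a hypothetical constant: the number of membership
// upserts the test performs before reading the version.
const initialUpserts = 3

// checkMinVersion reports an error when ver is below the expected minimum.
// The same value drives both the comparison and the message.
func checkMinVersion(ver, minVer uint64) error {
	if ver < minVer {
		return fmt.Errorf("expected membership version >=%d, got %d", minVer, ver)
	}
	return nil
}
```

In a test this could be used as `if err := checkMinVersion(ver, initialUpserts); err != nil { t.Fatal(err) }`.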

Copilot AI review requested due to automatic review settings August 24, 2025 15:33

hyp3rd changed the title ~~feat: implement Phase 2 failure detection with heartbeat sampling and…~~ feat: implement Phase 2 failure detection with heartbeat sampling and hint caps Aug 24, 2025

Copilot AI reviewed Aug 24, 2025


hyp3rd merged commit 4bede19 into main Aug 24, 2025
4 of 5 checks passed


