@hyp3rd hyp3rd commented Aug 24, 2025

  • Add heartbeat peer sampling with configurable size (WithDistHeartbeatSample)
  • Implement node state transition metrics (suspect/dead counters)
  • Add global hint queue caps by count (WithDistHintMaxTotal) and bytes (WithDistHintMaxBytes)
  • Track membership version for cluster state changes
  • Expose membership snapshot API with state distribution
  • Add comprehensive test coverage for failure recovery, hint caps, and sampling
  • Update documentation to reflect Phase 2 completion status
  • Refactor hint replay logic for better maintainability
  • Add approximate byte accounting for queued hints with new metrics

This completes the experimental failure detection system outlined in Phase 2 of the roadmap. Heartbeat sampling improves scalability, and the global hint limits protect memory when many hints are queued.
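As a rough illustration of heartbeat peer sampling, the sketch below picks a random subset of peers to probe on each tick. The function name `samplePeers` and the rule that a sample size of zero means "probe every peer" are assumptions made for this example; the PR text does not show how `WithDistHeartbeatSample` is actually applied.

```go
package main

import (
	"fmt"
	"math/rand"
)

// samplePeers returns up to n peers chosen uniformly at random without
// replacement. Hypothetical helper: n <= 0 or n >= len(peers) is treated
// as "probe everyone", which is an assumption of this sketch.
func samplePeers(peers []string, n int) []string {
	out := make([]string, len(peers))
	copy(out, peers) // never reorder the caller's slice
	if n <= 0 || n >= len(out) {
		return out
	}
	// Partial Fisher-Yates shuffle: only the first n slots need settling.
	for i := 0; i < n; i++ {
		j := i + rand.Intn(len(out)-i)
		out[i], out[j] = out[j], out[i]
	}
	return out[:n]
}

func main() {
	peers := []string{"node-a", "node-b", "node-c", "node-d", "node-e"}
	// Each heartbeat tick would probe only this subset.
	fmt.Println(samplePeers(peers, 2))
}
```

With a fixed sample size, the probing cost per tick stays constant as the cluster grows, at the price of slower detection for any single peer.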

Copilot AI review requested due to automatic review settings August 24, 2025 15:33
@hyp3rd hyp3rd changed the title feat: implement Phase 2 failure detection with heartbeat sampling and… feat: implement Phase 2 failure detection with heartbeat sampling and hint caps Aug 24, 2025
Copilot AI left a comment

Pull Request Overview

This PR implements Phase 2 of the experimental failure detection system, adding heartbeat peer sampling, node state transition tracking, and global hint queue limits for better scalability and resource protection.

  • Implements configurable heartbeat sampling to probe random peers instead of all peers each tick
  • Adds global hint queue caps by count and bytes with metrics for dropped hints due to limits
  • Tracks membership version changes and exposes node state transitions (suspect/dead counters)
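A minimal sketch of how global caps by count and by approximate bytes might gate admission to a hint queue. The `hintQueue` type, its field names, and the convention that a zero cap disables the limit are all hypothetical; the PR only names the options `WithDistHintMaxTotal` and `WithDistHintMaxBytes`.

```go
package main

import "fmt"

// hintQueue is a hypothetical stand-in for global hint accounting.
type hintQueue struct {
	maxTotal int   // cap on queued hints; 0 disables it (assumption)
	maxBytes int64 // cap on approximate queued bytes; 0 disables it (assumption)
	count    int
	bytes    int64
	dropped  int64 // hints rejected because a global cap was reached
}

// tryEnqueue accounts for one hint of the given approximate size. It returns
// false and increments the drop counter when admitting the hint would exceed
// either cap.
func (q *hintQueue) tryEnqueue(size int64) bool {
	if q.maxTotal > 0 && q.count+1 > q.maxTotal {
		q.dropped++
		return false
	}
	if q.maxBytes > 0 && q.bytes+size > q.maxBytes {
		q.dropped++
		return false
	}
	q.count++
	q.bytes += size
	return true
}

// release undoes the accounting once a hint is replayed or expires.
func (q *hintQueue) release(size int64) {
	q.count--
	q.bytes -= size
}

func main() {
	q := &hintQueue{maxTotal: 2, maxBytes: 100}
	fmt.Println(q.tryEnqueue(40)) // admitted
	fmt.Println(q.tryEnqueue(70)) // rejected: would exceed the byte cap
	fmt.Println(q.tryEnqueue(50)) // admitted
	fmt.Println(q.tryEnqueue(1))  // rejected: would exceed the count cap
	fmt.Println("dropped:", q.dropped)
}
```

The sketch is not safe for concurrent use; a real queue shared across goroutines would need a mutex or atomic counters around the accounting.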

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pkg/backend/dist_memory.go | Core implementation of heartbeat sampling, global hint caps, and state transition metrics |
| internal/cluster/membership.go | Add membership versioning for cluster state change tracking |
| internal/cluster/version.go | New atomic version tracker for membership changes |
| tests/ | Comprehensive test coverage for failure recovery, hint caps, and heartbeat sampling |
| ROADMAP.md | Update Phase 2 status to completed |
| README.md | Document new features and configuration options |
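For readers unfamiliar with the pattern, an atomic version tracker for membership changes could look roughly like the sketch below. The type `membershipVersion` and its methods `Bump` and `Load` are hypothetical names; the actual contents of internal/cluster/version.go are not shown in this conversation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// membershipVersion is a hypothetical monotonically increasing counter that
// is bumped whenever cluster membership changes.
type membershipVersion struct {
	v atomic.Uint64 // requires Go 1.19 or newer
}

// Bump records one membership change and returns the new version.
func (m *membershipVersion) Bump() uint64 { return m.v.Add(1) }

// Load returns the current version without modifying it.
func (m *membershipVersion) Load() uint64 { return m.v.Load() }

func main() {
	var ver membershipVersion
	var wg sync.WaitGroup
	// Simulate 100 concurrent membership updates.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ver.Bump()
		}()
	}
	wg.Wait()
	fmt.Println(ver.Load()) // prints 100
}
```

Because every change bumps the counter exactly once, observers can compare two loaded values to tell whether membership changed in between without taking a lock.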

Comment on lines +1691 to +1692
func (dm *DistMemory) approxHintSize(item *cache.Item) int64 { // receiver retained for symmetry; may use config later
_ = dm // acknowledge receiver intentionally (satisfy lint under current rule set)
Copilot AI Aug 24, 2025

[nitpick] The comment and the unused-receiver acknowledgment are unnecessary. Consider dropping the receiver and making this a standalone function, unless the method form is needed for symmetry.

Suggested change
- func (dm *DistMemory) approxHintSize(item *cache.Item) int64 { // receiver retained for symmetry; may use config later
-     _ = dm // acknowledge receiver intentionally (satisfy lint under current rule set)
+ func approxHintSize(item *cache.Item) int64 {

Comment on lines +1 to +4
// Package tests provides shared test helpers (duplicate directory retained to appease earlier imports if any).
package tests

// (File intentionally left empty after consolidation of helpers.)
Copilot AI Aug 24, 2025

This empty file with a comment about consolidation suggests incomplete cleanup. Consider removing this file entirely if it's no longer needed.

Suggested change
- // Package tests provides shared test helpers (duplicate directory retained to appease earlier imports if any).
- package tests
- // (File intentionally left empty after consolidation of helpers.)

Comment on lines +72 to +73
if ver < 3 { // initial upserts already increment version; tolerate timing variance
t.Fatalf("expected membership version >=4, got %v", verAny)
Copilot AI Aug 24, 2025

[nitpick] The magic number 3 in the version check lacks clear justification. Consider calculating the expected minimum version based on the number of operations or using a named constant.

Suggested change
- if ver < 3 { // initial upserts already increment version; tolerate timing variance
-     t.Fatalf("expected membership version >=4, got %v", verAny)
+ if ver < initialUpserts { // initial upserts already increment version; tolerate timing variance
+     t.Fatalf("expected membership version >=%d, got %v", initialUpserts, verAny)
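The suggestion refers to `initialUpserts` without defining it. A minimal sketch of what it implies is shown below, with the threshold and the failure message tied to one named constant so they cannot drift apart. The value 3 mirrors the threshold in the quoted check and is an assumption, as is the helper `checkMembershipVersion`, which is not part of the PR.

```go
package main

import "fmt"

// initialUpserts is the minimum membership version expected after setup.
// The value 3 is an assumption taken from the quoted `ver < 3` check.
const initialUpserts = 3

// checkMembershipVersion is a hypothetical helper that reports an error when
// the observed version is below the expected minimum.
func checkMembershipVersion(ver uint64) error {
	if ver < initialUpserts {
		return fmt.Errorf("expected membership version >=%d, got %d", initialUpserts, ver)
	}
	return nil
}

func main() {
	fmt.Println(checkMembershipVersion(2)) // below the minimum: returns an error
	fmt.Println(checkMembershipVersion(5)) // at or above the minimum: returns nil
}
```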

@hyp3rd hyp3rd merged commit 4bede19 into main Aug 24, 2025
4 of 5 checks passed