-
-
Notifications
You must be signed in to change notification settings - Fork 0
feat: implement Phase 2 failure detection with heartbeat sampling and hint caps #52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… hint caps - Add heartbeat peer sampling with configurable size (WithDistHeartbeatSample) - Implement node state transition metrics (suspect/dead counters) - Add global hint queue caps by count (WithDistHintMaxTotal) and bytes (WithDistHintMaxBytes) - Track membership version for cluster state changes - Expose membership snapshot API with state distribution - Add comprehensive test coverage for failure recovery, hint caps, and sampling - Update documentation to reflect Phase 2 completion status - Refactor hint replay logic for better maintainability - Add approximate byte accounting for queued hints with new metrics This completes the experimental failure detection system outlined in the roadmap Phase 2, providing better scalability through sampling and resource protection via global hint limits.
|
Running Code Quality on PRs by uploading data to Trunk will soon be removed. You can still run checks on your PRs using trunk-action - see the migration guide for more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements Phase 2 of the experimental failure detection system, adding heartbeat peer sampling, node state transition tracking, and global hint queue limits for better scalability and resource protection.
- Implements configurable heartbeat sampling to probe random peers instead of all peers each tick
- Adds global hint queue caps by count and bytes with metrics for dropped hints due to limits
- Tracks membership version changes and exposes node state transitions (suspect/dead counters)
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/backend/dist_memory.go | Core implementation of heartbeat sampling, global hint caps, and state transition metrics |
| internal/cluster/membership.go | Add membership versioning for cluster state change tracking |
| internal/cluster/version.go | New atomic version tracker for membership changes |
| tests/ | Comprehensive test coverage for failure recovery, hint caps, and heartbeat sampling |
| ROADMAP.md | Update Phase 2 status to completed |
| README.md | Document new features and configuration options |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| func (dm *DistMemory) approxHintSize(item *cache.Item) int64 { // receiver retained for symmetry; may use config later | ||
| _ = dm // acknowledge receiver intentionally (satisfy lint under current rule set) |
Copilot
AI
Aug 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The comment and unused receiver acknowledgment is unnecessary. Consider making this a standalone function or remove the receiver if it's not needed for symmetry.
| func (dm *DistMemory) approxHintSize(item *cache.Item) int64 { // receiver retained for symmetry; may use config later | |
| _ = dm // acknowledge receiver intentionally (satisfy lint under current rule set) | |
| func approxHintSize(item *cache.Item) int64 { |
| // Package tests provides shared test helpers (duplicate directory retained to appease earlier imports if any). | ||
| package tests | ||
|
|
||
| // (File intentionally left empty after consolidation of helpers.) |
Copilot
AI
Aug 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This empty file with a comment about consolidation suggests incomplete cleanup. Consider removing this file entirely if it's no longer needed.
| // Package tests provides shared test helpers (duplicate directory retained to appease earlier imports if any). | |
| package tests | |
| // (File intentionally left empty after consolidation of helpers.) |
| if ver < 3 { // initial upserts already increment version; tolerate timing variance | ||
| t.Fatalf("expected membership version >=4, got %v", verAny) |
Copilot
AI
Aug 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The magic number 3 in the version check lacks clear justification. Consider calculating the expected minimum version based on the number of operations or using a named constant.
| if ver < 3 { // initial upserts already increment version; tolerate timing variance | |
| t.Fatalf("expected membership version >=4, got %v", verAny) | |
| if ver < initialUpserts { // initial upserts already increment version; tolerate timing variance | |
| t.Fatalf("expected membership version >=%d, got %v", initialUpserts, verAny) |
This completes the experimental failure detection system outlined in the roadmap Phase 2, providing better scalability through sampling and resource protection via global hint limits.