Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
📝 Walkthrough

The pull request extends remove operations across the mooncake store API by adding a `force` parameter to three key methods: `Remove`, `RemoveByRegex`, and `RemoveAll`. The `force` parameter (default `false`) bypasses lease and replication safety checks, threading through Python bindings, client implementations, service layers, and RPC wrappers down to the master service logic.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 1 passed, ❌ 2 failed (2 warnings)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mooncake-store/src/master_service.cpp (1)
1260-1381: Force removals orphan replication/processing state, causing logged errors in background cleanup.

When `force=true`, `Remove`/`RemoveByRegex`/`RemoveAll` erase metadata without clearing `replication_tasks` or `processing_keys`. This leaves stale task entries that background cleanup encounters and logs as errors ("Key X was removed with ongoing replication task"). Clear these structures alongside metadata erasure when force is enabled.

For `Remove()`, call `accessor.EraseReplicationTask()` and `accessor.EraseFromProcessing()` when `force=true` before erasing metadata. For `RemoveByRegex()` and `RemoveAll()`, explicitly erase keys from both `shard->replication_tasks` and `shard->processing_keys` before erasing metadata.
🧹 Nitpick comments (1)
mooncake-store/src/master_client.cpp (1)
530-539: Inconsistent log parameter name.

The log message uses `key=` but the parameter represents a regex pattern (`str`). This should be `regex=` for consistency with the `WrappedMasterService::RemoveByRegex` logging shown in the relevant snippets.

Suggested fix:

```diff
 tl::expected<long, ErrorCode> MasterClient::RemoveByRegex(
     const std::string& str, bool force) {
     ScopedVLogTimer timer(1, "MasterClient::RemoveByRegex");
-    timer.LogRequest("key=", str, ", force=", force);
+    timer.LogRequest("regex=", str, ", force=", force);
     auto result =
         invoke_rpc<&WrappedMasterService::RemoveByRegex, long>(str, force);
```
Pull request overview
This PR adds a force flag to the remove APIs (Remove, RemoveByRegex, and RemoveAll) in the Mooncake Store. The force flag allows bypassing safety checks (lease expiration, replica completion, and replication task checks) to enable forceful removal of objects even when they are in active use.
Changes:
- Added a boolean `force` parameter (default `false`) to Remove, RemoveByRegex, and RemoveAll methods across all API layers
- Added comprehensive test coverage for force removal scenarios including leased objects and processing objects
- Updated Python bindings to expose the force parameter with proper documentation
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| mooncake-store/tests/master_service_test.cpp | Adds 4 new test cases covering force removal scenarios: leased objects, regex removal, remove all, and processing objects |
| mooncake-store/src/master_service.cpp | Core implementation: adds force parameter and conditional checks that bypass safety validations when force=true |
| mooncake-store/src/rpc_service.cpp | RPC wrapper: propagates force parameter and adds it to request logging |
| mooncake-store/src/master_client.cpp | Master client: propagates force parameter and adds it to request logging |
| mooncake-store/src/client_service.cpp | Client service: propagates force parameter through to master client |
| mooncake-store/src/real_client.cpp | Real client implementation: adds force parameter to internal and public methods |
| mooncake-store/src/dummy_client.cpp | Dummy client implementation: adds force parameter and propagates to real client |
| mooncake-store/include/master_service.h | Header: adds force parameter with default value and updates documentation |
| mooncake-store/include/rpc_service.h | Header: adds force parameter with default value to RPC service interface |
| mooncake-store/include/master_client.h | Header: adds force parameter with default value and updates documentation |
| mooncake-store/include/client_service.h | Header: adds force parameter with default value and updates documentation |
| mooncake-store/include/real_client.h | Header: adds force parameter with default value to both public and internal methods |
| mooncake-store/include/dummy_client.h | Header: adds force parameter with default value |
| mooncake-store/include/pyclient.h | Header: adds force parameter with default value to virtual interface |
| mooncake-integration/store/store_py.cpp | Python bindings: exposes force parameter with proper argument naming and documentation strings |
```cpp
if (!force &&
    metadata_shards_[i].replication_tasks.contains(it->first)) {
    LOG(WARNING) << "key=" << it->first
                 << ", matched by regex, but has replication "
                    "task. Skipping removal.";
```
When force removal bypasses safety checks in RemoveByRegex, consider adding logging to indicate this occurred for each object. The current logging at lines 1332-1333 doesn't distinguish between normal and force removals. Adding information about which safety checks were bypassed (e.g., active lease, incomplete replicas, or replication tasks) would aid in troubleshooting and auditing.
```cpp
if (force || (it->second.IsLeaseExpired(now) &&
              it->second.AllReplicas(&Replica::fn_is_completed) &&
              !shard->replication_tasks.contains(it->first))) {
    auto mem_rep_count =
        it->second.CountReplicas(&Replica::fn_is_memory_replica);
    total_freed_size += it->second.size * mem_rep_count;
```
When `force=true` is used in RemoveAll, consider logging additional information about the operation. The current log at lines 1377-1379 doesn't indicate whether force was used or how many objects were removed despite having active leases or incomplete states. Adding the force parameter to the log would help with auditing and troubleshooting:

```cpp
VLOG(1) << "action=remove_all_objects, force=" << force
        << ", removed_count=" << removed_count
        << ", total_freed_size=" << total_freed_size;
```
```cpp
if (!force && accessor.HasReplicationTask()) {
    LOG(ERROR) << "key=" << key << ", error=object_has_replication_task";
    return tl::make_unexpected(ErrorCode::OBJECT_HAS_REPLICATION_TASK);
}
```
When `force=true`, the Remove method erases the object metadata without cleaning up any associated replication task (line 1286). This can leave orphaned replication tasks in the system that reference non-existent objects. Consider adding `if (accessor.HasReplicationTask()) { accessor.EraseReplicationTask(); }` before calling `accessor.Erase()` to properly clean up replication tasks even during force removal. This would prevent workers from fetching and attempting to execute tasks for objects that no longer exist.
```cpp
if (!force &&
    metadata_shards_[i].replication_tasks.contains(it->first)) {
    LOG(WARNING) << "key=" << it->first
                 << ", matched by regex, but has replication "
                    "task. Skipping removal.";
```
When `force=true`, RemoveByRegex skips the replication task check but doesn't clean up any associated replication tasks before removing objects. This can leave orphaned replication tasks in the system. Consider adding cleanup code to erase replication tasks for matched keys when `force=true`: `if (metadata_shards_[i].replication_tasks.contains(it->first)) { metadata_shards_[i].replication_tasks.erase(it->first); }` before erasing the metadata. This would prevent workers from fetching tasks for objects that no longer exist.
```cpp
    !shard->replication_tasks.contains(it->first))) {
    auto mem_rep_count =
        it->second.CountReplicas(&Replica::fn_is_memory_replica);
    total_freed_size += it->second.size * mem_rep_count;
```
When `force=true`, RemoveAll skips the replication task check but doesn't clean up any associated replication tasks before removing objects. This can leave orphaned replication tasks in the system. Consider adding cleanup code to erase replication tasks for objects being removed when `force=true`: `if (shard->replication_tasks.contains(it->first)) { shard->replication_tasks.erase(it->first); }` before erasing the metadata. This would prevent workers from fetching tasks for objects that no longer exist.
Suggested change:

```diff
     total_freed_size += it->second.size * mem_rep_count;
+    // When force=true, we may be removing objects that still have
+    // outstanding replication tasks. Clean them up to avoid
+    // leaving orphaned replication tasks.
+    if (force && shard->replication_tasks.contains(it->first)) {
+        shard->replication_tasks.erase(it->first);
+    }
```
```cpp
// Test force Remove with processing replicas (not complete)
TEST_F(MasterServiceTest, ForceRemoveProcessingObject) {
    std::unique_ptr<MasterService> service_(new MasterService());
    [[maybe_unused]] const auto context = PrepareSimpleSegment(*service_);
    const UUID client_id = generate_uuid();

    // Start a put but don't end it - object will be in PROCESSING state
    std::string key = "processing_key";
    uint64_t slice_length = 1024;
    ReplicateConfig config;
    config.replica_num = 1;
    auto put_start_result =
        service_->PutStart(client_id, key, slice_length, config);
    ASSERT_TRUE(put_start_result.has_value());

    // Normal remove should fail because replica is not ready
    auto remove_result_no_force = service_->Remove(key, false);
    EXPECT_FALSE(remove_result_no_force.has_value());
    EXPECT_EQ(ErrorCode::REPLICA_IS_NOT_READY, remove_result_no_force.error());

    // Force remove should succeed even with processing replica
    auto remove_result_force = service_->Remove(key, true);
    EXPECT_TRUE(remove_result_force.has_value());

    // Verify object is removed
    auto get_result = service_->GetReplicaList(key);
    EXPECT_FALSE(get_result.has_value());
    EXPECT_EQ(ErrorCode::OBJECT_NOT_FOUND, get_result.error());
}
```
The test suite is missing coverage for force removal of objects with active replication tasks. Consider adding a test case that creates a copy or move task for an object, then attempts to remove it with force=true. This would verify that force removal works correctly (or expose the orphaned task issue mentioned in other comments) when replication tasks are present. The test should verify both that the object is removed and check the state of the replication task afterward.
Description
Type of Change
How Has This Been Tested?
Checklist
- `./scripts/code_format.sh` before submitting.

Summary by CodeRabbit
Release Notes
New Features
Tests