Skip to content

[Store] feat: add cxl storage for mooncake store#8

Open
XucSh wants to merge 1 commit intomainfrom
pr1365-test
Open

[Store] feat: add cxl storage for mooncake store#8
XucSh wants to merge 1 commit intomainfrom
pr1365-test

Conversation

@XucSh
Copy link
Owner

@XucSh XucSh commented Jan 23, 2026

Description

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Summary by CodeRabbit

  • New Features

    • Added CXL (Compute Express Link) memory allocation support with configuration options for enabling and configuring CXL memory pools
    • Introduced protocol-aware segment mounting with CXL protocol support for memory management
    • Added base address accessor for transfer operations
  • Tests

    • New CXL client integration test suite with put/get and batch operation coverage

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 23, 2026

📝 Walkthrough

Walkthrough

This pull request introduces CXL (Compute Express Link) memory allocation support to the Mooncake system. It adds a new CxlAllocationStrategy, extends configuration structures to propagate CXL settings, updates client and transport APIs for protocol awareness, and includes comprehensive integration tests for the new functionality.

Changes

Cohort / File(s) Summary
CXL Allocation Strategy
mooncake-store/include/allocation_strategy.h
New CxlAllocationStrategy class implementing single-segment CXL memory allocation with slice validation, preferred segment selection, and PROCESSING replica return on success.
Allocator & Buffer Management
mooncake-store/include/allocator.h, mooncake-store/src/allocator.cpp
Extended AllocatedBuffer with CXL protocol support: added change_to_cxl(), get_vaddr_from_cxl() methods, new segment_name_ and protocol fields, and descriptor reflection updates. Deallocation path now respects CXL protocol when freeing buffers.
Configuration Propagation
mooncake-store/include/master_config.h
Added enable_cxl, cxl_path, cxl_size fields across MasterConfig, MasterServiceSupervisorConfig, WrappedMasterServiceConfig, MasterServiceConfig, and MasterServiceConfigBuilder with corresponding getters/setters and fluent builder methods.
Client Service Protocol Support
mooncake-store/include/client_service.h, mooncake-store/src/client_service.cpp
MountSegment() signature now accepts optional protocol parameter (default "tcp"); added GetBaseAddr() accessor; constructor extended to store protocol_ member. Transport initialization includes CXL protocol branching; replication config derives client_cfg for CXL protocol.
Master Service Integration
mooncake-store/include/master_service.h, mooncake-store/src/master_service.cpp
Added cxl_path_, cxl_size_, enable_cxl_ data members; removed ResolvePath() method. Conditional allocation strategy selection: uses CxlAllocationStrategy when enable_cxl is true, otherwise RandomAllocationStrategy.
Segment Manager & CXL Allocator
mooncake-store/include/segment.h, mooncake-store/src/segment.cpp
Constructor now accepts enable_cxl parameter; added initializeCxlAllocator() public method and cxl_global_allocator_ member. MountSegment() includes CXL-specific path skipping normal allocation checks; CommitUnmountSegment() handles CXL segment capacity tracking.
Type Definitions & Constants
mooncake-store/include/types.h
Added DEFAULT_CXL_PATH, DEFAULT_CXL_BASE, DEFAULT_CXL_SIZE constants; extended Segment struct with protocol field and updated reflection macro.
CLI Configuration & Logging
mooncake-store/src/master.cpp
CLI flags for enable_cxl, cxl_path, cxl_size added; populates MasterConfig from defaults and command-line arguments; startup logging extended with CXL settings.
Real Client CXL Path
mooncake-store/src/real_client.cpp
Added protocol-conditional branching in setup_internal(): CXL path reads MC_CXL_DEV_SIZE environment variable and mounts single CXL segment; non-CXL path preserves original multi-segment mounting loop with hugepage/allocator-backed memory.
Transfer Engine Base Address Accessor
mooncake-transfer-engine/include/{transfer_engine,transfer_engine_impl,multi_transport}.h, mooncake-transfer-engine/src/{transfer_engine,transfer_engine_impl,multi_transport}.cpp
Added getBaseAddr() public accessor methods delegating through transfer engine hierarchy; MultiTransport::getBaseAddr() conditionally retrieves CXL transport base address when USE_CXL defined.
Test Infrastructure
mooncake-store/tests/CMakeLists.txt, mooncake-store/tests/client_integration_test.cpp
New test target cxl_client_integration_test added to build system; existing client_integration_test.cpp updated with FLAGS_protocol argument in MountSegment() calls.
CXL Integration Test Suite
mooncake-store/tests/cxl_client_integration_test.cpp
Comprehensive new 396-line Google Test suite for CXL client under mooncake::testing namespace with: client lifecycle management, log interception for client_id capture, segment initialization with SimpleAllocator, and two test cases (BasicPutGetOperations, BatchPutGetOperations) validating CXL-enabled put/get operations.
Submodule Update
extern/pybind11
Pybind11 submodule pointer updated; no functional changes.

Sequence Diagram

sequenceDiagram
    participant Client
    participant MasterService
    participant SegmentManager
    participant CxlAllocationStrategy
    participant Allocator
    
    Client->>MasterService: Create (with protocol="cxl")
    MasterService->>MasterService: Load CXL config (enable_cxl, cxl_path, cxl_size)
    MasterService->>SegmentManager: Initialize (enable_cxl=true)
    SegmentManager->>Allocator: initializeCxlAllocator(cxl_path, cxl_size)
    Allocator->>Allocator: Create CXL BufferAllocator
    
    Client->>Client: setup_internal (protocol="cxl")
    Client->>Client: GetBaseAddr() from transfer_engine
    Client->>SegmentManager: MountSegment(ptr, size, protocol="cxl")
    
    SegmentManager->>SegmentManager: Detect CXL protocol
    SegmentManager->>Allocator: Use cxl_global_allocator_
    Allocator->>Allocator: Register allocator for segment
    
    Client->>MasterService: Put/BatchPut with CXL segment
    MasterService->>CxlAllocationStrategy: Allocate (preferred_segments=[cxl_segment])
    CxlAllocationStrategy->>Allocator: Allocate from CXL allocator
    Allocator->>Allocator: Allocate buffer, mark as CXL
    CxlAllocationStrategy->>MasterService: Return replica with CXL buffer
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A hop through CXL's memory express,
Where Mooncake allocates its best!
From config to client, a protocol dance,
Fast CXL paths make performance prance. 🚀
New strategies bloom, tests verify all,
The bunny's buffering builds a grand hall!

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is an empty template with all sections incomplete; no actual description of changes, type of change selections, testing details, or checklist items are provided. Fill in the Description section explaining what CXL storage feature is being added and why; select 'New feature' → 'Mooncake Store' under Type of Change; describe the tests performed; and check all applicable checklist items.
Docstring Coverage ⚠️ Warning Docstring coverage is 15.28% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely identifies the main change: adding CXL storage support to the mooncake store, which aligns with the substantial implementation across header and source files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🤖 Fix all issues with AI agents
In `@mooncake-store/include/allocation_strategy.h`:
- Around line 362-419: The CxlAllocationStrategy::Allocate currently always uses
preferred_segments[0] and returns one replica regardless of replica_num; update
Allocate to (1) validate that preferred_segments is non-empty and choose the
first preferred segment that is not in excluded_segments (if none available
return ErrorCode::INVALID_PARAMS or NO_AVAILABLE_HANDLE), (2) enforce
replica_num by attempting to allocate replica_num buffers from the selected
allocator (use allocator_manager.getAllocators(cxl_segment_name) and
BufferAllocatorBase::allocate in a loop), pushing each buffer after calling
change_to_cxl into replicas, and (3) on any per-replica allocation failure,
release/free already-allocated buffers and return
tl::make_unexpected(ErrorCode::NO_AVAILABLE_HANDLE); keep existing logging and
error returns otherwise.

In `@mooncake-store/src/master_service.cpp`:
- Around line 85-91: The code sets allocation_strategy_ to CxlAllocationStrategy
then calls segment_manager_.initializeCxlAllocator(cxl_path_, cxl_size_) which
can throw, so wrap that initialization in a try-catch that catches
std::exception (and optionally ...) and handles failures by logging the error
(include e.what()), falling back to creating a RandomAllocationStrategy (or
disabling enable_cxl_), and ensuring allocation_strategy_ is left in a valid
state; alternatively modify initializeCxlAllocator to return a success/failure
boolean and handle the false case by logging and switching to
RandomAllocationStrategy; reference enable_cxl_, allocation_strategy_,
CxlAllocationStrategy, segment_manager_.initializeCxlAllocator, cxl_path_, and
cxl_size_ when making the change.

In `@mooncake-store/src/real_client.cpp`:
- Around line 265-334: MC_CXL_DEV_SIZE is not validated and
client_->GetBaseAddr() is not checked: ensure the parsed MC_CXL_DEV_SIZE yields
a positive non-zero cxl_dev_size (reject invalid/zero values and return
tl::unexpected(ErrorCode::INVALID_PARAMS) with a logged error) and check that
client_->GetBaseAddr() != nullptr before calling client_->MountSegment; if base
addr is null log an error and return tl::unexpected(ErrorCode::INVALID_PARAMS).
Update the cxl branch around the getenv/strtoull parsing and the pointer
retrieval (symbols: MC_CXL_DEV_SIZE, cxl_dev_size, client_->GetBaseAddr(),
client_->MountSegment) to perform these validations and early returns.

In `@mooncake-store/src/segment.cpp`:
- Around line 221-236: The unmount path skips calling
MasterMetricManager::instance().dec_total_mem_capacity when segment.protocol ==
"cxl", causing capacity to be incremented on mount but not decremented on
unmount; make the accounting symmetric by always decrementing total mem capacity
for the segment you found. In practice, ensure you only call
dec_total_mem_capacity when the segment was located (use the same found
condition used to set segment_name/is_cxl), and remove the is_cxl short-circuit
so that MasterMetricManager::instance().dec_total_mem_capacity(segment_name,
metrics_dec_capacity) is invoked for CXL segments as well (alternatively move
CXL capacity tracking to a one-time initializer if you intend per-system
accounting).
- Around line 12-38: The CXL branch in the mounting logic skips the duplicate
segment-id and size validation and can accept size==0, causing retries to
overwrite state and double-count metrics; before using segment.id/segment.name
and adding allocators/metrics in the CXL path (the block guarded by
segment_manager_->enable_cxl_ && segment.protocol == "cxl" and
BufferAllocatorType::CACHELIB), add the same guardrails used by the non‑CXL
path: verify size > 0, check that segment_manager_->mounted_segments_ does not
already contain segment.id (and/or segment_manager_->client_by_name_ does not
already map the name), and return the appropriate error code instead of
proceeding; only after those checks pass should you initialize allocator, call
allocator_manager_.addAllocator(...), update client_segments_, mounted_segments_
and client_by_name_, and increment metrics.

In `@mooncake-store/tests/client_integration_test.cpp`:
- Around line 145-146: Update the help text for the FLAGS_protocol flag to
explicitly include "cxl" (e.g., add "cxl" to the list of supported protocols in
the flag description) so it reflects that FLAGS_protocol is passed into
MountSegment; change both occurrences of the flag description associated with
FLAGS_protocol referenced in the tests to mention "cxl" to avoid confusion when
running CXL tests (search for FLAGS_protocol and the DEFINE_string/flag
declaration near its uses and update the help string).

In `@mooncake-store/tests/cxl_client_integration_test.cpp`:
- Around line 265-276: The test allocates 1 * 1024 * 1024 bytes via
client_buffer_allocator_->allocate but deallocates only test_data.size(),
causing a size mismatch; change the deallocation to match the allocated size
(e.g., store size_t allocated_size = 1 * 1024 * 1024 or use the same expression)
and call client_buffer_allocator_->deallocate(buffer, allocated_size) (instead
of deallocating test_data.size()) so allocate(...) and deallocate(...) use the
same size for buffer.
- Around line 90-104: The CreateClient function currently uses a hardcoded
master server address ("10.130.5.132:50051"); add a gflags string flag (e.g.,
DEFINE_string(master_server_address, "10.130.5.132:50051", "Master server
address for testing")) near the top of the test file and replace the literal in
CreateClient with FLAGS_master_server_address so tests can be configured per
environment (refer to CreateClient and the hardcoded address string to locate
the change).
- Around line 300-329: The test leaks memory because buffers allocated via
client_buffer_allocator_->allocate for each Slice stored in batched_slices are
never deallocated; after the BatchPut call completes (test_client_->BatchPut),
iterate the batched_slices (and any target_buffer if used) and call the matching
deallocation on the allocator for each Slice pointer (e.g., deallocate the void*
stored in each Slice) to free all allocated buffers before the test ends; ensure
you handle all entries in batched_slices and clear the container afterward.
- Around line 27-31: The test hardcodes an environment-specific IP in the
FLAGS/DEFINE defaults (DEFINE_string cxl_device_name and
transfer_engine_metadata_url) which breaks portability; update the DEFINE_string
defaults to use localhost or read from an environment variable (e.g.,
getenv("CXL_DEVICE_NAME") / getenv("TRANSFER_ENGINE_METADATA_URL")) instead of
"10.130.5.132", and in the test suite SetUpTestSuite (or equivalent test fixture
initialization) validate that the transfer_engine_metadata_url is present and
fail fast with a clear message if it is missing or empty; reference the
DEFINE_string symbols cxl_device_name and transfer_engine_metadata_url and the
SetUpTestSuite test fixture to locate where to change defaults and add the
validation.
- Around line 129-134: TearDownTestSuite is calling master_.Stop() although the
static InProcMaster master_ is never started; either remove the unused master_
member and its master_.Stop() call in TearDownTestSuite (also remove master_
from the class static members) or ensure the master is actually started by
adding a master_.Start() (and any necessary configuration) in SetUpTestSuite;
locate references to master_, InProcMaster, TearDownTestSuite and SetUpTestSuite
to apply the chosen fix consistently.
🧹 Nitpick comments (9)
mooncake-store/include/master_service.h (1)

915-918: Consider making enable_cxl_ const for consistency.

cxl_path_ and cxl_size_ are declared const, but enable_cxl_ is not. If this is a configuration value set at construction and not modified afterwards, consider making it const bool enable_cxl_; for consistency with the other CXL-related members.

♻️ Suggested change
     const std::string cxl_path_;
     const size_t cxl_size_;
-    bool enable_cxl_;
+    const bool enable_cxl_;
mooncake-store/include/allocator.h (2)

75-76: Pass client_segment_name by const reference.

The change_to_cxl method takes std::string by value. For efficiency, consider passing by const std::string& to avoid unnecessary string copies.

♻️ Suggested change
-    void change_to_cxl(std::string client_segment_name);
+    void change_to_cxl(const std::string& client_segment_name);

80-83: Inconsistent member naming: protocol should be protocol_.

The new protocol member at line 83 doesn't follow the naming convention used by other private members in this class (allocator_, buffer_ptr_, size_, offset_handle_, segment_name_). Renaming to protocol_ would maintain consistency.

♻️ Suggested change
     std::string segment_name_;
     void* buffer_ptr_{nullptr};
     std::size_t size_{0};
-    std::string protocol{"tcp"};
+    std::string protocol_{"tcp"};
mooncake-store/src/segment.cpp (1)

310-321: Guard against re-initializing the CXL allocator

Re-running initialization could replace the allocator while segments still use it. Consider an idempotency guard to prevent accidental reinit.

♻️ Suggested refinement
 void SegmentManager::initializeCxlAllocator(const std::string& cxl_path,
                                             const size_t cxl_size) {
+    if (cxl_global_allocator_) {
+        LOG(WARNING) << "CXL global allocator already initialized";
+        return;
+    }
     VLOG(1) << "Init CXL global allocator.";
mooncake-store/src/allocator.cpp (1)

49-59: Add defensive checks before converting to CXL offset

change_to_cxl subtracts DEFAULT_CXL_BASE unconditionally; if called twice or if the address is below the base, it can underflow and later free the wrong address. A small guard makes this safer.

♻️ Suggested refinement
 void AllocatedBuffer::change_to_cxl(std::string client_segment_name) {
     uint64_t offset_raw = reinterpret_cast<uintptr_t>(buffer_ptr_);
+    if (protocol == "cxl") {
+        LOG(WARNING) << "change_to_cxl called twice; ignoring";
+        return;
+    }
+    if (offset_raw < DEFAULT_CXL_BASE) {
+        LOG(ERROR) << "CXL base underflow: " << offset_raw;
+        return;
+    }
     buffer_ptr_ = reinterpret_cast<void*>(offset_raw - DEFAULT_CXL_BASE);
     protocol = "cxl";
     segment_name_ = client_segment_name;
 }
mooncake-store/src/client_service.cpp (1)

1616-1617: Consider adding a null check for defensive programming.

GetBaseAddr() directly dereferences transfer_engine_ without verifying it's initialized. While the transfer engine is typically set during Client::Create, calling this method before initialization completes could cause undefined behavior.

🛡️ Optional defensive null check
-void* Client::GetBaseAddr() { return transfer_engine_->getBaseAddr(); }
+void* Client::GetBaseAddr() {
+    if (!transfer_engine_) {
+        LOG(ERROR) << "GetBaseAddr called before transfer engine initialized";
+        return nullptr;
+    }
+    return transfer_engine_->getBaseAddr();
+}
mooncake-store/tests/cxl_client_integration_test.cpp (2)

172-195: Consider using RAII for log sink management.

The manual new/delete pattern for test_client_sink is error-prone. If an exception occurs between allocation and deletion, memory will leak. Using std::unique_ptr would be safer.

♻️ Suggested RAII-based approach
 static void InitializeClients() {
     // This client is used for testing purposes.
     // Capture test_client_ client_id from logs
-    ClientIdCaptureSink* test_client_sink = new ClientIdCaptureSink();
+    auto test_client_sink = std::make_unique<ClientIdCaptureSink>();
-    google::AddLogSink(test_client_sink);
+    google::AddLogSink(test_client_sink.get());

     test_client_ = CreateClient("localhost:17813");
     ASSERT_TRUE(test_client_ != nullptr);

     // Wait for logs to flush
     std::this_thread::sleep_for(std::chrono::milliseconds(200));
-    google::RemoveLogSink(test_client_sink);
+    google::RemoveLogSink(test_client_sink.get());

     if (!test_client_sink->captured_client_id.empty()) {
         UUID extracted_id =
             ParseClientId(test_client_sink->captured_client_id);
         if (extracted_id.first != 0 || extracted_id.second != 0) {
             test_client_id_ = extracted_id;
             LOG(INFO) << "Captured test_client_id: "
                       << FormatClientId(test_client_id_);
         }
     }
-    delete test_client_sink;
+    // unique_ptr automatically handles cleanup
 }

220-229: Remove unused static members.

Several static members are declared but never used: segment_ptr_, ram_buffer_size_, master_address_, metadata_url_, and is_cxl. This is dead code that adds confusion.

♻️ Remove unused declarations
     static std::unique_ptr<SimpleAllocator> client_buffer_allocator_;
-    static void* segment_ptr_;
-    static size_t ram_buffer_size_;
     static void* test_client_segment_ptr_;
     static size_t test_client_ram_buffer_size_;
     static uint64_t default_kv_lease_ttl_;
-    static InProcMaster master_;
-    static std::string master_address_;
-    static std::string metadata_url_;
     static UUID test_client_id_;
-    static inline bool is_cxl = false;
 };

 // Static members initialization
 std::shared_ptr<Client> ClientIntegrationTestCxl::test_client_ = nullptr;
-void* ClientIntegrationTestCxl::segment_ptr_ = nullptr;
 void* ClientIntegrationTestCxl::test_client_segment_ptr_ = nullptr;
 std::unique_ptr<SimpleAllocator>
     ClientIntegrationTestCxl::client_buffer_allocator_ = nullptr;
-size_t ClientIntegrationTestCxl::ram_buffer_size_ = 0;
 size_t ClientIntegrationTestCxl::test_client_ram_buffer_size_ = 0;
 uint64_t ClientIntegrationTestCxl::default_kv_lease_ttl_ = 0;
-InProcMaster ClientIntegrationTestCxl::master_;
-std::string ClientIntegrationTestCxl::master_address_;
-std::string ClientIntegrationTestCxl::metadata_url_;
 UUID ClientIntegrationTestCxl::test_client_id_{0, 0};
mooncake-store/include/master_config.h (1)

531-547: Clarify intent with a comment for the CXL allocator override.

The logic at lines 546-547 silently overrides the configured memory_allocator to CACHELIB when CXL is enabled. While this is likely intentional (CXL requires CACHELIB), a brief comment would help future maintainers understand this constraint.

📝 Add explanatory comment
-        auto cxl_allocator_type = BufferAllocatorType::CACHELIB;
-
+        // CXL mode requires CACHELIB allocator for proper memory management
+        constexpr auto cxl_allocator_type = BufferAllocatorType::CACHELIB;
         default_kv_lease_ttl = config.default_kv_lease_ttl;
         ...
         memory_allocator =
             config.enable_cxl ? cxl_allocator_type : config.memory_allocator;

Comment on lines +362 to +419
class CxlAllocationStrategy : public AllocationStrategy {
public:
CxlAllocationStrategy() = default;
tl::expected<std::vector<Replica>, ErrorCode> Allocate(
const AllocatorManager& allocator_manager, const size_t slice_length,
const size_t replica_num = 1,
const std::vector<std::string>& preferred_segments =
std::vector<std::string>(),
const std::set<std::string>& excluded_segments =
std::set<std::string>()) {
if (slice_length == 0 || replica_num == 0) {
return tl::make_unexpected(ErrorCode::INVALID_PARAMS);
}

if (preferred_segments.empty()) {
LOG(ERROR) << "Preferred_segments is empty.";
return tl::make_unexpected(ErrorCode::INVALID_PARAMS);
}

const std::string& cxl_segment_name = preferred_segments[0];

VLOG(1) << "Do cxl allocate, overwritten segment=" << cxl_segment_name;

const auto cxl_allocators =
allocator_manager.getAllocators(cxl_segment_name);

if (cxl_allocators == nullptr || cxl_allocators->size() == 0) {
return tl::make_unexpected(ErrorCode::NO_AVAILABLE_HANDLE);
}
std::shared_ptr<BufferAllocatorBase> cxl_allocator =
(*cxl_allocators)[0];
if (!cxl_allocator) {
LOG(ERROR) << "No CXL allocator in preferred_segment";
return tl::make_unexpected(ErrorCode::NO_AVAILABLE_HANDLE);
}

std::vector<Replica> replicas;
replicas.reserve(replica_num);

auto buffer = cxl_allocator->allocate(slice_length);
if (!buffer) {
return tl::make_unexpected(ErrorCode::NO_AVAILABLE_HANDLE);
}

buffer->change_to_cxl(cxl_segment_name);
replicas.emplace_back(std::move(buffer), ReplicaStatus::PROCESSING);

VLOG(1) << "Successfully allocated " << replicas.size()
<< " CXL replica.";
return replicas;
}

tl::expected<Replica, ErrorCode> AllocateFrom(
const AllocatorManager& allocator_manager, const size_t slice_length,
const std::string& segment_name) {
return tl::make_unexpected(ErrorCode::NO_AVAILABLE_HANDLE);
}
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

CXL allocation ignores exclusions and replica count

The strategy always allocates from preferred_segments[0] even if it’s excluded, and it silently returns a single replica even when replica_num > 1. That violates caller constraints and replication expectations. Validate exclusions and explicitly reject/handle replica counts > 1.

🔧 Suggested fix
 const std::string& cxl_segment_name = preferred_segments[0];
+if (excluded_segments.contains(cxl_segment_name)) {
+    return tl::make_unexpected(ErrorCode::NO_AVAILABLE_HANDLE);
+}
+if (replica_num != 1) {
+    LOG(ERROR) << "CXL allocation supports only 1 replica, replica_num="
+               << replica_num;
+    return tl::make_unexpected(ErrorCode::INVALID_PARAMS);
+}
 
 VLOG(1) << "Do cxl allocate, overwritten segment=" << cxl_segment_name;
🤖 Prompt for AI Agents
In `@mooncake-store/include/allocation_strategy.h` around lines 362 - 419, The
CxlAllocationStrategy::Allocate currently always uses preferred_segments[0] and
returns one replica regardless of replica_num; update Allocate to (1) validate
that preferred_segments is non-empty and choose the first preferred segment that
is not in excluded_segments (if none available return ErrorCode::INVALID_PARAMS
or NO_AVAILABLE_HANDLE), (2) enforce replica_num by attempting to allocate
replica_num buffers from the selected allocator (use
allocator_manager.getAllocators(cxl_segment_name) and
BufferAllocatorBase::allocate in a loop), pushing each buffer after calling
change_to_cxl into replicas, and (3) on any per-replica allocation failure,
release/free already-allocated buffers and return
tl::make_unexpected(ErrorCode::NO_AVAILABLE_HANDLE); keep existing logging and
error returns otherwise.

Comment on lines +85 to +91
if (enable_cxl_) {
allocation_strategy_ = std::make_shared<CxlAllocationStrategy>();
segment_manager_.initializeCxlAllocator(cxl_path_, cxl_size_);
VLOG(1) << "action=start_cxl_global_allocator";
} else {
allocation_strategy_ = std::make_shared<RandomAllocationStrategy>();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -nP 'initializeCxlAllocator' --type=cpp --type=hpp

Repository: XucSh/Mooncake

Length of output: 85


🏁 Script executed:

rg -nP 'initializeCxlAllocator' --type=cpp

Repository: XucSh/Mooncake

Length of output: 375


🏁 Script executed:

# Read the function declaration
sed -n '220,230p' mooncake-store/include/segment.h

# Read the function implementation
sed -n '310,330p' mooncake-store/src/segment.cpp

Repository: XucSh/Mooncake

Length of output: 1088


🏁 Script executed:

# Read more context around the call in master_service.cpp
sed -n '75,100p' mooncake-store/src/master_service.cpp

Repository: XucSh/Mooncake

Length of output: 979


🏁 Script executed:

# Search for cxl_path_ initialization and validation
rg -nP 'cxl_path_\s*=' mooncake-store/src/master_service.cpp | head -20

# Search for CachelibBufferAllocator constructor
rg -nP 'CachelibBufferAllocator' --type=cpp -A 5 | head -40

Repository: XucSh/Mooncake

Length of output: 3234


🏁 Script executed:

# Look for try-catch or other error handling in the initialization function
rg -nP 'try|catch|throw' mooncake-store/src/master_service.cpp | head -20

# Get the full context of the Init function
rg -nB 30 'if \(enable_cxl_\)' mooncake-store/src/master_service.cpp | grep -A 30 'void MasterService::Init'

Repository: XucSh/Mooncake

Length of output: 518


🏁 Script executed:

# Find the complete Init function to see if it has try-catch
rg -nP 'void MasterService::Init' mooncake-store/src/master_service.cpp -A 60 | head -80

Repository: XucSh/Mooncake

Length of output: 40


🏁 Script executed:

# Try finding Init function with different pattern
sed -n '1,100p' mooncake-store/src/master_service.cpp | tail -50

Repository: XucSh/Mooncake

Length of output: 1982


Add error handling for CXL allocator initialization.
The initializeCxlAllocator() function has a void return type and contains no try-catch, yet it calls CachelibBufferAllocator which can throw exceptions if the size or base address is invalid (as documented in segment.cpp:73-74 and shown by the try-catch in addSegment()). If initialization fails, the service will proceed with CxlAllocationStrategy but a potentially uninitialized allocator. Either wrap the call in try-catch, change the return type to report errors, or validate cxl_path_ and cxl_size_ before calling the function.

🤖 Prompt for AI Agents
In `@mooncake-store/src/master_service.cpp` around lines 85 - 91, The code sets
allocation_strategy_ to CxlAllocationStrategy then calls
segment_manager_.initializeCxlAllocator(cxl_path_, cxl_size_) which can throw,
so wrap that initialization in a try-catch that catches std::exception (and
optionally ...) and handles failures by logging the error (include e.what()),
falling back to creating a RandomAllocationStrategy (or disabling enable_cxl_),
and ensuring allocation_strategy_ is left in a valid state; alternatively modify
initializeCxlAllocator to return a success/failure boolean and handle the false
case by logging and switching to RandomAllocationStrategy; reference
enable_cxl_, allocation_strategy_, CxlAllocationStrategy,
segment_manager_.initializeCxlAllocator, cxl_path_, and cxl_size_ when making
the change.

Comment on lines +265 to 334
if (protocol == "cxl") {
size_t cxl_dev_size = 0;
const char *env = std::getenv("MC_CXL_DEV_SIZE");
if (env) {
char *end = nullptr;
unsigned long long val = strtoull(env, &end, 10);
if (end != env && *end == '\0')
cxl_dev_size = static_cast<size_t>(val);
} else {
ptr =
allocate_buffer_allocator_memory(segment_size, this->protocol);
}

if (!ptr) {
LOG(ERROR) << "Failed to allocate segment memory";
LOG(FATAL) << "MC_CXL_DEV_SIZE not set";
return tl::unexpected(ErrorCode::INVALID_PARAMS);
}
if (this->protocol == "ascend") {
ascend_segment_ptrs_.emplace_back(ptr);
} else if (should_use_hugepage) {
hugepage_segment_ptrs_.emplace_back(
ptr, HugepageSegmentDeleter{mapped_size});
} else {
segment_ptrs_.emplace_back(ptr);
}
auto mount_result = client_->MountSegment(ptr, mapped_size);

void *ptr = client_->GetBaseAddr();
LOG(INFO) << "Mounting CXL segment: " << cxl_dev_size << " bytes, "
<< ptr;
auto mount_result = client_->MountSegment(ptr, cxl_dev_size, protocol);
if (!mount_result.has_value()) {
LOG(ERROR) << "Failed to mount segment: "
<< toString(mount_result.error());
return tl::unexpected(mount_result.error());
}
}
if (total_glbseg_size == 0) {
LOG(INFO) << "Global segment size is 0, skip mounting segment";

} else {
auto max_mr_size = globalConfig().max_mr_size; // Max segment size
uint64_t total_glbseg_size = global_segment_size; // For logging
uint64_t current_glbseg_size = 0; // For logging
while (global_segment_size > 0) {
size_t segment_size = std::min(global_segment_size, max_mr_size);
global_segment_size -= segment_size;
current_glbseg_size += segment_size;
LOG(INFO) << "Mounting segment: " << segment_size << " bytes, "
<< current_glbseg_size << " of " << total_glbseg_size;

size_t mapped_size = segment_size;
void *ptr = nullptr;
if (should_use_hugepage) {
mapped_size =
align_up(segment_size, get_hugepage_size_from_env());
ptr = allocate_buffer_mmap_memory(mapped_size,
get_hugepage_size_from_env());
} else {
ptr = allocate_buffer_allocator_memory(segment_size,
this->protocol);
}

if (!ptr) {
LOG(ERROR) << "Failed to allocate segment memory";
return tl::unexpected(ErrorCode::INVALID_PARAMS);
}
if (this->protocol == "ascend") {
ascend_segment_ptrs_.emplace_back(ptr);
} else if (should_use_hugepage) {
hugepage_segment_ptrs_.emplace_back(
ptr, HugepageSegmentDeleter{mapped_size});
} else {
segment_ptrs_.emplace_back(ptr);
}
auto mount_result =
client_->MountSegment(ptr, mapped_size, protocol);
if (!mount_result.has_value()) {
LOG(ERROR) << "Failed to mount segment: "
<< toString(mount_result.error());
return tl::unexpected(mount_result.error());
}
}
if (total_glbseg_size == 0) {
LOG(INFO) << "Global segment size is 0, skip mounting segment";
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate MC_CXL_DEV_SIZE and base address before mounting

If MC_CXL_DEV_SIZE is set but invalid (or zero), cxl_dev_size stays 0 and a zero-sized segment gets mounted. Also, GetBaseAddr() could return null. Validate the env var and size, and guard the base address.

🔧 Suggested fix
 if (protocol == "cxl") {
-    size_t cxl_dev_size = 0;
     const char *env = std::getenv("MC_CXL_DEV_SIZE");
-    if (env) {
-        char *end = nullptr;
-        unsigned long long val = strtoull(env, &end, 10);
-        if (end != env && *end == '\0')
-            cxl_dev_size = static_cast<size_t>(val);
-    } else {
-        LOG(FATAL) << "MC_CXL_DEV_SIZE not set";
-        return tl::unexpected(ErrorCode::INVALID_PARAMS);
-    }
+    if (!env) {
+        LOG(ERROR) << "MC_CXL_DEV_SIZE not set";
+        return tl::unexpected(ErrorCode::INVALID_PARAMS);
+    }
+    char *end = nullptr;
+    unsigned long long val = strtoull(env, &end, 10);
+    if (end == env || *end != '\0' || val == 0) {
+        LOG(ERROR) << "Invalid MC_CXL_DEV_SIZE: " << env;
+        return tl::unexpected(ErrorCode::INVALID_PARAMS);
+    }
+    size_t cxl_dev_size = static_cast<size_t>(val);
 
     void *ptr = client_->GetBaseAddr();
+    if (!ptr) {
+        LOG(ERROR) << "CXL base address is null";
+        return tl::unexpected(ErrorCode::INTERNAL_ERROR);
+    }
     LOG(INFO) << "Mounting CXL segment: " << cxl_dev_size << " bytes, "
               << ptr;
🤖 Prompt for AI Agents
In `@mooncake-store/src/real_client.cpp` around lines 265 - 334, MC_CXL_DEV_SIZE
is not validated and client_->GetBaseAddr() is not checked: ensure the parsed
MC_CXL_DEV_SIZE yields a positive non-zero cxl_dev_size (reject invalid/zero
values and return tl::unexpected(ErrorCode::INVALID_PARAMS) with a logged error)
and check that client_->GetBaseAddr() != nullptr before calling
client_->MountSegment; if base addr is null log an error and return
tl::unexpected(ErrorCode::INVALID_PARAMS). Update the cxl branch around the
getenv/strtoull parsing and the pointer retrieval (symbols: MC_CXL_DEV_SIZE,
cxl_dev_size, client_->GetBaseAddr(), client_->MountSegment) to perform these
validations and early returns.

Comment on lines +12 to +38
// Check if cxl storage is enable
if (segment_manager_->enable_cxl_ && segment.protocol == "cxl") {
LOG(INFO) << "Start Mounting CXL Segment.";
if (segment_manager_->memory_allocator_ ==
BufferAllocatorType::CACHELIB) {
auto allocator = segment_manager_->cxl_global_allocator_;
if (segment_manager_->cxl_global_allocator_ == nullptr) {
LOG(ERROR) << "Cxl global allocator has not been initialized.";
return ErrorCode::INTERNAL_ERROR;
}
segment_manager_->allocator_manager_.addAllocator(segment.name,
allocator);
segment_manager_->client_segments_[client_id].push_back(segment.id);
segment_manager_->mounted_segments_[segment.id] = {
segment, SegmentStatus::OK, allocator};
segment_manager_->client_by_name_[segment.name] = client_id;
MasterMetricManager::instance().inc_total_mem_capacity(segment.name,
size);

LOG(INFO) << "[CXL Segment Mounted Successfully] Segment name: "
<< segment.name
<< ", Mount size: " << (size / 1024 / 1024 / 1024)
<< " GB";
return ErrorCode::OK;
}
return ErrorCode::INTERNAL_ERROR;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

CXL MountSegment skips duplicate/size validation

The CXL branch bypasses the existing segment-id check and accepts size=0, so retries can overwrite state and double-count metrics. Add the same guardrails used by the non‑CXL path.

🔧 Suggested fix
 if (segment_manager_->enable_cxl_ && segment.protocol == "cxl") {
+    if (size == 0) {
+        LOG(ERROR) << "size=0 is invalid for CXL segment";
+        return ErrorCode::INVALID_PARAMS;
+    }
+    auto exist_segment_it =
+        segment_manager_->mounted_segments_.find(segment.id);
+    if (exist_segment_it != segment_manager_->mounted_segments_.end()) {
+        auto& exist_segment = exist_segment_it->second;
+        if (exist_segment.status == SegmentStatus::OK) {
+            LOG(WARNING) << "segment_name=" << segment.name
+                         << ", warn=segment_already_exists";
+            return ErrorCode::SEGMENT_ALREADY_EXISTS;
+        } else {
+            LOG(ERROR) << "segment_name=" << segment.name
+                       << ", error=segment_already_exists_but_not_ok"
+                       << ", status=" << exist_segment.status;
+            return ErrorCode::UNAVAILABLE_IN_CURRENT_STATUS;
+        }
+    }
     LOG(INFO) << "Start Mounting CXL Segment.";
     if (segment_manager_->memory_allocator_ ==
         BufferAllocatorType::CACHELIB) {
🤖 Prompt for AI Agents
In `@mooncake-store/src/segment.cpp` around lines 12 - 38, The CXL branch in the
mounting logic skips the duplicate segment-id and size validation and can accept
size==0, causing retries to overwrite state and double-count metrics; before
using segment.id/segment.name and adding allocators/metrics in the CXL path (the
block guarded by segment_manager_->enable_cxl_ && segment.protocol == "cxl" and
BufferAllocatorType::CACHELIB), add the same guardrails used by the non‑CXL
path: verify size > 0, check that segment_manager_->mounted_segments_ does not
already contain segment.id (and/or segment_manager_->client_by_name_ does not
already map the name), and return the appropriate error code instead of
proceeding; only after those checks pass should you initialize allocator, call
allocator_manager_.addAllocator(...), update client_segments_, mounted_segments_
and client_by_name_, and increment metrics.

Comment on lines +221 to +236
bool is_cxl = false;
auto&& segment = segment_manager_->mounted_segments_.find(segment_id);
if (segment != segment_manager_->mounted_segments_.end()) {
segment_name = segment->second.segment.name;
// Also remove from segment_name_client_id_map_
segment_manager_->client_by_name_.erase(segment_name);
is_cxl = (segment->second.segment.protocol == "cxl");
}
// Remove from mounted_segments_
segment_manager_->mounted_segments_.erase(segment_id);

// Decrease the total capacity
MasterMetricManager::instance().dec_total_mem_capacity(
segment_name, metrics_dec_capacity);
if (!is_cxl) {
MasterMetricManager::instance().dec_total_mem_capacity(
segment_name, metrics_dec_capacity);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

CXL capacity metrics become unbalanced on unmount

CXL mount increments total capacity, but unmount now skips decrement. Re-mounting (or multiple CXL segments) will inflate capacity metrics. Either keep accounting symmetric or move CXL capacity tracking to a one-time initializer.

🔧 Suggested fix (symmetry)
-    if (!is_cxl) {
-        MasterMetricManager::instance().dec_total_mem_capacity(
-            segment_name, metrics_dec_capacity);
-    }
+    MasterMetricManager::instance().dec_total_mem_capacity(
+        segment_name, metrics_dec_capacity);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
bool is_cxl = false;
auto&& segment = segment_manager_->mounted_segments_.find(segment_id);
if (segment != segment_manager_->mounted_segments_.end()) {
segment_name = segment->second.segment.name;
// Also remove from segment_name_client_id_map_
segment_manager_->client_by_name_.erase(segment_name);
is_cxl = (segment->second.segment.protocol == "cxl");
}
// Remove from mounted_segments_
segment_manager_->mounted_segments_.erase(segment_id);
// Decrease the total capacity
MasterMetricManager::instance().dec_total_mem_capacity(
segment_name, metrics_dec_capacity);
if (!is_cxl) {
MasterMetricManager::instance().dec_total_mem_capacity(
segment_name, metrics_dec_capacity);
}
bool is_cxl = false;
auto&& segment = segment_manager_->mounted_segments_.find(segment_id);
if (segment != segment_manager_->mounted_segments_.end()) {
segment_name = segment->second.segment.name;
// Also remove from segment_name_client_id_map_
segment_manager_->client_by_name_.erase(segment_name);
is_cxl = (segment->second.segment.protocol == "cxl");
}
// Remove from mounted_segments_
segment_manager_->mounted_segments_.erase(segment_id);
// Decrease the total capacity
MasterMetricManager::instance().dec_total_mem_capacity(
segment_name, metrics_dec_capacity);
🤖 Prompt for AI Agents
In `@mooncake-store/src/segment.cpp` around lines 221 - 236, The unmount path
skips calling MasterMetricManager::instance().dec_total_mem_capacity when
segment.protocol == "cxl", causing capacity to be incremented on mount but not
decremented on unmount; make the accounting symmetric by always decrementing
total mem capacity for the segment you found. In practice, ensure you only call
dec_total_mem_capacity when the segment was located (use the same found
condition used to set segment_name/is_cxl), and remove the is_cxl short-circuit
so that MasterMetricManager::instance().dec_total_mem_capacity(segment_name,
metrics_dec_capacity) is invoked for CXL segments as well (alternatively move
CXL capacity tracking to a one-time initializer if you intend per-system
accounting).

Comment on lines +27 to +31
DEFINE_string(cxl_device_name, "/dev/dax0.0", "Device name for cxl");
DEFINE_uint64(cxl_device_size, 8589934592, "Device Size for cxl");
DEFINE_bool(auto_disc, false, "Auto discover tcp devices");
DEFINE_string(transfer_engine_metadata_url, "etcd://10.130.5.132:2379",
"Metadata connection string for transfer engine");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Hardcoded IP addresses reduce test portability.

The hardcoded IP address 10.130.5.132 in cxl_device_name default and transfer_engine_metadata_url will fail in environments where this specific infrastructure isn't available. Consider using environment variables or localhost for default values.

🔧 Suggested fix for portability
-DEFINE_string(cxl_device_name, "/dev/dax0.0", "Device name for cxl");
-DEFINE_uint64(cxl_device_size, 8589934592, "Device Size for cxl");
-DEFINE_bool(auto_disc, false, "Auto discover tcp devices");
-DEFINE_string(transfer_engine_metadata_url, "etcd://10.130.5.132:2379",
-              "Metadata connection string for transfer engine");
+DEFINE_string(cxl_device_name, "/dev/dax0.0", "Device name for cxl");
+DEFINE_uint64(cxl_device_size, 8589934592, "Device Size for cxl");
+DEFINE_bool(auto_disc, false, "Auto discover tcp devices");
+DEFINE_string(transfer_engine_metadata_url, "",
+              "Metadata connection string for transfer engine (required)");

Then validate in SetUpTestSuite that the URL is provided.

🤖 Prompt for AI Agents
In `@mooncake-store/tests/cxl_client_integration_test.cpp` around lines 27 - 31,
The test hardcodes an environment-specific IP in the FLAGS/DEFINE defaults
(DEFINE_string cxl_device_name and transfer_engine_metadata_url) which breaks
portability; update the DEFINE_string defaults to use localhost or read from an
environment variable (e.g., getenv("CXL_DEVICE_NAME") /
getenv("TRANSFER_ENGINE_METADATA_URL")) instead of "10.130.5.132", and in the
test suite SetUpTestSuite (or equivalent test fixture initialization) validate
that the transfer_engine_metadata_url is present and fail fast with a clear
message if it is missing or empty; reference the DEFINE_string symbols
cxl_device_name and transfer_engine_metadata_url and the SetUpTestSuite test
fixture to locate where to change defaults and add the validation.

Comment on lines +90 to +104
static std::shared_ptr<Client> CreateClient(const std::string& host_name) {
auto client_opt = Client::Create(
host_name, // Local hostname
FLAGS_transfer_engine_metadata_url, // Metadata connection string
FLAGS_protocol, // Transfer protocol
std::nullopt, // RDMA device names (auto-discovery)
"10.130.5.132:50051");

EXPECT_TRUE(client_opt.has_value())
<< "Failed to create client with host_name: " << host_name;
if (!client_opt.has_value()) {
return nullptr;
}
return client_opt.value();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Hardcoded master server address limits test flexibility.

The master server address "10.130.5.132:50051" at line 96 is hardcoded. This should be configurable to allow running tests in different environments.

🔧 Suggested fix

Add a flag at the top of the file:

DEFINE_string(master_server_address, "10.130.5.132:50051",
              "Master server address for testing");

Then use it in CreateClient:

         auto client_opt = Client::Create(
             host_name,                           // Local hostname
             FLAGS_transfer_engine_metadata_url,  // Metadata connection string
             FLAGS_protocol,                      // Transfer protocol
             std::nullopt,  // RDMA device names (auto-discovery)
-            "10.130.5.132:50051");
+            FLAGS_master_server_address);
🤖 Prompt for AI Agents
In `@mooncake-store/tests/cxl_client_integration_test.cpp` around lines 90 - 104,
The CreateClient function currently uses a hardcoded master server address
("10.130.5.132:50051"); add a gflags string flag (e.g.,
DEFINE_string(master_server_address, "10.130.5.132:50051", "Master server
address for testing")) near the top of the test file and replace the literal in
CreateClient with FLAGS_master_server_address so tests can be configured per
environment (refer to CreateClient and the hardcoded address string to locate
the change).

Comment on lines +129 to +134
static void TearDownTestSuite() {
CleanupSegment();
CleanupClients();
master_.Stop();
google::ShutdownGoogleLogging();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

master_.Stop() called but master never started.

The static InProcMaster master_ is never started via a Start() call, yet Stop() is called in teardown. This suggests either incomplete test setup or dead code. If the test relies on an external master (given the hardcoded IP), master_ shouldn't be a member at all.

🔧 Suggested fix

Either remove the unused master_ member and its Stop() call:

 static void TearDownTestSuite() {
     CleanupSegment();
     CleanupClients();
-    master_.Stop();
     google::ShutdownGoogleLogging();
 }

And remove from static members:

-static InProcMaster master_;

Or properly start the master in SetUpTestSuite if it should be used.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
static void TearDownTestSuite() {
CleanupSegment();
CleanupClients();
master_.Stop();
google::ShutdownGoogleLogging();
}
static void TearDownTestSuite() {
CleanupSegment();
CleanupClients();
google::ShutdownGoogleLogging();
}
🤖 Prompt for AI Agents
In `@mooncake-store/tests/cxl_client_integration_test.cpp` around lines 129 - 134,
TearDownTestSuite is calling master_.Stop() although the static InProcMaster
master_ is never started; either remove the unused master_ member and its
master_.Stop() call in TearDownTestSuite (also remove master_ from the class
static members) or ensure the master is actually started by adding a
master_.Start() (and any necessary configuration) in SetUpTestSuite; locate
references to master_, InProcMaster, TearDownTestSuite and SetUpTestSuite to
apply the chosen fix consistently.

Comment on lines +265 to +276
buffer = client_buffer_allocator_->allocate(1 * 1024 * 1024);
slices.clear();
slices.emplace_back(Slice{buffer, test_data.size()});
// Verify data through Get operation
auto get_result = test_client_->Get(key, slices);
ASSERT_TRUE(get_result.has_value())
<< "Get operation failed: " << toString(get_result.error());
ASSERT_EQ(slices.size(), 1);
ASSERT_EQ(slices[0].size, test_data.size());
ASSERT_EQ(slices[0].ptr, buffer);
ASSERT_EQ(memcmp(slices[0].ptr, test_data.data(), test_data.size()), 0);
client_buffer_allocator_->deallocate(buffer, test_data.size());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Allocation/deallocation size mismatch.

At line 265, 1 * 1024 * 1024 bytes are allocated, but at line 276, only test_data.size() bytes are deallocated. This mismatch may cause memory tracking issues in SimpleAllocator or memory leaks depending on the allocator implementation.

🐛 Fix the deallocation size
-    buffer = client_buffer_allocator_->allocate(1 * 1024 * 1024);
+    const size_t get_buffer_size = 1 * 1024 * 1024;
+    buffer = client_buffer_allocator_->allocate(get_buffer_size);
     slices.clear();
     slices.emplace_back(Slice{buffer, test_data.size()});
     // Verify data through Get operation
     auto get_result = test_client_->Get(key, slices);
     ASSERT_TRUE(get_result.has_value())
         << "Get operation failed: " << toString(get_result.error());
     ASSERT_EQ(slices.size(), 1);
     ASSERT_EQ(slices[0].size, test_data.size());
     ASSERT_EQ(slices[0].ptr, buffer);
     ASSERT_EQ(memcmp(slices[0].ptr, test_data.data(), test_data.size()), 0);
-    client_buffer_allocator_->deallocate(buffer, test_data.size());
+    client_buffer_allocator_->deallocate(buffer, get_buffer_size);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
buffer = client_buffer_allocator_->allocate(1 * 1024 * 1024);
slices.clear();
slices.emplace_back(Slice{buffer, test_data.size()});
// Verify data through Get operation
auto get_result = test_client_->Get(key, slices);
ASSERT_TRUE(get_result.has_value())
<< "Get operation failed: " << toString(get_result.error());
ASSERT_EQ(slices.size(), 1);
ASSERT_EQ(slices[0].size, test_data.size());
ASSERT_EQ(slices[0].ptr, buffer);
ASSERT_EQ(memcmp(slices[0].ptr, test_data.data(), test_data.size()), 0);
client_buffer_allocator_->deallocate(buffer, test_data.size());
const size_t get_buffer_size = 1 * 1024 * 1024;
buffer = client_buffer_allocator_->allocate(get_buffer_size);
slices.clear();
slices.emplace_back(Slice{buffer, test_data.size()});
// Verify data through Get operation
auto get_result = test_client_->Get(key, slices);
ASSERT_TRUE(get_result.has_value())
<< "Get operation failed: " << toString(get_result.error());
ASSERT_EQ(slices.size(), 1);
ASSERT_EQ(slices[0].size, test_data.size());
ASSERT_EQ(slices[0].ptr, buffer);
ASSERT_EQ(memcmp(slices[0].ptr, test_data.data(), test_data.size()), 0);
client_buffer_allocator_->deallocate(buffer, get_buffer_size);
🤖 Prompt for AI Agents
In `@mooncake-store/tests/cxl_client_integration_test.cpp` around lines 265 - 276,
The test allocates 1 * 1024 * 1024 bytes via client_buffer_allocator_->allocate
but deallocates only test_data.size(), causing a size mismatch; change the
deallocation to match the allocated size (e.g., store size_t allocated_size = 1
* 1024 * 1024 or use the same expression) and call
client_buffer_allocator_->deallocate(buffer, allocated_size) (instead of
deallocating test_data.size()) so allocate(...) and deallocate(...) use the same
size for buffer.

Comment on lines +300 to +329
for (int i = 0; i < batch_sz; i++) {
keys.push_back("test_key_batch_put_" + std::to_string(i));
test_data_list.push_back("test_data_" + std::to_string(i));
}
void* buffer = nullptr;
void* target_buffer = nullptr;
batched_slices.reserve(batch_sz);
for (int i = 0; i < batch_sz; i++) {
std::vector<Slice> slices;
buffer = client_buffer_allocator_->allocate(test_data_list[i].size());
memcpy(buffer, test_data_list[i].data(), test_data_list[i].size());
slices.emplace_back(Slice{buffer, test_data_list[i].size()});
batched_slices.push_back(std::move(slices));
}
// Test Batch Put operation
ReplicateConfig config;
config.replica_num = 1;
auto start = std::chrono::high_resolution_clock::now();
auto batch_put_results =
test_client_->BatchPut(keys, batched_slices, config);
// Check that all operations succeeded
for (const auto& result : batch_put_results) {
ASSERT_TRUE(result.has_value()) << "BatchPut operation failed";
}
auto end = std::chrono::high_resolution_clock::now();
LOG(INFO) << "Time taken for BatchPut: "
<< std::chrono::duration_cast<std::chrono::microseconds>(end -
start)
.count()
<< "us";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Memory leak: batch put buffers are never deallocated.

The buffers allocated in the loop at lines 307-313 for batched_slices are never freed after the batch put operation completes. This causes a memory leak.

🐛 Add cleanup for batch put buffers
     auto batch_put_results =
         test_client_->BatchPut(keys, batched_slices, config);
     // Check that all operations succeeded
     for (const auto& result : batch_put_results) {
         ASSERT_TRUE(result.has_value()) << "BatchPut operation failed";
     }
     auto end = std::chrono::high_resolution_clock::now();
     LOG(INFO) << "Time taken for BatchPut: "
               << std::chrono::duration_cast<std::chrono::microseconds>(end -
                                                                        start)
                      .count()
               << "us";
+
+    // Clean up batch put buffers
+    for (int i = 0; i < batch_sz; i++) {
+        client_buffer_allocator_->deallocate(batched_slices[i][0].ptr,
+                                             test_data_list[i].size());
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for (int i = 0; i < batch_sz; i++) {
keys.push_back("test_key_batch_put_" + std::to_string(i));
test_data_list.push_back("test_data_" + std::to_string(i));
}
void* buffer = nullptr;
void* target_buffer = nullptr;
batched_slices.reserve(batch_sz);
for (int i = 0; i < batch_sz; i++) {
std::vector<Slice> slices;
buffer = client_buffer_allocator_->allocate(test_data_list[i].size());
memcpy(buffer, test_data_list[i].data(), test_data_list[i].size());
slices.emplace_back(Slice{buffer, test_data_list[i].size()});
batched_slices.push_back(std::move(slices));
}
// Test Batch Put operation
ReplicateConfig config;
config.replica_num = 1;
auto start = std::chrono::high_resolution_clock::now();
auto batch_put_results =
test_client_->BatchPut(keys, batched_slices, config);
// Check that all operations succeeded
for (const auto& result : batch_put_results) {
ASSERT_TRUE(result.has_value()) << "BatchPut operation failed";
}
auto end = std::chrono::high_resolution_clock::now();
LOG(INFO) << "Time taken for BatchPut: "
<< std::chrono::duration_cast<std::chrono::microseconds>(end -
start)
.count()
<< "us";
for (int i = 0; i < batch_sz; i++) {
keys.push_back("test_key_batch_put_" + std::to_string(i));
test_data_list.push_back("test_data_" + std::to_string(i));
}
void* buffer = nullptr;
void* target_buffer = nullptr;
batched_slices.reserve(batch_sz);
for (int i = 0; i < batch_sz; i++) {
std::vector<Slice> slices;
buffer = client_buffer_allocator_->allocate(test_data_list[i].size());
memcpy(buffer, test_data_list[i].data(), test_data_list[i].size());
slices.emplace_back(Slice{buffer, test_data_list[i].size()});
batched_slices.push_back(std::move(slices));
}
// Test Batch Put operation
ReplicateConfig config;
config.replica_num = 1;
auto start = std::chrono::high_resolution_clock::now();
auto batch_put_results =
test_client_->BatchPut(keys, batched_slices, config);
// Check that all operations succeeded
for (const auto& result : batch_put_results) {
ASSERT_TRUE(result.has_value()) << "BatchPut operation failed";
}
auto end = std::chrono::high_resolution_clock::now();
LOG(INFO) << "Time taken for BatchPut: "
<< std::chrono::duration_cast<std::chrono::microseconds>(end -
start)
.count()
<< "us";
// Clean up batch put buffers
for (int i = 0; i < batch_sz; i++) {
client_buffer_allocator_->deallocate(batched_slices[i][0].ptr,
test_data_list[i].size());
}
🤖 Prompt for AI Agents
In `@mooncake-store/tests/cxl_client_integration_test.cpp` around lines 300 - 329,
The test leaks memory because buffers allocated via
client_buffer_allocator_->allocate for each Slice stored in batched_slices are
never deallocated; after the BatchPut call completes (test_client_->BatchPut),
iterate the batched_slices (and any target_buffer if used) and call the matching
deallocation on the allocator for each Slice pointer (e.g., deallocate the void*
stored in each Slice) to free all allocated buffers before the test ends; ensure
you handle all entries in batched_slices and clear the container afterward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants