Conversation

@zhuxinjie-nz
Collaborator

The FileStorage class will be responsible for managing the entire lifecycle of key-value objects, including:

  • Storing new objects (Put)
  • Retrieving existing objects (Get)
  • Deleting objects
  • Tracking object metadata (size, location, timestamp)
  • Notifying the master node of metadata changes (e.g., after offload completion)
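For orientation, here is a minimal interface sketch of what such a class might expose, using the tl::expected<..., ErrorCode> style seen elsewhere in this PR. The method names and signatures below are assumptions based on the description above, not the actual contents of file_storage.h; Slice, ErrorCode, and Client are the project types this PR refers to.

#include <memory>
#include <string>

#include <tl/expected.hpp>

// Illustrative sketch only; the real FileStorage interface in this PR differs.
class Client;

class FileStorage {
   public:
    // Store a new object under `key`.
    tl::expected<void, ErrorCode> Put(const std::string& key, const Slice& value);
    // Retrieve an existing object.
    tl::expected<Slice, ErrorCode> Get(const std::string& key);
    // Delete an object and its tracked metadata (size, location, timestamp).
    tl::expected<void, ErrorCode> Remove(const std::string& key);

   private:
    // Used to notify the master node of metadata changes (e.g., after offload).
    std::shared_ptr<Client> client_;
};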

@gemini-code-assist
Contributor

Summary of Changes

Hello @zhuxinjie-nz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new FileStorage component designed to manage the persistent storage and lifecycle of key-value data. It integrates with the existing Client to communicate with a master node for offloading and metadata synchronization. The changes also involve significant refactoring of the BucketStorageBackend to streamline metadata handling and introduce environment-variable-based configuration for the new storage component.

Highlights

  • New FileStorage Component: Introduced a new FileStorage class responsible for managing the entire lifecycle of key-value data, including storage, retrieval, deletion, metadata tracking, and master node notifications for offloading.
  • Client API Extensions: The Client class has been extended with new methods (MountFileStorage, OffloadObjectHeartbeat, BatchPutOffloadObject, NotifyOffloadSuccess) to facilitate communication and coordination with the master node for file storage operations.
  • Refactored BucketStorageBackend: The BucketStorageBackend has been updated to streamline metadata handling, now using std::vector<StorageObjectMetadata> for object metadata within BucketMetadata and adjusting method signatures for BatchOffload, BatchLoad, and BucketScan accordingly.
  • Environment Variable Configuration: The FileStorageConfig struct now supports loading configuration parameters (e.g., storage paths, buffer sizes, limits, heartbeat intervals) directly from environment variables, enhancing deployment flexibility.
  • New Error Code and Testing: A new error code KEYS_ULTRA_LIMIT has been added, and a dedicated test file (file_storage_test.cpp) was introduced to validate the FileStorageConfig's environment variable parsing and configuration validation logic.
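To make the environment-variable configuration highlighted above concrete, here is a rough sketch of how such loading typically looks. The variable names, fields, and defaults below are assumptions for illustration, not the keys actually read by this PR's FileStorageConfig.

#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical sketch; the real key names, fields, and defaults live in
// FileStorageConfig in this PR.
struct FileStorageConfigSketch {
    std::string storage_filepath = "/tmp/file_storage";  // assumed default
    std::size_t write_buffer_size = 64 * 1024 * 1024;    // assumed default
    int heartbeat_interval_ms = 1000;                     // assumed default

    static FileStorageConfigSketch FromEnv() {
        FileStorageConfigSketch cfg;
        if (const char* v = std::getenv("FILE_STORAGE_PATH"))          // assumed name
            cfg.storage_filepath = v;
        if (const char* v = std::getenv("FILE_STORAGE_BUFFER_SIZE"))   // assumed name
            cfg.write_buffer_size = std::stoull(v);
        if (const char* v = std::getenv("FILE_STORAGE_HEARTBEAT_MS"))  // assumed name
            cfg.heartbeat_interval_ms = std::stoi(v);
        return cfg;
    }
};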

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new FileStorage component to manage the lifecycle of key-value data, including storage, retrieval, and offloading. The implementation is well-structured, with a dedicated configuration class, a FileStorage class to orchestrate operations, and a BucketIterator. The related changes to storage_backend are consistent and improve correctness with new checks. The addition of unit tests for the new configuration is also a positive step.

However, the review identified several critical issues concerning the handling of tl::expected return values. In multiple places, the code calls .value() without first checking if an error is present, which will lead to application crashes. There is also a high-severity portability issue in the test code related to environment variable handling on Windows, and some medium-severity typos. Addressing these issues is crucial for the stability and correctness of the new component.

Comment on lines 192 to 195
if (!allocate_res) {
    LOG(ERROR) << "Failed to allocate batch objects, target = "
               << transfer_engine_addr;
}

critical

The code checks if allocate_res contains an error and logs it, but then proceeds to call allocate_res.value() on the next line. This will cause a crash if allocate_res is an error. The function should return early in the error case.

Suggested change
if (!allocate_res) {
    LOG(ERROR) << "Failed to allocate batch objects, target = "
               << transfer_engine_addr;
    return tl::make_unexpected(allocate_res.error());
}

Comment on lines 231 to 234
        if (!enable_offloading_result) {
            LOG(ERROR) << "Get is enable offloading failed with error: "
                       << enable_offloading_result.error();
        }

critical

If enable_offloading_result contains an error, it is logged, but the code proceeds to call .value() on it, which will cause a crash. The error should be handled, for example by returning from the function.

        if (!enable_offloading_result) {
            LOG(ERROR) << "Get is enable offloading failed with error: "
                       << enable_offloading_result.error();
            return tl::make_unexpected(enable_offloading_result.error());
        }

Comment on lines +475 to +479
            if (!is_exist_result) {
                LOG(ERROR) << "Failed to check existence in storage backend: "
                           << "key=" << it->first
                           << ", error=" << is_exist_result.error();
            }

critical

If is_exist_result contains an error, it is logged, but the code proceeds to call .value() on it, which will cause a crash. The error should be handled. You could propagate the error or continue to the next item in the loop.

            if (!is_exist_result) {
                LOG(ERROR) << "Failed to check existence in storage backend: "
                           << "key=" << it->first
                           << ", error=" << is_exist_result.error();
                return tl::make_unexpected(is_exist_result.error());
            }


BucketIterator bucket_iterator(storage_backend_,
                               config_.bucket_iterator_keys_limit);
while (bucket_iterator.HasNext().value()) {

high

Calling .value() on a tl::expected object without first checking for an error can cause a crash if an error is present. The bucket_iterator.HasNext() method returns a tl::expected<bool, ErrorCode>, which should be checked before accessing its value.

A safer way to write this loop would be:

while (true) {
    auto has_next_res = bucket_iterator.HasNext();
    if (!has_next_res) {
        LOG(ERROR) << "Failed to check for next bucket: " << has_next_res.error();
        return tl::make_unexpected(has_next_res.error());
    }
    if (!has_next_res.value()) {
        break;
    }
    // ... rest of the loop body
}

Comment on lines 16 to 22
void UnsetEnv(const std::string& key) {
#ifdef _WIN32
    unsetenv(key.c_str());
#else
    unsetenv(key.c_str());
#endif
}

high

The unsetenv function is not standard on Windows. The current implementation uses unsetenv for both Windows (_WIN32) and other platforms, which will cause a compilation failure on Windows. To unset an environment variable on Windows, you should use _putenv_s(key.c_str(), "");.

Suggested change
void UnsetEnv(const std::string& key) {
#ifdef _WIN32
    _putenv_s(key.c_str(), "");
#else
    unsetenv(key.c_str());
#endif
}

/**
* @brief Groups offloading keys into buckets based on size and existence
* checks.
* @param @param offloading_objects Input map of object keys and their sizes

medium

There is a duplicate @param tag in the Doxygen comment.

     * @param offloading_objects Input map of object keys and their sizes
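
Since the @brief above describes grouping offloading keys into buckets under size and existence/count limits, here is a rough standalone sketch of that kind of grouping. The function name, signature, and limits are assumptions for illustration, not the PR's actual implementation (which also performs existence checks against the backend).

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Rough illustration of size/count-based bucket grouping.
std::vector<std::vector<std::string>> GroupIntoBuckets(
    const std::unordered_map<std::string, std::size_t>& offloading_objects,
    std::size_t max_bucket_bytes, std::size_t max_bucket_keys) {
    std::vector<std::vector<std::string>> buckets;
    std::vector<std::string> current;
    std::size_t current_bytes = 0;
    for (const auto& [key, size] : offloading_objects) {
        // Close the current bucket once adding this key would exceed either limit.
        if (!current.empty() && (current_bytes + size > max_bucket_bytes ||
                                 current.size() >= max_bucket_keys)) {
            buckets.push_back(std::move(current));
            current.clear();
            current_bytes = 0;
        }
        current.push_back(key);
        current_bytes += size;
    }
    if (!current.empty()) buckets.push_back(std::move(current));
    return buckets;
}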

const std::vector<std::string>& keys);

tl::expected<void, ErrorCode> BatchLoad(
const std::unordered_map<std::string, Slice>& batche_object);

medium

There is a typo in the parameter name batche_object. It should be batch_object.

        const std::unordered_map<std::string, Slice>& batch_object);

}

tl::expected<void, ErrorCode> FileStorage::BatchLoad(
const std::unordered_map<std::string, Slice>& batche_object) {

medium

There is a typo in the parameter name batche_object. It should be batch_object to be consistent with the rest of the codebase and to match the fix in the header file.

    const std::unordered_map<std::string, Slice>& batch_object) {

@xiaguan
Collaborator

xiaguan commented Nov 7, 2025

PR Review Summary

I've completed a thorough review of this FileStorage implementation. Overall, the architecture is solid, but there are critical issues that must be addressed before merging.


🔴 Critical Issues (P0 - Must Fix)

1. Stub Implementations - Blocking Issue

Four critical Client methods are not implemented (mooncake-store/src/client.cpp:1524-1554):

  • MountFileStorage()
  • OffloadObjectHeartbeat()
  • BatchPutOffloadObject()
  • NotifyOffloadSuccess()

All these methods just return {} with a TODO comment. The entire FileStorage system cannot function without these implementations.

Impact: FileStorage::Init() calls MountFileStorage at line 141, FileStorage::Heartbeat() calls OffloadObjectHeartbeat at line 296, etc. These are core functionalities.

2. Thread Safety Violation

In file_storage.cpp:297, enable_offloading_ is accessed without holding the mutex:

auto heartbeat_result = client_->OffloadObjectHeartbeat(
    segment_name_, enable_offloading_, offloading_objects);  // Race condition!

The field is marked GUARDED_BY(offloading_mutex_) but read without lock protection, creating a data race.

Fix: Acquire the mutex before reading, or copy the value under lock.
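
A sketch of the copy-under-lock variant, assuming offloading_mutex_ is a std::mutex (the surrounding code may differ):

bool enable_offloading_copy;
{
    std::lock_guard<std::mutex> lock(offloading_mutex_);  // requires <mutex>
    enable_offloading_copy = enable_offloading_;           // copy under lock
}
auto heartbeat_result = client_->OffloadObjectHeartbeat(
    segment_name_, enable_offloading_copy, offloading_objects);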

3. Silent Error Swallowing

In file_storage.cpp:253-257, BatchOffload errors are logged but not propagated:

auto result = BatchOffload(keys);
if (!result) {
    LOG(ERROR) << "Failed to store objects with error: " << result.error();
}
// Function continues and returns {} despite error!

Fix: Return the error to the caller.
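
A sketch of that fix applied to the snippet above:

auto result = BatchOffload(keys);
if (!result) {
    LOG(ERROR) << "Failed to store objects with error: " << result.error();
    return tl::make_unexpected(result.error());  // propagate instead of swallowing
}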


🟡 Important Issues (P1 - Should Fix)

4. Typo in Error Code

KEYS_ULTRA_BUCKET_LIMIT should be KEYS_EXCEED_BUCKET_LIMIT (appears in types.h:1203, storage_backend.cpp:584)

5. Missing Integration Tests

file_storage_test.cpp only tests config parsing. No tests for:

  • FileStorage core functionality (Init, BatchGet, Heartbeat)
  • Client interaction scenarios
  • Concurrent access patterns
  • Error recovery paths

6. Security: Path Validation

storage_filepath from environment is not validated and could be exploited for path traversal attacks.
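
One possible mitigation, sketched with std::filesystem; the helper name and the notion of an allowed base directory are assumptions, not part of the PR:

#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical check: reject storage paths that resolve outside an allowed base.
bool IsStoragePathAllowed(const std::string& storage_filepath,
                          const fs::path& allowed_base) {
    std::error_code ec;
    const fs::path canonical = fs::weakly_canonical(storage_filepath, ec);
    if (ec) return false;
    const fs::path base = fs::weakly_canonical(allowed_base, ec);
    if (ec) return false;
    // A relative path that is empty or starts with ".." escapes the base.
    const fs::path rel = canonical.lexically_relative(base);
    return !rel.empty() && *rel.begin() != fs::path("..");
}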

7. Type Inconsistency

In file_storage.cpp:422, loop uses int64_t but should use size_t to match keys.size():

for (int64_t i = 0; i < keys.size(); ++i) {  // Type mismatch
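
The corrected loop header, per the note above:

for (size_t i = 0; i < keys.size(); ++i) {  // size_t matches keys.size()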

✅ Strengths

  1. Well-structured architecture: Clean layering with FileStorage → BucketStorageBackend
  2. Bucket-based storage: Smart grouping with configurable limits (256MB, 500 keys per bucket)
  3. Heartbeat mechanism: Good design for master synchronization
  4. Environment-based config: Flexible deployment configuration
  5. Improved storage backend: Better metadata structure (parallel vectors vs hash map)
  6. Error recovery: Init() now detects and cleans up corrupted buckets

📝 Recommendations (P2 - Nice to Have)

  1. Add class-level documentation for FileStorage
  2. Document metadata migration strategy for schema changes
  3. Consider adaptive heartbeat interval based on load
  4. Add metrics/observability for operations
  5. Remove the hardcoded sleep in storage_backend.cpp:194 (it invites timing-dependent bugs)

Verdict

Status: ⚠️ Cannot Merge - Critical blockers present

The architecture and design are solid, but the stub implementations make this non-functional. Additionally, the thread safety issue could cause production incidents.

Estimated work: 1-2 days to implement the four Client methods + fix critical issues, plus another day for comprehensive testing.

Once P0 items are resolved, this will be a strong addition to the codebase! 👍

@zhuxinjie-nz
Collaborator Author

The Client APIs are ready but will be submitted in the next PR to keep the changes focused. FileStorage currently has no external read/write surface, so this change is safe and non-breaking.
