Conversation

@zhuxinjie-nz
Collaborator

The FileStorage class will be responsible for managing the entire lifecycle of key-value objects, including:

  • Storing new objects (Put)
  • Retrieving existing objects (Get)
  • Deleting objects
  • Tracking object metadata (size, location, timestamp)
  • Notifying the master node of metadata changes (e.g., after offload completion)
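For orientation, here is a minimal interface sketch of what such a class might expose, using the tl::expected<..., ErrorCode> style seen elsewhere in this PR. The method names and signatures below are assumptions based on the description above, not the actual contents of file_storage.h; Slice, ErrorCode, and Client are the project types this PR refers to.

#include <memory>
#include <string>

#include <tl/expected.hpp>

// Illustrative sketch only; the real FileStorage interface in this PR differs.
class Client;

class FileStorage {
   public:
    // Store a new object under `key`.
    tl::expected<void, ErrorCode> Put(const std::string& key, const Slice& value);
    // Retrieve an existing object.
    tl::expected<Slice, ErrorCode> Get(const std::string& key);
    // Delete an object and its tracked metadata (size, location, timestamp).
    tl::expected<void, ErrorCode> Remove(const std::string& key);

   private:
    // Used to notify the master node of metadata changes (e.g., after offload).
    std::shared_ptr<Client> client_;
};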

@gemini-code-assist
Contributor

Summary of Changes

Hello @zhuxinjie-nz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new FileStorage component designed to manage the persistent storage and lifecycle of key-value data. It integrates with the existing Client to communicate with a master node for offloading and metadata synchronization. The changes also involve significant refactoring of the BucketStorageBackend to streamline metadata handling and introduce environment-variable-based configuration for the new storage component.

Highlights

  • New FileStorage Component: Introduced a new FileStorage class responsible for managing the entire lifecycle of key-value data, including storage, retrieval, deletion, metadata tracking, and master node notifications for offloading.
  • Client API Extensions: The Client class has been extended with new methods (MountFileStorage, OffloadObjectHeartbeat, BatchPutOffloadObject, NotifyOffloadSuccess) to facilitate communication and coordination with the master node for file storage operations.
  • Refactored BucketStorageBackend: The BucketStorageBackend has been updated to streamline metadata handling, now using std::vector<StorageObjectMetadata> for object metadata within BucketMetadata and adjusting method signatures for BatchOffload, BatchLoad, and BucketScan accordingly.
  • Environment Variable Configuration: The FileStorageConfig struct now supports loading configuration parameters (e.g., storage paths, buffer sizes, limits, heartbeat intervals) directly from environment variables, enhancing deployment flexibility.
  • New Error Code and Testing: A new error code KEYS_ULTRA_LIMIT has been added, and a dedicated test file (file_storage_test.cpp) was introduced to validate the FileStorageConfig's environment variable parsing and configuration validation logic.
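To make the environment-variable configuration highlighted above concrete, here is a rough sketch of how such loading typically looks. The variable names, fields, and defaults below are assumptions for illustration, not the keys actually read by this PR's FileStorageConfig.

#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical sketch; the real key names, fields, and defaults live in
// FileStorageConfig in this PR.
struct FileStorageConfigSketch {
    std::string storage_filepath = "/tmp/file_storage";  // assumed default
    std::size_t write_buffer_size = 64 * 1024 * 1024;    // assumed default
    int heartbeat_interval_ms = 1000;                     // assumed default

    static FileStorageConfigSketch FromEnv() {
        FileStorageConfigSketch cfg;
        if (const char* v = std::getenv("FILE_STORAGE_PATH"))          // assumed name
            cfg.storage_filepath = v;
        if (const char* v = std::getenv("FILE_STORAGE_BUFFER_SIZE"))   // assumed name
            cfg.write_buffer_size = std::stoull(v);
        if (const char* v = std::getenv("FILE_STORAGE_HEARTBEAT_MS"))  // assumed name
            cfg.heartbeat_interval_ms = std::stoi(v);
        return cfg;
    }
};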

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new FileStorage component to manage the lifecycle of key-value data, including storage, retrieval, and offloading. The implementation is well-structured, with a dedicated configuration class, a FileStorage class to orchestrate operations, and a BucketIterator. The related changes to storage_backend are consistent and improve correctness with new checks. The addition of unit tests for the new configuration is also a positive step.

However, the review identified several critical issues concerning the handling of tl::expected return values. In multiple places, the code calls .value() without first checking if an error is present, which will lead to application crashes. There is also a high-severity portability issue in the test code related to environment variable handling on Windows, and some medium-severity typos. Addressing these issues is crucial for the stability and correctness of the new component.

Comment on lines 192 to 195
if (!allocate_res) {
    LOG(ERROR) << "Failed to allocate batch objects, target = "
               << transfer_engine_addr;
}

critical

The code checks if allocate_res contains an error and logs it, but then proceeds to call allocate_res.value() on the next line. This will cause a crash if allocate_res is an error. The function should return early in the error case.

Suggested change
if (!allocate_res) {
    LOG(ERROR) << "Failed to allocate batch objects, target = "
               << transfer_engine_addr;
    return tl::make_unexpected(allocate_res.error());
}

Comment on lines 231 to 234
        if (!enable_offloading_result) {
            LOG(ERROR) << "Get is enable offloading failed with error: "
                       << enable_offloading_result.error();
        }

critical

If enable_offloading_result contains an error, it is logged, but the code proceeds to call .value() on it, which will cause a crash. The error should be handled, for example by returning from the function.

        if (!enable_offloading_result) {
            LOG(ERROR) << "Get is enable offloading failed with error: "
                       << enable_offloading_result.error();
            return tl::make_unexpected(enable_offloading_result.error());
        }

Comment on lines +475 to +479
            if (!is_exist_result) {
                LOG(ERROR) << "Failed to check existence in storage backend: "
                           << "key=" << it->first
                           << ", error=" << is_exist_result.error();
            }

critical

If is_exist_result contains an error, it is logged, but the code proceeds to call .value() on it, which will cause a crash. The error should be handled. You could propagate the error or continue to the next item in the loop.

            if (!is_exist_result) {
                LOG(ERROR) << "Failed to check existence in storage backend: "
                           << "key=" << it->first
                           << ", error=" << is_exist_result.error();
                return tl::make_unexpected(is_exist_result.error());
            }


BucketIterator bucket_iterator(storage_backend_,
                               config_.bucket_iterator_keys_limit);
while (bucket_iterator.HasNext().value()) {

high

Calling .value() on a tl::expected object without first checking for an error can cause a crash if an error is present. The bucket_iterator.HasNext() method returns a tl::expected<bool, ErrorCode>, which should be checked before accessing its value.

A safer way to write this loop would be:

while (true) {
    auto has_next_res = bucket_iterator.HasNext();
    if (!has_next_res) {
        LOG(ERROR) << "Failed to check for next bucket: " << has_next_res.error();
        return tl::make_unexpected(has_next_res.error());
    }
    if (!has_next_res.value()) {
        break;
    }
    // ... rest of the loop body
}

Comment on lines 16 to 22
void UnsetEnv(const std::string& key) {
#ifdef _WIN32
    unsetenv(key.c_str());
#else
    unsetenv(key.c_str());
#endif
}

high

The unsetenv function is not standard on Windows. The current implementation uses unsetenv for both Windows (_WIN32) and other platforms, which will cause a compilation failure on Windows. To unset an environment variable on Windows, you should use _putenv_s(key.c_str(), "");.

Suggested change
void UnsetEnv(const std::string& key) {
#ifdef _WIN32
    _putenv_s(key.c_str(), "");
#else
    unsetenv(key.c_str());
#endif
}

/**
* @brief Groups offloading keys into buckets based on size and existence
* checks.
* @param @param offloading_objects Input map of object keys and their sizes

medium

There is a duplicate @param tag in the Doxygen comment.

     * @param offloading_objects Input map of object keys and their sizes
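
Since the @brief above describes grouping offloading keys into buckets under size and existence/count limits, here is a rough standalone sketch of that kind of grouping. The function name, signature, and limits are assumptions for illustration, not the PR's actual implementation (which also performs existence checks against the backend).

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Rough illustration of size/count-based bucket grouping.
std::vector<std::vector<std::string>> GroupIntoBuckets(
    const std::unordered_map<std::string, std::size_t>& offloading_objects,
    std::size_t max_bucket_bytes, std::size_t max_bucket_keys) {
    std::vector<std::vector<std::string>> buckets;
    std::vector<std::string> current;
    std::size_t current_bytes = 0;
    for (const auto& [key, size] : offloading_objects) {
        // Close the current bucket once adding this key would exceed either limit.
        if (!current.empty() && (current_bytes + size > max_bucket_bytes ||
                                 current.size() >= max_bucket_keys)) {
            buckets.push_back(std::move(current));
            current.clear();
            current_bytes = 0;
        }
        current.push_back(key);
        current_bytes += size;
    }
    if (!current.empty()) buckets.push_back(std::move(current));
    return buckets;
}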

const std::vector<std::string>& keys);

tl::expected<void, ErrorCode> BatchLoad(
const std::unordered_map<std::string, Slice>& batche_object);

medium

There is a typo in the parameter name batche_object. It should be batch_object.

        const std::unordered_map<std::string, Slice>& batch_object);

}

tl::expected<void, ErrorCode> FileStorage::BatchLoad(
const std::unordered_map<std::string, Slice>& batche_object) {

medium

There is a typo in the parameter name batche_object. It should be batch_object to be consistent with the rest of the codebase and to match the fix in the header file.

    const std::unordered_map<std::string, Slice>& batch_object) {

@xiaguan
Collaborator

xiaguan commented Nov 7, 2025

PR Review Summary

I've completed a thorough review of this FileStorage implementation. Overall, the architecture is solid, but there are critical issues that must be addressed before merging.


🔴 Critical Issues (P0 - Must Fix)

1. Stub Implementations - Blocking Issue

Four critical Client methods are not implemented (mooncake-store/src/client.cpp:1524-1554):

  • MountFileStorage()
  • OffloadObjectHeartbeat()
  • BatchPutOffloadObject()
  • NotifyOffloadSuccess()

All these methods just return {} with a TODO comment. The entire FileStorage system cannot function without these implementations.

Impact: FileStorage::Init() calls MountFileStorage at line 141, FileStorage::Heartbeat() calls OffloadObjectHeartbeat at line 296, etc. These are core functionalities.

2. Thread Safety Violation

In file_storage.cpp:297, enable_offloading_ is accessed without holding the mutex:

auto heartbeat_result = client_->OffloadObjectHeartbeat(
    segment_name_, enable_offloading_, offloading_objects);  // Race condition!

The field is marked GUARDED_BY(offloading_mutex_) but read without lock protection, creating a data race.

Fix: Acquire the mutex before reading, or copy the value under lock.
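
A sketch of the copy-under-lock variant, assuming offloading_mutex_ is a std::mutex (the surrounding code may differ):

bool enable_offloading_copy;
{
    std::lock_guard<std::mutex> lock(offloading_mutex_);  // requires <mutex>
    enable_offloading_copy = enable_offloading_;           // copy under lock
}
auto heartbeat_result = client_->OffloadObjectHeartbeat(
    segment_name_, enable_offloading_copy, offloading_objects);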

3. Silent Error Swallowing

In file_storage.cpp:253-257, BatchOffload errors are logged but not propagated:

auto result = BatchOffload(keys);
if (!result) {
    LOG(ERROR) << "Failed to store objects with error: " << result.error();
}
// Function continues and returns {} despite error!

Fix: Return the error to the caller.
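
A sketch of that fix applied to the snippet above:

auto result = BatchOffload(keys);
if (!result) {
    LOG(ERROR) << "Failed to store objects with error: " << result.error();
    return tl::make_unexpected(result.error());  // propagate instead of swallowing
}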


🟡 Important Issues (P1 - Should Fix)

4. Typo in Error Code

KEYS_ULTRA_BUCKET_LIMIT should be KEYS_EXCEED_BUCKET_LIMIT (appears in types.h:1203, storage_backend.cpp:584)

5. Missing Integration Tests

file_storage_test.cpp only tests config parsing. No tests for:

  • FileStorage core functionality (Init, BatchGet, Heartbeat)
  • Client interaction scenarios
  • Concurrent access patterns
  • Error recovery paths

6. Security: Path Validation

storage_filepath from environment is not validated and could be exploited for path traversal attacks.
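
One possible mitigation, sketched with std::filesystem; the helper name and the notion of an allowed base directory are assumptions, not part of the PR:

#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical check: reject storage paths that resolve outside an allowed base.
bool IsStoragePathAllowed(const std::string& storage_filepath,
                          const fs::path& allowed_base) {
    std::error_code ec;
    const fs::path canonical = fs::weakly_canonical(storage_filepath, ec);
    if (ec) return false;
    const fs::path base = fs::weakly_canonical(allowed_base, ec);
    if (ec) return false;
    // A relative path that is empty or starts with ".." escapes the base.
    const fs::path rel = canonical.lexically_relative(base);
    return !rel.empty() && *rel.begin() != fs::path("..");
}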

7. Type Inconsistency

In file_storage.cpp:422, loop uses int64_t but should use size_t to match keys.size():

for (int64_t i = 0; i < keys.size(); ++i) {  // Type mismatch
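
The corrected loop header, per the note above:

for (size_t i = 0; i < keys.size(); ++i) {  // size_t matches keys.size()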

✅ Strengths

  1. Well-structured architecture: Clean layering with FileStorage → BucketStorageBackend
  2. Bucket-based storage: Smart grouping with configurable limits (256MB, 500 keys per bucket)
  3. Heartbeat mechanism: Good design for master synchronization
  4. Environment-based config: Flexible deployment configuration
  5. Improved storage backend: Better metadata structure (parallel vectors vs hash map)
  6. Error recovery: Init() now detects and cleans up corrupted buckets

📝 Recommendations (P2 - Nice to Have)

  1. Add class-level documentation for FileStorage
  2. Document metadata migration strategy for schema changes
  3. Consider adaptive heartbeat interval based on load
  4. Add metrics/observability for operations
  5. Remove the hardcoded sleep in storage_backend.cpp:194 (it invites timing-dependent bugs)

Verdict

Status: ⚠️ Cannot Merge - Critical blockers present

The architecture and design are solid, but the stub implementations make this non-functional. Additionally, the thread safety issue could cause production incidents.

Estimated work: 1-2 days to implement the four Client methods + fix critical issues, plus another day for comprehensive testing.

Once P0 items are resolved, this will be a strong addition to the codebase! 👍

@zhuxinjie-nz
Collaborator Author

The Client APIs are ready but will be submitted in the next PR to keep the changes focused. FileStorage currently has no external read/write surface, so this change is safe and non-breaking.
