Changes proposed
This RFC proposes the design of a new feature: support for contributing local SSD storage to the distributed store pool. It introduces a unified storage backend interface with multiple backends, file storage Get/Put/Eviction workflows, and Master coordination.
Motivation
Currently, Mooncake relies heavily on in-memory storage for fast access. However, memory capacity is limited and expensive. Adding support for local SSD-based storage enables:
- Larger total storage capacity.
- Tiered data access.
- Improved data persistence and recovery.
- Flexibility in resource allocation between heterogeneous clients.
Design Overview
1. Storage Backend: Unified Interface
All backends will implement a unified storage interface to simplify integration and extension; a sketch of this interface follows the backend list below.
We plan to support three backend implementations:
- File-per-key Backend (already supported)
  - Each key-value pair is stored in a separate file.
  - Simple but inefficient for large-scale data.
- Bucket Backend @zhuxinjie-nz
  - Multiple key-value pairs are grouped into a single file (a "bucket").
  - Write and eviction are performed at the bucket level, improving I/O efficiency.
- OffsetAllocator Backend
  - A large pre-allocated file acts as a storage arena.
  - Space is allocated and released using an `OffsetAllocator`.
  - Enables efficient space management and fewer file operations.
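A minimal sketch of what the unified interface could look like, assuming an abstract `StorageBackend` base class; the class name, method signatures, and `BackendStatus` codes are illustrative, not the actual Mooncake API:

```cpp
#include <cstddef>
#include <span>
#include <string>

// Hypothetical status codes; the real project likely has its own error type.
enum class BackendStatus { kOk, kNotFound, kNoSpace, kIOError };

// Illustrative unified interface that the file-per-key, bucket, and
// OffsetAllocator backends could all implement.
class StorageBackend {
 public:
  virtual ~StorageBackend() = default;

  // Persist `value` under `key`; the backend decides the on-disk layout.
  virtual BackendStatus Put(const std::string& key,
                            std::span<const std::byte> value) = 0;

  // Read the value stored for `key` into `out`.
  virtual BackendStatus Get(const std::string& key, std::string& out) = 0;

  // Drop `key` and reclaim its space (file, bucket slot, or offset range).
  virtual BackendStatus Remove(const std::string& key) = 0;

  // Bytes currently in use, so the client can trigger eviction near capacity.
  virtual std::size_t UsedBytes() const = 0;
};
```

Each backend would then differ only in how it maps keys to on-disk layout behind these methods.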
2. Get Workflow @zhuxinjie-nz
- Client A issues a `Get` request.
- Master returns the replica information.
- Client A attempts to read from a memory replica first.
- If only a remote file replica exists (located on Client B), then:
  - A sends an RPC to B.
  - B reads the requested data from local SSD into its local buffer memory.
  - B returns the buffer address to A and guarantees that the buffer will not be overwritten with other data for a certain time (e.g. 5s).
  - A performs an RDMA read to fetch the data directly.
  - If the read completes before the timeout → A returns OK.
  - If it times out → A returns Error.
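The following sketch illustrates the remote-file-replica path of the `Get` flow with the 5s buffer lease; `RpcReadToBuffer()` and `RdmaRead()` are hypothetical stand-ins for the real RPC and RDMA transport:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>

using Clock = std::chrono::steady_clock;

struct RemoteBuffer {
  std::uint64_t addr = 0;       // buffer address returned by Client B
  std::size_t length = 0;       // bytes staged in B's memory
  Clock::time_point lease_end;  // B keeps the buffer untouched until here
};

// Stub: B reads `key` from its local SSD into memory and returns the address
// plus a lease (e.g. 5 seconds) during which it will not reuse the buffer.
RemoteBuffer RpcReadToBuffer(const std::string& /*key*/) {
  return {0x1000, 4096, Clock::now() + std::chrono::seconds(5)};
}

// Stub: A reads `length` bytes from the remote address via RDMA.
bool RdmaRead(std::uint64_t /*addr*/, std::size_t length, std::string& out) {
  out.assign(length, '\0');
  return true;
}

// A's fallback path when only a file replica on Client B exists.
bool ReadRemoteFileReplica(const std::string& key, std::string& out) {
  RemoteBuffer buf = RpcReadToBuffer(key);        // A -> B RPC
  bool ok = RdmaRead(buf.addr, buf.length, out);  // A reads B's buffer directly
  // The RDMA read must finish while B's lease is still valid; otherwise Error.
  return ok && Clock::now() <= buf.lease_end;
}
```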
3. Put Workflow @zhuxinjie-nz
- Client A issues a `Put` request.
- Master assigns the target replica to Client B.
- A performs an RDMA write to B.
- Once the write succeeds, A sends a `PutEnd` notification to Master.
- Upon receiving `PutEnd`, Master adds the key to B's persistence queue.
- B periodically requests pending persistence tasks from Master.
- B obtains the persistence request, performs a `BatchGetReplica` to acquire the lease, and writes the data to local SSD.
- Once successfully persisted, B notifies Master with another `PutEnd` message.
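Client B's side of this hand-off could look roughly like the sketch below; `FetchPersistenceTasks()`, `BatchGetReplica()`, `WriteToSsd()`, and `NotifyPutEnd()` are hypothetical stand-ins for the real Master RPCs and SSD writer:

```cpp
#include <string>
#include <vector>

struct PersistTask {
  std::string key;  // key whose data should be written to local SSD
};

// Stubs standing in for the Master RPCs and the local SSD writer.
std::vector<PersistTask> FetchPersistenceTasks() { return {}; }          // poll Master
std::string BatchGetReplica(const std::string& /*key*/) { return {}; }   // acquire lease + data
bool WriteToSsd(const std::string& /*key*/, const std::string& /*data*/) { return true; }
void NotifyPutEnd(const std::string& /*key*/) {}                         // second PutEnd to Master

// One iteration of B's periodic persistence loop.
void PersistenceLoopOnce() {
  for (const PersistTask& task : FetchPersistenceTasks()) {
    // Acquire the lease and fetch the data for this key.
    std::string data = BatchGetReplica(task.key);
    // Write the replica to local SSD, then report success back to Master.
    if (WriteToSsd(task.key, data)) {
      NotifyPutEnd(task.key);
    }
  }
}
```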
Load Balancing Considerations
Clients may have heterogeneous resource configurations:
- Client B: large memory, limited or no SSD.
- Client C: minimal memory, large SSD.
To optimize for this diversity, future versions will enhance Master’s scheduling logic to assign persistence tasks to the most suitable clients. The rest of the flow remains unchanged.
4. Eviction Workflow
- Each client manages its local SSD storage usage.
- When nearing capacity, the client initiates Eviction:
  - The client sends a `Remove` request to Master.
  - Upon successful confirmation, the client deletes the local data.
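One possible shape for a capacity-triggered eviction pass, assuming an illustrative high-watermark threshold and hypothetical helpers `PickVictims()`, `RemoveFromMaster()`, and `DeleteLocal()`:

```cpp
#include <cstddef>
#include <string>
#include <vector>

constexpr double kHighWatermark = 0.9;  // assumed threshold: evict above 90% usage

// Stubs for the eviction policy, the Remove RPC to Master, and local cleanup.
std::vector<std::string> PickVictims() { return {}; }  // e.g. coldest keys
bool RemoveFromMaster(const std::string& /*key*/) { return true; }
void DeleteLocal(const std::string& /*key*/) {}

void MaybeEvict(std::size_t used_bytes, std::size_t capacity_bytes) {
  if (used_bytes < kHighWatermark * capacity_bytes) return;
  for (const std::string& key : PickVictims()) {
    // Local data is deleted only after Master confirms the Remove request.
    if (RemoveFromMaster(key)) {
      DeleteLocal(key);
    }
  }
}
```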
5. Initialization Workflow
Upon startup, each client:
- Reads local file metadata.
- Validates file integrity.
- Reports valid replicas back to Master via `Put` requests.
This ensures that valid persisted data is re-registered after restarts or failures.
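A rough sketch of this startup pass, assuming hypothetical helpers `ValidateFile()` and `PutToMaster()`; the real file metadata layout and re-registration RPC are still open design points:

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Stub: checksum / header validation of a persisted replica file.
bool ValidateFile(const fs::path& file) { return fs::file_size(file) > 0; }

// Stub: report a valid replica back to Master via a Put-style request.
void PutToMaster(const std::string& /*key*/, const fs::path& /*file*/) {}

// Scan the local data directory on startup and re-register valid replicas.
void ReRegisterPersistedReplicas(const fs::path& data_dir) {
  for (const auto& entry : fs::directory_iterator(data_dir)) {
    if (!entry.is_regular_file()) continue;
    if (ValidateFile(entry.path())) {
      PutToMaster(entry.path().filename().string(), entry.path());
    }
  }
}
```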
6. Master Modifications
The Master component requires several extensions:
- Associate SSD segment information with each client.
- If a client has an SSD, associate a persistence queue with it.
- Extend the replica metadata structure to record the IP address and replica size, giving clients the information they need to issue remote file reads.
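For illustration, the Master-side bookkeeping could be extended along these lines; all field and map names below are assumptions rather than the actual Master data structures:

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>

// Per-replica metadata for file replicas: where the SSD copy lives and how
// large it is, so a client can issue the remote file read described above.
struct FileReplicaInfo {
  std::string client_ip;
  std::uint16_t client_port = 0;
  std::uint64_t replica_size = 0;  // bytes, needed to size the RDMA read
};

// Per-client SSD state: the registered SSD segment and its persistence queue.
struct ClientSsdInfo {
  std::uint64_t ssd_capacity_bytes = 0;
  std::deque<std::string> persistence_queue;  // keys waiting to be persisted
};

// Master-side bookkeeping, keyed by client id and by object key.
std::unordered_map<std::string, ClientSsdInfo> g_client_ssd;
std::unordered_map<std::string, FileReplicaInfo> g_file_replicas;
```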
This feature is under active development. Suggestions and contributions are appreciated.