Open
Labels
- area/lerobot: LeRobot dataset format
- area/storage: Storage layer and backends
- priority/high: High priority
- size/M: Medium, 3-5 days
- status/ready: Ready to be picked up
- type/feature: New feature or functionality
Description
Problem
Workers in the distributed system currently treat config_hash as a local file path. In a distributed environment where workers run in separate pods/machines, they don't have access to the submit node's filesystem. This causes workers to fall back to empty configs, resulting in 0 frames written.
Solution
Store dataset configuration TOML content in TiKV using content-addressable storage (SHA-256 hash). Jobs reference configs by hash, and workers fetch the config content from TiKV.
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 Submit Node                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐                 │
│  │ Config File  │ ──▶ │ Read & Hash  │ ──▶ │ Store in TiKV│                 │
│  └──────────────┘     └──────────────┘     └──────────────┘                 │
│                              │                    │                         │
│                              ▼                    ▼                         │
│                        SHA-256 Hash  /roboflow/v1/configs/{hash}            │
│                              │                    │                         │
│                              ▼                    ▼                         │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  JobRecord {                                                          │  │
│  │    id: "job-abc",                                                     │  │
│  │    config_hash: "a3f5b...", // ← hash reference, NOT file path        │  │
│  │    ...                                                                │  │
│  │  }                                                                    │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       │ Job Queue (TiKV)
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 Worker Node                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐                 │
│  │ Claim Job    │ ──▶ │ Get by Hash  │ ──▶ │ Parse TOML   │                 │
│  └──────────────┘     └──────────────┘     └──────────────┘                 │
│         │                    │                    │                         │
│         ▼                    ▼                    ▼                         │
│   config_hash:         get_config()         LerobotConfig                   │
│    "a3f5b..."           from TiKV             from TOML                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation Tasks
1. Submit Command (src/bin/commands/submit.rs)
- Add `load_or_store_config()` helper function
- Read config file content
- Compute SHA-256 hash
- Store `ConfigRecord` in TiKV if not already present
- Use hash as `config_hash` in JobRecord
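
The submit-side steps above can be sketched roughly as follows. This is a minimal, std-only illustration and not the project's implementation: `InMemoryConfigStore` is a hypothetical stand-in for the TiKV client, `content_hash()` uses the standard library's `DefaultHasher` purely as a placeholder (the real helper would use SHA-256 via `ConfigRecord::compute_hash()`), and the function takes the already-read file content rather than a path. Only the key layout `/roboflow/v1/configs/{hash}` comes from the architecture diagram.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical in-memory stand-in for the TiKV config keyspace.
#[derive(Default)]
pub struct InMemoryConfigStore {
    entries: HashMap<String, String>,
}

impl InMemoryConfigStore {
    /// Key layout taken from the architecture diagram.
    fn key(hash: &str) -> String {
        format!("/roboflow/v1/configs/{hash}")
    }

    pub fn get_config(&self, hash: &str) -> Option<&str> {
        self.entries.get(&Self::key(hash)).map(String::as_str)
    }

    pub fn put_config(&mut self, hash: &str, content: &str) {
        self.entries.insert(Self::key(hash), content.to_owned());
    }
}

/// Placeholder digest. This is NOT SHA-256; it only keeps the sketch
/// dependency-free. The real code would call the SHA-256 helper instead.
pub fn content_hash(content: &str) -> String {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Stores `content` under its hash unless it is already present.
/// Returns the hash to put into `JobRecord.config_hash`, plus a flag
/// telling whether a new entry was written.
pub fn load_or_store_config(store: &mut InMemoryConfigStore, content: &str) -> (String, bool) {
    let hash = content_hash(content);
    if store.get_config(&hash).is_some() {
        return (hash, false); // identical content was submitted before
    }
    store.put_config(&hash, content);
    (hash, true)
}
```

Because the key is derived from the content, submitting the same TOML twice yields the same `config_hash` and a single stored entry, which is what makes the "immutable config" decision below cheap to enforce.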
2. Worker (crates/roboflow-distributed/src/worker.rs)
- Change `create_lerobot_config()` to async
- Fetch config from TiKV using `config_hash`
- Parse TOML content to `LerobotConfig`
- Fail job if config not found (don't fall back to empty config)
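
The "fail instead of falling back" rule for the worker might look like the sketch below. It is synchronous and uses a plain `HashMap` in place of the async `TikvClient::get_config()` call, whose signature is not shown in this issue; `ConfigFetchError` and `fetch_config_content()` are hypothetical names introduced only for illustration.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical error type for the worker's config lookup.
#[derive(Debug, PartialEq)]
pub enum ConfigFetchError {
    /// No config is stored under the hash referenced by the job.
    NotFound { config_hash: String },
}

impl fmt::Display for ConfigFetchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigFetchError::NotFound { config_hash } => {
                write!(f, "config {config_hash} not found in store")
            }
        }
    }
}

impl std::error::Error for ConfigFetchError {}

/// Looks up the TOML content for `config_hash`.
/// A missing entry is an error; there is deliberately no empty-config
/// fallback, so the caller can mark the job as failed.
pub fn fetch_config_content<'a>(
    configs: &'a HashMap<String, String>, // stand-in for the TiKV lookup
    config_hash: &str,
) -> Result<&'a str, ConfigFetchError> {
    configs
        .get(config_hash)
        .map(String::as_str)
        .ok_or_else(|| ConfigFetchError::NotFound {
            config_hash: config_hash.to_owned(),
        })
}
```

Returning a typed error rather than a default config makes the "0 frames written" failure mode described in the Problem section surface as a failed job.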
3. Config Parsing (crates/roboflow-dataset/src/lerobot/config.rs)
- Add `from_toml(content: &str)` method
- Keep existing `from_file(path)` for backward compatibility
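
One way to keep both entry points is to have `from_file()` read the file and delegate to `from_toml()`, so there is a single parsing path. The sketch below only illustrates that delegation: the real `LerobotConfig` fields are not listed in this issue, so the struct here has a single hypothetical `fps` field, `ConfigError` is a hypothetical error type, and a deliberately naive `key = value` scan stands in for a real TOML parser.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Illustrative config with one hypothetical field.
#[derive(Debug, PartialEq)]
pub struct LerobotConfig {
    pub fps: u32,
}

/// Hypothetical error type covering both entry points.
#[derive(Debug)]
pub enum ConfigError {
    Io(io::Error),
    Parse(String),
}

impl LerobotConfig {
    /// Parses config content that is already in memory, e.g. fetched by hash.
    /// The line scan below is a placeholder, not a TOML parser.
    pub fn from_toml(content: &str) -> Result<Self, ConfigError> {
        for line in content.lines() {
            if let Some((key, value)) = line.split_once('=') {
                if key.trim() == "fps" {
                    return value
                        .trim()
                        .parse::<u32>()
                        .map(|fps| Self { fps })
                        .map_err(|e| ConfigError::Parse(e.to_string()));
                }
            }
        }
        Err(ConfigError::Parse("missing `fps`".to_owned()))
    }

    /// Kept for backward compatibility: reads the file, then reuses `from_toml`.
    pub fn from_file(path: impl AsRef<Path>) -> Result<Self, ConfigError> {
        let content = fs::read_to_string(path).map_err(ConfigError::Io)?;
        Self::from_toml(&content)
    }
}
```

With this shape, the submit command can validate a file via `from_file()` while the worker calls `from_toml()` on content fetched from TiKV, and both go through the same parsing code.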
Existing Infrastructure
The following components are already implemented and ready to use:
| Component | Location |
|---|---|
| `ConfigRecord` struct | `crates/roboflow-distributed/src/tikv/schema.rs` |
| `ConfigKeys::config()` | `crates/roboflow-distributed/src/tikv/key.rs` |
| `TikvClient::put_config()` | `crates/roboflow-distributed/src/tikv/client.rs` |
| `TikvClient::get_config()` | `crates/roboflow-distributed/src/tikv/client.rs` |
| SHA-256 hashing | ConfigRecord::compute_hash() |
Design Decisions
| Question | Decision |
|---|---|
| Config not found in TiKV? | Fail job immediately |
| Backward compatibility? | Detect hash vs path (64-char hex = hash) |
| Config validation? | Parse TOML on submit before storing |
| Config updates? | Immutable (new content = new hash) |
| Caching? | Optional: LRU cache in worker for same config |
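
The backward-compatibility rule in the table (a 64-character hex string is a hash, anything else is a path) can be expressed as a small classifier. This is a sketch under assumptions: `ConfigRef`, `is_config_hash()`, and `classify_config_ref()` are hypothetical names, and accepting uppercase hex digits is a choice made here, not something the issue specifies.

```rust
/// How a job's `config_hash` field should be interpreted.
#[derive(Debug, PartialEq)]
pub enum ConfigRef<'a> {
    /// 64 hex characters: a SHA-256 digest to look up in TiKV.
    Hash(&'a str),
    /// Anything else: a legacy local file path.
    Path(&'a str),
}

/// Returns true if `value` has the shape of a hex-encoded SHA-256 digest.
/// Both lowercase and uppercase hex digits are accepted in this sketch.
pub fn is_config_hash(value: &str) -> bool {
    value.len() == 64 && value.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Classifies a `config_hash` value as a hash reference or a legacy path.
pub fn classify_config_ref(value: &str) -> ConfigRef<'_> {
    if is_config_hash(value) {
        ConfigRef::Hash(value)
    } else {
        ConfigRef::Path(value)
    }
}
```

A file whose relative path happens to be exactly 64 hex characters would be misclassified as a hash; that edge case seems unlikely in practice but may be worth a note in the worker's error message.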
Files to Modify
- `src/bin/commands/submit.rs`
- `crates/roboflow-distributed/src/worker.rs`
- `crates/roboflow-dataset/src/lerobot/config.rs`
Related
- Existing TiKV infrastructure in `roboflow-distributed` crate