feat: store dataset configs in TiKV for distributed workers #87

@zhexuany

Problem

Workers in the distributed system currently treat config_hash as a local file path. When workers run in separate pods or machines, they cannot read the submit node's filesystem, so they fall back to an empty config and the job writes 0 frames.

Solution

Store dataset configuration TOML content in TiKV using content-addressable storage (SHA-256 hash). Jobs reference configs by hash, and workers fetch the config content from TiKV.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              Submit Node                                     │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐                │
│  │  Config File │ ──▶ │ Read & Hash │ ──▶ │ Store in TiKV│                │
│  └──────────────┘     └──────────────┘     └──────────────┘                │
│                              │                      │                        │
│                              ▼                      ▼                        │
│                        SHA-256 Hash        /roboflow/v1/configs/{hash}      │
│                              │                      │                        │
│                              ▼                      ▼                        │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │  JobRecord {                                                          │  │
│  │      id: "job-abc",                                                   │  │
│  │      config_hash: "a3f5b...",  // ← hash reference, NOT file path    │  │
│  │      ...                                                              │  │
│  │  }                                                                    │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Job Queue (TiKV)
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Worker Node                                     │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐                │
│  │ Claim Job    │ ──▶ │ Get by Hash  │ ──▶ │ Parse TOML   │                │
│  └──────────────┘     └──────────────┘     └──────────────┘                │
│         │                   │                   │                          │
│         ▼                   ▼                   ▼                          │
│   config_hash:      get_config()       LerobotConfig                        │
│   "a3f5b..."          from TiKV          from TOML                          │
│                                                                           │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation Tasks

1. Submit Command (src/bin/commands/submit.rs)

  • Add load_or_store_config() helper function
  • Read config file content
  • Compute SHA-256 hash
  • Store ConfigRecord in TiKV if not already present
  • Use hash as config_hash in JobRecord
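
The submit-side steps above could look roughly like the following sketch. It is not the actual implementation: an in-memory HashMap stands in for TiKV, the hash and validation steps are passed in as closures (in the real code these would presumably be ConfigRecord::compute_hash() and a TOML parse), and the function signatures are assumptions. Only the key layout /roboflow/v1/configs/{hash} is taken from the architecture diagram.

```rust
use std::collections::HashMap;

/// Key layout taken from the architecture diagram in this issue.
fn config_key(hash: &str) -> String {
    format!("/roboflow/v1/configs/{}", hash)
}

/// Write `content` under its hash key unless the key already exists.
/// Returns true if a write happened. The HashMap is a stand-in for TiKV.
fn store_if_absent(store: &mut HashMap<String, String>, hash: &str, content: &str) -> bool {
    let key = config_key(hash);
    if store.contains_key(&key) {
        return false; // same content was stored before; nothing to do
    }
    store.insert(key, content.to_string());
    true
}

/// Hypothetical shape of load_or_store_config(): validate, hash, store, and
/// return the hash to put into JobRecord.config_hash.
/// `compute_hash` and `validate` are injected so the sketch needs no crates.
fn load_or_store_config<H, V>(
    store: &mut HashMap<String, String>,
    content: &str,
    compute_hash: H,
    validate: V,
) -> Result<String, String>
where
    H: Fn(&str) -> String,
    V: Fn(&str) -> Result<(), String>,
{
    // Reject invalid configs before anything is written.
    validate(content)?;
    let hash = compute_hash(content);
    store_if_absent(store, &hash, content);
    Ok(hash)
}
```

Because the key is derived from the content, submitting the same config twice is a no-op on the second call, which is what makes the "store if not already present" step safe to repeat.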

2. Worker (crates/roboflow-distributed/src/worker.rs)

  • Change create_lerobot_config() to async
  • Fetch config from TiKV using config_hash
  • Parse TOML content to LerobotConfig
  • Fail job if config not found (don't fall back to empty config)
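
A minimal sketch of the "fail instead of falling back" behaviour described above. The ConfigSource trait, the InMemoryConfigs stand-in, and the ConfigError type are hypothetical names introduced for illustration; the real lookup goes through TikvClient::get_config(), whose signature is not shown in this issue, and would be async rather than synchronous as here.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical read-only view of the config store (stand-in for the
/// TiKV client; the real call would be async).
trait ConfigSource {
    fn get_config(&self, hash: &str) -> Option<String>;
}

/// In-memory stand-in for TiKV, keyed by content hash.
struct InMemoryConfigs(HashMap<String, String>);

impl ConfigSource for InMemoryConfigs {
    fn get_config(&self, hash: &str) -> Option<String> {
        self.0.get(hash).cloned()
    }
}

/// Error surfaced to the job runner so the job is marked failed.
#[derive(Debug, PartialEq)]
enum ConfigError {
    NotFound { hash: String },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::NotFound { hash } => write!(f, "config {} not found in store", hash),
        }
    }
}

/// Fetch the TOML content for a job. A missing config is an error;
/// there is deliberately no fallback to an empty or default config.
fn fetch_config_content(
    source: &dyn ConfigSource,
    config_hash: &str,
) -> Result<String, ConfigError> {
    source.get_config(config_hash).ok_or_else(|| ConfigError::NotFound {
        hash: config_hash.to_string(),
    })
}
```

Returning an error here lets the caller fail the job explicitly, rather than silently producing a dataset with 0 frames.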

3. Config Parsing (crates/roboflow-dataset/src/lerobot/config.rs)

  • Add from_toml(content: &str) method
  • Keep existing from_file(path) for backward compatibility

Existing Infrastructure

The following components are already implemented and ready to use:

| Component | Location |
| --- | --- |
| ConfigRecord struct | crates/roboflow-distributed/src/tikv/schema.rs |
| ConfigKeys::config() | crates/roboflow-distributed/src/tikv/key.rs |
| TikvClient::put_config() | crates/roboflow-distributed/src/tikv/client.rs |
| TikvClient::get_config() | crates/roboflow-distributed/src/tikv/client.rs |
| SHA-256 hashing | ConfigRecord::compute_hash() |

Design Decisions

| Question | Decision |
| --- | --- |
| Config not found in TiKV? | Fail job immediately |
| Backward compatibility? | Detect hash vs path (64-char hex = hash) |
| Config validation? | Parse TOML on submit before storing |
| Config updates? | Immutable (new content = new hash) |
| Caching? | Optional: LRU cache in worker for same config |
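
The backward-compatibility rule (64-char hex = hash) can be sketched as below. ConfigRef and classify_config_ref are hypothetical names, not existing code in the repository.

```rust
/// How a worker should interpret the `config_hash` field of a job.
#[derive(Debug, PartialEq)]
enum ConfigRef<'a> {
    /// 64 hex characters: treat as a SHA-256 content hash to look up in TiKV.
    Hash(&'a str),
    /// Anything else: treat as a legacy file path.
    Path(&'a str),
}

/// Classify a `config_hash` value using the "64-char hex = hash" rule.
fn classify_config_ref(value: &str) -> ConfigRef<'_> {
    // A hex-encoded SHA-256 digest is exactly 64 ASCII hex digits.
    let is_hash = value.len() == 64 && value.bytes().all(|b| b.is_ascii_hexdigit());
    if is_hash {
        ConfigRef::Hash(value)
    } else {
        ConfigRef::Path(value)
    }
}
```

One caveat of this heuristic: a relative file whose name happens to be 64 hex characters with no extension would be classified as a hash. That seems unlikely in practice but may be worth a note in the implementation.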

Files to Modify

  • src/bin/commands/submit.rs
  • crates/roboflow-distributed/src/worker.rs
  • crates/roboflow-dataset/src/lerobot/config.rs

Related

  • Existing TiKV infrastructure in roboflow-distributed crate
