Overview
Transform roboflow into a distributed, fault-tolerant system using TiKV for coordination and a shared-nothing compute architecture.
Design Documents
📋 Updated Roadmap: See DISTRIBUTED_DESIGN.md for the 10 Gbps throughput design.
🗺️ Issue Alignment: See ROADMAP_ALIGNMENT.md for mapping legacy phases to the new 5-phase roadmap.
Key Characteristics
- Shared-Nothing Compute: All worker pods are identical peers, no central master
- State-Externalized: All state (jobs, locks, checkpoints) in TiKV
- Reactive Streaming: Async streams with backpressure and bounded memory (see the sketch after this list)
- Hardware Aware: Separate CPU (parsing) and GPU (encoding) workloads
- Spot-Friendly: Checkpointing enables Spot Instance usage
- 10 Gbps Target: Designed for high-throughput processing (~1125 files/hour at 4GB each)
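As a concrete shape for the Reactive Streaming point, here is a minimal sketch using the futures and tokio crates: bounded concurrency caps memory no matter how deep the queue gets. The URIs, the in-flight limit, and process_file are illustrative, not the actual pipeline.

```rust
use futures::stream::{self, StreamExt};

/// Illustrative stand-in for the real parse/encode pipeline.
async fn process_file(uri: String) -> std::io::Result<()> {
    println!("processing {uri}");
    Ok(())
}

#[tokio::main]
async fn main() {
    let uris = vec![
        "oss://bucket/raw-data/a.mcap".to_string(),
        "oss://bucket/raw-data/b.mcap".to_string(),
    ];

    stream::iter(uris)
        .map(process_file)
        // At most 4 files in flight; the stream only pulls the next URI
        // when a slot frees up, which is the backpressure bound.
        .buffer_unordered(4)
        .for_each(|res| async {
            if let Err(e) = res {
                eprintln!("file failed: {e}");
            }
        })
        .await;
}
```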
Architecture
```
┌───────────────────────────────────────────────────────────────────────┐
│                     Alibaba Cloud Infrastructure                      │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌──────────────┐      ┌──────────────────────────────────────────┐   │
│  │    Web UI    │─────▶│             ACK/EKS Cluster              │   │
│  │  (Monitor)   │      │                                          │   │
│  └──────────────┘      │  ┌────────────────────────────────────┐  │   │
│                        │  │ Scanner Actor (Leader-elected)     │  │   │
│  ┌──────────────┐      │  │  - List OSS for new files          │  │   │
│  │     CLI      │─────▶│  │  - Insert jobs into TiKV           │  │   │
│  │(Submit/Mgmt) │      │  └────────────────────────────────────┘  │   │
│  └──────────────┘      │                                          │   │
│                        │  ┌────────────────────────────────────┐  │   │
│  ┌──────────────┐      │  │ Worker Pods (N identical peers)    │  │   │
│  │ OSS (Input)  │◀────▶│  │  - Claim jobs via TiKV CAS         │  │   │
│  │  /raw-data/  │      │  │  - Stream from OSS (range requests)│  │   │
│  └──────────────┘      │  │  - Checkpoint progress to TiKV     │  │   │
│                        │  │  - Multipart upload to OSS         │  │   │
│  ┌──────────────┐      │  └────────────────────────────────────┘  │   │
│  │ OSS (Output) │◀────▶│                                          │   │
│  │  /lerobot/   │      │  ┌────────────────────────────────────┐  │   │
│  └──────────────┘      │  │ API Server                         │  │   │
│                        │  │  - REST API for job management     │  │   │
│  ┌──────────────┐      │  │  - Web UI for monitoring           │  │   │
│  │     TiKV     │◀────▶│  └────────────────────────────────────┘  │   │
│  │   Cluster    │      │                                          │   │
│  └──────────────┘      └──────────────────────────────────────────┘   │
└───────────────────────────────────────────────────────────────────────┘
```
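The "Claim jobs via TiKV CAS" step in the diagram can be sketched with the tikv-client crate's raw compare-and-swap API. The key layout matches the data model below, but the status encoding here is an illustrative assumption, not the confirmed schema.

```rust
use tikv_client::RawClient;

/// Atomically flip a job from "pending" to "claimed:<pod_id>".
/// Only one worker's CAS can observe the expected old value, so two
/// pods can never claim the same job.
async fn try_claim_job(
    client: &RawClient,
    job_hash: &str,
    pod_id: &str,
) -> tikv_client::Result<bool> {
    let (_prev, swapped) = client
        .compare_and_swap(
            format!("/jobs/{job_hash}"),
            Some(b"pending".to_vec()),                // expected current value
            format!("claimed:{pod_id}").into_bytes(), // value written on success
        )
        .await?;
    Ok(swapped)
}

#[tokio::main]
async fn main() -> tikv_client::Result<()> {
    // compare_and_swap requires the raw client's atomic mode.
    let client = RawClient::new(vec!["127.0.0.1:2379"])
        .await?
        .with_atomic_for_cas();
    if try_claim_job(&client, "abc123", "worker-0").await? {
        println!("job claimed");
    }
    Ok(())
}
```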
TiKV Data Model
| Key Pattern | Purpose |
|---|---|
| /jobs/{hash} | Job definition and status |
| /locks/{hash} | Distributed locks with TTL |
| /state/{hash} | Frame-level checkpoint |
| /heartbeat/{pod_id} | Worker liveness |
| /system/scanner_lock | Scanner leadership |
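For illustration, the key builders and a plausible value for /jobs/{hash} might look as follows in Rust. The field names are assumptions; the actual schema is defined in #40.

```rust
use serde::{Deserialize, Serialize};

// Key builders mirroring the table above, kept in one place so the
// schema has a single source of truth.
fn job_key(hash: &str) -> String { format!("/jobs/{hash}") }
fn lock_key(hash: &str) -> String { format!("/locks/{hash}") }
fn state_key(hash: &str) -> String { format!("/state/{hash}") }
fn heartbeat_key(pod_id: &str) -> String { format!("/heartbeat/{pod_id}") }

#[derive(Serialize, Deserialize)]
enum JobStatus { Pending, Claimed, Running, Done, Failed }

/// Hypothetical value stored under /jobs/{hash}, serialized as JSON.
#[derive(Serialize, Deserialize)]
struct Job {
    input_uri: String,          // e.g. oss://bucket/raw-data/file.mcap
    output_uri: String,         // e.g. oss://bucket/lerobot/
    status: JobStatus,
    claimed_by: Option<String>, // pod_id of the claiming worker, if any
}
```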
Implementation Phases
Phase 1-3: Storage & LeRobot ✅ COMPLETE
- Storage Foundation: [Phase 1.1] Add core dependencies for storage abstraction #10; [Phase 1.2] Define Storage trait and error types #11; [Phase 1.3] Implement LocalStorage backend #23; [Phase 1.4] Implement URL/path parsing for storage backends #24; [Phase 1.5] Create StorageFactory for backend instantiation #25
- Cloud Storage: [Phase 2.1] Implement OSS/S3 backend using object_store #13; [Phase 2.2] Implement multipart upload for large files #12; [Phase 2.3] Add retry logic and error handling for cloud operations #14; [Phase 2.4] Implement cached storage backend with local buffer #15
- LeRobot Writer: [Phase 3.1] Refactor LeRobotWriter to accept Storage backend #16; [Phase 3.2] Implement parallel episode upload with progress tracking #17
Phase 4: TiKV Coordination Layer ✅ COMPLETE
- [Phase 4.1] Add TiKV client and define distributed schema #40
- [Phase 4.2] Implement distributed lock manager with TTL #41
- [Phase 4.3] Implement Scanner actor with leader election #42
- [Phase 4.4] Implement Worker loop with job claiming #43
- [Phase 4.5] Implement heartbeat and zombie detection #44 (see the heartbeat sketch below)
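A minimal sketch of the heartbeat half of #44, assuming liveness is encoded as a Unix timestamp in the value and zombies are detected by comparing against a staleness threshold; the shipped TTL mechanism may differ.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};
use tikv_client::RawClient;

const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(5);
const ZOMBIE_AFTER_SECS: u64 = 30;

fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

/// Each worker periodically writes "now" under /heartbeat/{pod_id}.
async fn heartbeat_loop(client: RawClient, pod_id: String) -> tikv_client::Result<()> {
    let key = format!("/heartbeat/{pod_id}");
    let mut tick = tokio::time::interval(HEARTBEAT_INTERVAL);
    loop {
        tick.tick().await;
        client.put(key.clone(), now_secs().to_string().into_bytes()).await?;
    }
}

/// A peer whose heartbeat is older than the threshold is a zombie;
/// its claimed jobs can be reset to pending for other workers.
fn is_zombie(heartbeat_value: &[u8]) -> bool {
    std::str::from_utf8(heartbeat_value)
        .ok()
        .and_then(|s| s.parse::<u64>().ok())
        .map_or(true, |ts| now_secs().saturating_sub(ts) > ZOMBIE_AFTER_SECS)
}
```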
Phase 5: Checkpointing System ✅ COMPLETE
- [Phase 5] Frame-level checkpoint with TiKV and multipart resume #19 (see the checkpoint sketch below)
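One plausible shape for the frame-level checkpoint behind #19, with illustrative field names: the point is that it captures both the read cursor and the in-flight multipart upload, so either side can resume.

```rust
use serde::{Deserialize, Serialize};
use tikv_client::RawClient;

/// Hypothetical value stored under /state/{hash}.
#[derive(Serialize, Deserialize)]
struct Checkpoint {
    frames_done: u64,          // resume decoding at this frame index
    upload_id: Option<String>, // in-progress multipart upload, if any
    part_etags: Vec<String>,   // ETags of the parts already uploaded
}

async fn save_checkpoint(
    client: &RawClient,
    job_hash: &str,
    cp: &Checkpoint,
) -> anyhow::Result<()> {
    let bytes = serde_json::to_vec(cp)?;
    client.put(format!("/state/{job_hash}"), bytes).await?;
    Ok(())
}
```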
Phase 6: Storage Enhancements ✅ COMPLETE
- [Phase 6.1] Add streaming S3 reader with range requests #45 (see the range-read sketch below)
- [Phase 6.2] Add parallel multipart uploads #46
- [Phase 5.1] Add storage support to StreamingDatasetConverter #26
- [Phase 5.2] Update CLI to accept cloud URLs #27
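A sketch of the range-request reader from #45 against the object_store crate; the chunk size is arbitrary, and the exact get_range signature varies a little between object_store versions.

```rust
use object_store::{path::Path, ObjectStore};
use std::sync::Arc;

const CHUNK: usize = 8 * 1024 * 1024; // 8 MiB per range request

/// Walk a large object in fixed-size chunks so only one chunk is
/// resident in memory at a time, regardless of object size.
async fn stream_object(
    store: Arc<dyn ObjectStore>,
    location: &Path,
) -> object_store::Result<()> {
    let size = store.head(location).await?.size;
    let mut offset = 0;
    while offset < size {
        let end = (offset + CHUNK).min(size);
        let _chunk = store.get_range(location, offset..end).await?;
        // Hand the chunk (bytes::Bytes) to the parser here.
        offset = end;
    }
    Ok(())
}
```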
Phase 7: Pipeline Integration 🚧 IN PROGRESS
- [Phase 1] Integrate LerobotWriter with Worker.process_job() #72 ⭐ CRITICAL
- [Phase 1] Add checkpoint save during pipeline processing #73
- [Phase 7.1] Integrate pipeline with checkpoint hooks [PARTIALLY COMPLETE] #47
- [Phase 7.2] Add graceful shutdown handling [READY 🔥] #48 (see the shutdown sketch below)
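The shutdown flow of #48 might look like the following with tokio and tokio-util's CancellationToken: finish the in-flight unit, persist a checkpoint, and exit so another pod can resume from /state/{hash}. A sketch, not the actual implementation.

```rust
use tokio_util::sync::CancellationToken;

/// Illustrative stand-in: claim one job from TiKV and process it.
async fn process_next_job() {}

async fn worker_loop(shutdown: CancellationToken) {
    loop {
        tokio::select! {
            _ = shutdown.cancelled() => {
                // Flush the current checkpoint and release the job lock
                // here, then exit cleanly.
                break;
            }
            _ = process_next_job() => {} // finished one job; loop for the next
        }
    }
}

#[tokio::main]
async fn main() {
    let shutdown = CancellationToken::new();
    let trigger = shutdown.clone();
    tokio::spawn(async move {
        // SIGTERM (Kubernetes) would get a similar listener.
        tokio::signal::ctrl_c().await.expect("install signal handler");
        trigger.cancel();
    });
    worker_loop(shutdown).await;
}
```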
Phase 8: GPU Acceleration (Optional)
- [Phase 8] Add NVENC GPU video encoding [PARALLEL ⏳] #49
Phase 9: Kubernetes Deployment
- [Phase 9.1] Implement long-running Worker Deployment [READY TO START] #18
- [Phase 6.2] Create container image and Helm chart [BLOCKED by #18] #20
Phase 10: CLI & Web UI
- [Phase 10.1] Add CLI for job submission [READY TO START] #50 (see the CLI sketch below)
- [Phase 10.2] Add web UI for job monitoring and management #51
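The command surface of #50 (shown under "CLI Commands" below) could map onto clap's derive API roughly as follows; the subcommand and flag names mirror those examples but are not confirmed against the implementation.

```rust
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "roboflow")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Submit files matching a glob for conversion
    Submit {
        input: String, // e.g. oss://bucket/path/*.mcap
        #[arg(long)]
        output: String, // e.g. oss://bucket/output/
    },
    /// Start a long-running worker
    Worker,
    /// Start the monitoring UI server
    Ui {
        #[arg(long, default_value_t = 8080)]
        port: u16,
    },
}

fn main() {
    match Cli::parse().command {
        Command::Submit { input, output } => println!("submit {input} -> {output}"),
        Command::Worker => println!("start worker loop"),
        Command::Ui { port } => println!("serve ui on :{port}"),
    }
}
```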
Phase 11: Observability
- [Phase 7.1] Add Prometheus metrics for monitoring #21 (see the metrics sketch below)
- [Phase 7.2] Add structured logging with SLS integration #22
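For #21, a minimal sketch with the prometheus crate; the metric names are illustrative, and a real worker would expose the encoded output on a /metrics HTTP endpoint.

```rust
use prometheus::{register_histogram, register_int_counter, Encoder, TextEncoder};

fn main() {
    // Registered in the default registry; a scrape renders all of them.
    let jobs_done =
        register_int_counter!("roboflow_jobs_completed_total", "Completed jobs").unwrap();
    let job_secs =
        register_histogram!("roboflow_job_duration_seconds", "Job wall time").unwrap();

    jobs_done.inc();
    job_secs.observe(42.0);

    // Render the text exposition format a Prometheus scrape would fetch.
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buf)
        .unwrap();
    println!("{}", String::from_utf8(buf).unwrap());
}
```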
New 5-Phase Roadmap (from DISTRIBUTED_DESIGN.md)
| New Phase | Description | Key Issues |
|---|---|---|
| Phase 1 | Pipeline Integration | #72, #73, #47, #48 |
| Phase 2 | Prefetch Pipeline | (Future issues) |
| Phase 3 | GPU Acceleration | #49 |
| Phase 4 | Production Hardening | #20, #21, #22 |
| Phase 5 | Multi-Format Support | (KPS already implemented) |
Dependency Graph
```
Phase 1-6: ✅ COMPLETE

Phase 7: Pipeline Integration (CURRENT PRIORITY)
  #72 (LerobotWriter) ─► #73 (Checkpoint Save) ─► #47 (Pipeline Hooks)
                                               └─► #48 (Graceful Shutdown)

Phase 8: GPU (parallel, optional)
  #47 ─► #49 (NVENC)

Phase 9: Kubernetes
  #47 + #48 ─► #18 ─► #20

Phase 10: CLI & Web UI
  #40 ─► #50 (CLI) ─► #51 (Web UI)

Phase 11: Observability
  #47 ─► #21 (Metrics)
  #18 ─► #22 (Logging)
```
User Interaction
CLI Commands
```bash
# Submit jobs
roboflow submit oss://bucket/path/*.mcap --output oss://bucket/output/

# Manage jobs
roboflow jobs list --status failed
roboflow jobs retry <job-id>
roboflow jobs cancel <job-id>

# Start worker
roboflow worker

# Start UI server
roboflow ui --port 8080
```
Web UI Features
- Dashboard with job status overview
- Job list with filtering and search
- Job detail with progress and logs
- Retry/Cancel/Delete actions
- Worker status monitoring
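Behind these features sits the API Server from the architecture diagram. A hedged sketch of what the job-list endpoint could look like with axum; the route path and payload shapes are assumptions.

```rust
use axum::{extract::Query, routing::get, Json, Router};
use serde::{Deserialize, Serialize};

#[derive(Serialize)]
struct JobSummary {
    id: String,
    status: String,
}

#[derive(Deserialize)]
struct JobFilter {
    status: Option<String>,
}

/// GET /api/jobs?status=failed backs both the Web UI job list and
/// `roboflow jobs list`. A real handler would scan the /jobs/ prefix
/// in TiKV; this stub returns an empty list.
async fn list_jobs(Query(filter): Query<JobFilter>) -> Json<Vec<JobSummary>> {
    let _ = filter.status; // would filter the TiKV scan results
    Json(Vec::new())
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/api/jobs", get(list_jobs));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```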
Success Criteria
- Process 1000+ files from OSS without data loss
- Survive random pod terminations (Spot Instance friendly)
- Horizontal scaling with N worker pods
- Resume interrupted jobs from frame-level checkpoint
- CLI for job submission and management
- Web UI for monitoring and operations
- < 10% overhead compared to local processing
- Full observability with metrics and logs
- 10 Gbps throughput with 20-24 GPU workers
Architecture Benefits
- Resilience: Unplug half the cluster and the remaining nodes resume the work
- Scalability: Change replicas from 20 to 100 with no config changes
- Cost Efficiency: Spot Instances + GPU acceleration
- Simplicity: One Rust binary + TiKV, no Hadoop/Spark
- Operability: CLI and Web UI for easy management