[Epic] Distributed Roboflow with Alibaba Cloud (OSS + ACK) #9

@zhexuany

Description

Overview

Transform roboflow into a distributed, fault-tolerant system that uses TiKV for coordination and a shared-nothing compute architecture.

Design Documents

📋 Updated Roadmap: See DISTRIBUTED_DESIGN.md for the 10 Gbps throughput design.

🗺️ Issue Alignment: See ROADMAP_ALIGNMENT.md for mapping legacy phases to the new 5-phase roadmap.

Key Characteristics

  • Shared-Nothing Compute: All worker pods are identical peers, no central master
  • State-Externalized: All state (jobs, locks, checkpoints) in TiKV
  • Reactive Streaming: Async streams with backpressure, bounded memory
  • Hardware Aware: Separate CPU (parsing) and GPU (encoding) workloads
  • Spot-Friendly: Checkpointing enables Spot Instance usage
  • 10 Gbps Target: Designed for high-throughput processing (~1125 files/hour at 4 GB each)
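
The file-rate figure above follows directly from the line rate. A quick sanity check, assuming decimal units and 4 GB files:

```rust
/// Files per hour achievable at a given line rate, assuming the pipeline
/// sustains the full rate and every file is `file_gb` gigabytes.
fn files_per_hour(gbps: f64, file_gb: f64) -> f64 {
    let gb_per_sec = gbps / 8.0; // 10 Gbps -> 1.25 GB/s (decimal units)
    gb_per_sec * 3600.0 / file_gb // 1.25 * 3600 = 4500 GB/hour
}

fn main() {
    println!("{}", files_per_hour(10.0, 4.0)); // 1125
}
```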

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                        Alibaba Cloud Infrastructure                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐    ┌──────────────────────────────────────────────┐  │
│  │   Web UI     │───▶│              ACK/EKS Cluster                  │  │
│  │  (Monitor)   │    │                                               │  │
│  └──────────────┘    │  ┌────────────────────────────────────────┐  │  │
│                      │  │      Scanner Actor (Leader-elected)     │  │  │
│  ┌──────────────┐    │  │  - List OSS for new files               │  │  │
│  │   CLI        │───▶│  │  - Insert jobs into TiKV                │  │  │
│  │ (Submit/Mgmt)│    │  └────────────────────────────────────────┘  │  │
│  └──────────────┘    │                                               │  │
│                      │  ┌────────────────────────────────────────┐  │  │
│  ┌──────────────┐    │  │      Worker Pods (N identical peers)    │  │  │
│  │ OSS (Input)  │◀──▶│  │  - Claim jobs via TiKV CAS              │  │  │
│  │  /raw-data/  │    │  │  - Stream from OSS (range requests)     │  │  │
│  └──────────────┘    │  │  - Checkpoint progress to TiKV          │  │  │
│                      │  │  - Multipart upload to OSS              │  │  │
│  ┌──────────────┐    │  └────────────────────────────────────────┘  │  │
│  │ OSS (Output) │◀──▶│                                               │  │
│  │  /lerobot/   │    │  ┌────────────────────────────────────────┐  │  │
│  └──────────────┘    │  │           API Server                    │  │  │
│                      │  │  - REST API for job management          │  │  │
│  ┌──────────────┐    │  │  - Web UI for monitoring                │  │  │
│  │    TiKV      │◀──▶│  └────────────────────────────────────────┘  │  │
│  │   Cluster    │    │                                               │  │
│  └──────────────┘    └──────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
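
Because worker pods are identical peers with no central master, contention over a job is resolved entirely at the storage layer: the first worker to compare-and-swap the job's status in TiKV wins, and everyone else backs off. A minimal sketch of that claim protocol, with an in-memory map standing in for the real TiKV client (the key layout matches the data model below, but the status values and function names here are illustrative, not taken from the codebase):

```rust
use std::collections::HashMap;

/// Stand-in for a TiKV keyspace; the real system would use a TiKV client's
/// compare-and-swap primitive instead of this in-process map.
struct KvStore {
    data: HashMap<String, String>,
}

impl KvStore {
    /// Atomically set `key` to `new` only if it currently equals `expected`.
    fn compare_and_swap(&mut self, key: &str, expected: &str, new: &str) -> bool {
        if self.data.get(key).map(|v| v.as_str() == expected).unwrap_or(false) {
            self.data.insert(key.to_string(), new.to_string());
            true
        } else {
            false
        }
    }
}

/// A worker claims a job by flipping its status from "pending" to
/// "claimed:<pod_id>"; losing the CAS means another peer got there first.
fn claim_job(kv: &mut KvStore, job_hash: &str, pod_id: &str) -> bool {
    let key = format!("/jobs/{job_hash}");
    kv.compare_and_swap(&key, "pending", &format!("claimed:{pod_id}"))
}

fn main() {
    let mut kv = KvStore { data: HashMap::new() };
    kv.data.insert("/jobs/abc123".into(), "pending".into());

    // First worker wins the CAS; the second sees the changed value and backs off.
    assert!(claim_job(&mut kv, "abc123", "worker-1"));
    assert!(!claim_job(&mut kv, "abc123", "worker-2"));
    println!("{}", kv.data["/jobs/abc123"]); // claimed:worker-1
}
```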

TiKV Data Model

Key Pattern             Purpose
/jobs/{hash}            Job definition and status
/locks/{hash}           Distributed locks with TTL
/state/{hash}           Frame-level checkpoint
/heartbeat/{pod_id}     Worker liveness
/system/scanner_lock    Scanner leadership
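
Since the scanner, workers, and API server all read and write these keys, it helps to centralize the encoding in small constructor functions so every component agrees on it. A sketch (the actual codebase may structure this differently):

```rust
/// Key builders for the TiKV data model; one shared module keeps the
/// scanner, workers, and API server from drifting on the encoding.
fn job_key(hash: &str) -> String { format!("/jobs/{hash}") }
fn lock_key(hash: &str) -> String { format!("/locks/{hash}") }
fn state_key(hash: &str) -> String { format!("/state/{hash}") }
fn heartbeat_key(pod_id: &str) -> String { format!("/heartbeat/{pod_id}") }
const SCANNER_LOCK_KEY: &str = "/system/scanner_lock";

fn main() {
    assert_eq!(job_key("abc123"), "/jobs/abc123");
    assert_eq!(heartbeat_key("worker-7"), "/heartbeat/worker-7");
    println!("{SCANNER_LOCK_KEY}");
}
```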

Implementation Phases

Phase 1-3: Storage & LeRobot ✅ COMPLETE

Phase 4: TiKV Coordination Layer ✅ COMPLETE

Phase 5: Checkpointing System ✅ COMPLETE

Phase 6: Storage Enhancements ✅ COMPLETE

Phase 7: Pipeline Integration 🚧 IN PROGRESS

Phase 8: GPU Acceleration (Optional)

Phase 9: Kubernetes Deployment

Phase 10: CLI & Web UI

Phase 11: Observability

New 5-Phase Roadmap (from DISTRIBUTED_DESIGN.md)

New Phase   Description             Key Issues
Phase 1     Pipeline Integration    #72, #73, #47, #48
Phase 2     Prefetch Pipeline       (Future issues)
Phase 3     GPU Acceleration        #49
Phase 4     Production Hardening    #20, #21, #22
Phase 5     Multi-Format Support    (KPS already implemented)

Dependency Graph

Phase 1-6: ✅ COMPLETE

Phase 7: Pipeline Integration (CURRENT PRIORITY)
#72 (LerobotWriter) ─► #73 (Checkpoint Save) ─► #47 (Pipeline Hooks)
                                              └─► #48 (Graceful Shutdown)

Phase 8: GPU (parallel, optional)
#47 ─► #49 (NVENC)

Phase 9: Kubernetes
#47 + #48 ─► #18 ─► #20

Phase 10: CLI & Web UI
#40 ─► #50 (CLI) ─► #51 (Web UI)

Phase 11: Observability
#47 ─► #21 (Metrics)
#18 ─► #22 (Logging)

User Interaction

CLI Commands

# Submit jobs
roboflow submit oss://bucket/path/*.mcap --output oss://bucket/output/

# Manage jobs
roboflow jobs list --status failed
roboflow jobs retry <job-id>
roboflow jobs cancel <job-id>

# Start worker
roboflow worker

# Start UI server
roboflow ui --port 8080

Web UI Features

  • Dashboard with job status overview
  • Job list with filtering and search
  • Job detail with progress and logs
  • Retry/Cancel/Delete actions
  • Worker status monitoring

Success Criteria

  • Process 1000+ files from OSS without data loss
  • Survive random pod terminations (Spot Instance friendly)
  • Horizontal scaling with N worker pods
  • Resume interrupted jobs from frame-level checkpoint
  • CLI for job submission and management
  • Web UI for monitoring and operations
  • < 10% overhead compared to local processing
  • Full observability with metrics and logs
  • 10 Gbps throughput with 20-24 GPU workers
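
The "resume from frame-level checkpoint" criterion hinges on what a checkpoint record stored under /state/{hash} contains. One way to picture it, with illustrative field names (the real record layout is defined by the Phase 5 checkpointing system):

```rust
/// Illustrative shape of a frame-level checkpoint; a worker that claims an
/// interrupted job reads this from TiKV instead of starting over.
#[derive(Debug, Clone, PartialEq)]
struct Checkpoint {
    job_hash: String,
    last_frame: u64,  // last frame fully written to output
    upload_part: u32, // last completed OSS multipart-upload part
}

/// Resume at the frame after the last durable one.
fn resume_from(cp: &Checkpoint) -> u64 {
    cp.last_frame + 1
}

fn main() {
    let cp = Checkpoint { job_hash: "abc123".into(), last_frame: 41, upload_part: 3 };
    println!("{}", resume_from(&cp)); // 42
}
```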

Architecture Benefits

  1. Resilience: Unplug half the cluster, remaining nodes resume work
  2. Scalability: Change replicas from 20 to 100, no config changes
  3. Cost Efficiency: Spot Instances + GPU acceleration
  4. Simplicity: One Rust binary + TiKV, no Hadoop/Spark
  5. Operability: CLI and Web UI for easy management

Metadata

Labels

  • area/cloud (Cloud provider integrations)
  • epic (Large feature spanning multiple issues)
  • priority/high (High priority)
  • type/feature (New feature or functionality)
