[Epic] Distributed Roboflow with Alibaba Cloud (OSS + ACK) #9

@zhexuany

Description

Overview

Transform roboflow into a distributed, fault-tolerant system that uses TiKV for coordination and a shared-nothing compute architecture.

Design Documents

📋 Updated Roadmap: See DISTRIBUTED_DESIGN.md for the 10 Gbps throughput design.

🗺️ Issue Alignment: See ROADMAP_ALIGNMENT.md for mapping legacy phases to the new 5-phase roadmap.

Key Characteristics

  • Shared-Nothing Compute: All worker pods are identical peers, no central master
  • State-Externalized: All state (jobs, locks, checkpoints) in TiKV
  • Reactive Streaming: Async streams with backpressure, bounded memory
  • Hardware Aware: Separate CPU (parsing) and GPU (encoding) workloads
  • Spot-Friendly: Checkpointing enables Spot Instance usage
  • 10 Gbps Target: Designed for high-throughput processing (~1125 files/hour at 4 GB each)
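
The file-rate figure above follows directly from the line rate. A quick sanity check, assuming decimal units and 4 GB files:

```rust
/// Files per hour achievable at a given line rate, assuming the pipeline
/// sustains the full rate and every file is `file_gb` gigabytes.
fn files_per_hour(gbps: f64, file_gb: f64) -> f64 {
    let gb_per_sec = gbps / 8.0; // 10 Gbps -> 1.25 GB/s (decimal units)
    gb_per_sec * 3600.0 / file_gb // 1.25 * 3600 = 4500 GB/hour
}

fn main() {
    println!("{}", files_per_hour(10.0, 4.0)); // 1125
}
```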

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                        Alibaba Cloud Infrastructure                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐    ┌──────────────────────────────────────────────┐  │
│  │   Web UI     │───▶│              ACK/EKS Cluster                  │  │
│  │  (Monitor)   │    │                                               │  │
│  └──────────────┘    │  ┌────────────────────────────────────────┐  │  │
│                      │  │      Scanner Actor (Leader-elected)     │  │  │
│  ┌──────────────┐    │  │  - List OSS for new files               │  │  │
│  │   CLI        │───▶│  │  - Insert jobs into TiKV                │  │  │
│  │ (Submit/Mgmt)│    │  └────────────────────────────────────────┘  │  │
│  └──────────────┘    │                                               │  │
│                      │  ┌────────────────────────────────────────┐  │  │
│  ┌──────────────┐    │  │      Worker Pods (N identical peers)    │  │  │
│  │ OSS (Input)  │◀──▶│  │  - Claim jobs via TiKV CAS              │  │  │
│  │  /raw-data/  │    │  │  - Stream from OSS (range requests)     │  │  │
│  └──────────────┘    │  │  - Checkpoint progress to TiKV          │  │  │
│                      │  │  - Multipart upload to OSS              │  │  │
│  ┌──────────────┐    │  └────────────────────────────────────────┘  │  │
│  │ OSS (Output) │◀──▶│                                               │  │
│  │  /lerobot/   │    │  ┌────────────────────────────────────────┐  │  │
│  └──────────────┘    │  │           API Server                    │  │  │
│                      │  │  - REST API for job management          │  │  │
│  ┌──────────────┐    │  │  - Web UI for monitoring                │  │  │
│  │    TiKV      │◀──▶│  └────────────────────────────────────────┘  │  │
│  │   Cluster    │    │                                               │  │
│  └──────────────┘    └──────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
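
Because worker pods are identical peers with no central master, contention over a job is resolved entirely at the storage layer: the first worker to compare-and-swap the job's status in TiKV wins, and everyone else backs off. A minimal sketch of that claim protocol, with an in-memory map standing in for the real TiKV client (the key layout matches the data model below, but the status values and function names here are illustrative, not taken from the codebase):

```rust
use std::collections::HashMap;

/// Stand-in for a TiKV keyspace; the real system would use a TiKV client's
/// compare-and-swap primitive instead of this in-process map.
struct KvStore {
    data: HashMap<String, String>,
}

impl KvStore {
    /// Atomically set `key` to `new` only if it currently equals `expected`.
    fn compare_and_swap(&mut self, key: &str, expected: &str, new: &str) -> bool {
        if self.data.get(key).map(|v| v.as_str() == expected).unwrap_or(false) {
            self.data.insert(key.to_string(), new.to_string());
            true
        } else {
            false
        }
    }
}

/// A worker claims a job by flipping its status from "pending" to
/// "claimed:<pod_id>"; losing the CAS means another peer got there first.
fn claim_job(kv: &mut KvStore, job_hash: &str, pod_id: &str) -> bool {
    let key = format!("/jobs/{job_hash}");
    kv.compare_and_swap(&key, "pending", &format!("claimed:{pod_id}"))
}

fn main() {
    let mut kv = KvStore { data: HashMap::new() };
    kv.data.insert("/jobs/abc123".into(), "pending".into());

    // First worker wins the CAS; the second sees the changed value and backs off.
    assert!(claim_job(&mut kv, "abc123", "worker-1"));
    assert!(!claim_job(&mut kv, "abc123", "worker-2"));
    println!("{}", kv.data["/jobs/abc123"]); // claimed:worker-1
}
```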

TiKV Data Model

Key Pattern             Purpose
/jobs/{hash}            Job definition and status
/locks/{hash}           Distributed locks with TTL
/state/{hash}           Frame-level checkpoint
/heartbeat/{pod_id}     Worker liveness
/system/scanner_lock    Scanner leadership
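
Since the scanner, workers, and API server all read and write these keys, it helps to centralize the encoding in small constructor functions so every component agrees on it. A sketch (the actual codebase may structure this differently):

```rust
/// Key builders for the TiKV data model; one shared module keeps the
/// scanner, workers, and API server from drifting on the encoding.
fn job_key(hash: &str) -> String { format!("/jobs/{hash}") }
fn lock_key(hash: &str) -> String { format!("/locks/{hash}") }
fn state_key(hash: &str) -> String { format!("/state/{hash}") }
fn heartbeat_key(pod_id: &str) -> String { format!("/heartbeat/{pod_id}") }
const SCANNER_LOCK_KEY: &str = "/system/scanner_lock";

fn main() {
    assert_eq!(job_key("abc123"), "/jobs/abc123");
    assert_eq!(heartbeat_key("worker-7"), "/heartbeat/worker-7");
    println!("{SCANNER_LOCK_KEY}");
}
```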

Implementation Phases

Phase 1-3: Storage & LeRobot ✅ COMPLETE

Phase 4: TiKV Coordination Layer ✅ COMPLETE

Phase 5: Checkpointing System ✅ COMPLETE

Phase 6: Storage Enhancements ✅ COMPLETE

Phase 7: Pipeline Integration 🚧 IN PROGRESS

Phase 8: GPU Acceleration (Optional)

Phase 9: Kubernetes Deployment

Phase 10: CLI & Web UI

Phase 11: Observability

New 5-Phase Roadmap (from DISTRIBUTED_DESIGN.md)

New Phase   Description             Key Issues
Phase 1     Pipeline Integration    #72, #73, #47, #48
Phase 2     Prefetch Pipeline       (Future issues)
Phase 3     GPU Acceleration        #49
Phase 4     Production Hardening    #20, #21, #22
Phase 5     Multi-Format Support    (KPS already implemented)

Dependency Graph

Phase 1-6: ✅ COMPLETE

Phase 7: Pipeline Integration (CURRENT PRIORITY)
#72 (LerobotWriter) ─► #73 (Checkpoint Save) ─► #47 (Pipeline Hooks)
                                              └─► #48 (Graceful Shutdown)

Phase 8: GPU (parallel, optional)
#47 ─► #49 (NVENC)

Phase 9: Kubernetes
#47 + #48 ─► #18 ─► #20

Phase 10: CLI & Web UI
#40 ─► #50 (CLI) ─► #51 (Web UI)

Phase 11: Observability
#47 ─► #21 (Metrics)
#18 ─► #22 (Logging)

User Interaction

CLI Commands

# Submit jobs
roboflow submit oss://bucket/path/*.mcap --output oss://bucket/output/

# Manage jobs
roboflow jobs list --status failed
roboflow jobs retry <job-id>
roboflow jobs cancel <job-id>

# Start worker
roboflow worker

# Start UI server
roboflow ui --port 8080

Web UI Features

  • Dashboard with job status overview
  • Job list with filtering and search
  • Job detail with progress and logs
  • Retry/Cancel/Delete actions
  • Worker status monitoring

Success Criteria

  • Process 1000+ files from OSS without data loss
  • Survive random pod terminations (Spot Instance friendly)
  • Horizontal scaling with N worker pods
  • Resume interrupted jobs from frame-level checkpoint
  • CLI for job submission and management
  • Web UI for monitoring and operations
  • < 10% overhead compared to local processing
  • Full observability with metrics and logs
  • 10 Gbps throughput with 20-24 GPU workers
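
The "resume from frame-level checkpoint" criterion hinges on what a checkpoint record stored under /state/{hash} contains. One way to picture it, with illustrative field names (the real record layout is defined by the Phase 5 checkpointing system):

```rust
/// Illustrative shape of a frame-level checkpoint; a worker that claims an
/// interrupted job reads this from TiKV instead of starting over.
#[derive(Debug, Clone, PartialEq)]
struct Checkpoint {
    job_hash: String,
    last_frame: u64,  // last frame fully written to output
    upload_part: u32, // last completed OSS multipart-upload part
}

/// Resume at the frame after the last durable one.
fn resume_from(cp: &Checkpoint) -> u64 {
    cp.last_frame + 1
}

fn main() {
    let cp = Checkpoint { job_hash: "abc123".into(), last_frame: 41, upload_part: 3 };
    println!("{}", resume_from(&cp)); // 42
}
```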

Architecture Benefits

  1. Resilience: Unplug half the cluster, remaining nodes resume work
  2. Scalability: Change replicas from 20 to 100, no config changes
  3. Cost Efficiency: Spot Instances + GPU acceleration
  4. Simplicity: One Rust binary + TiKV, no Hadoop/Spark
  5. Operability: CLI and Web UI for easy management

Metadata

Labels

  • area/cloud (Cloud provider integrations)
  • epic (Large feature spanning multiple issues)
  • priority/high (High priority)
  • type/feature (New feature or functionality)
