A schema-driven robotics data pipeline for converting between ROS bags, MCAP, and training datasets. Build processing pipelines with a fluent API, backed by robocodec for multi-format codec support.

Roboflow

License: MulanPSL-2.0 Rust

English | 简体中文

Roboflow is a distributed data transformation pipeline for converting robotics bag/MCAP files to trainable datasets (LeRobot format).

Features

  • Horizontal Scaling: Distributed processing with TiKV coordination
  • Schema-Driven Translation: CDR (ROS1/ROS2), Protobuf, JSON message formats
  • Zero-Copy Allocation: arena-based memory management (~22% overhead reduction)
  • Cloud Storage: Native S3 and Alibaba OSS support for distributed workloads
  • High Throughput: Parallel chunk processing up to ~1800 MB/s
  • LeRobot Export: Convert to LeRobot dataset format for robotics learning
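The zero-copy idea behind the arena bullet can be illustrated with a generic sketch (the `FieldView` and `decode` names here are hypothetical, not Roboflow's actual API): decoding hands out borrowed slices into the original message buffer, so large payloads such as image data are never duplicated.

```rust
/// Borrowed view of one decoded field. Illustrative only; not
/// Roboflow's actual API.
struct FieldView<'a> {
    name: &'static str,
    bytes: &'a [u8], // borrowed from the message buffer, never copied
}

/// Toy decoder: slices fields out of `buf` without copying any payload bytes.
fn decode(buf: &[u8]) -> Vec<FieldView<'_>> {
    vec![
        FieldView { name: "header", bytes: &buf[0..4] },
        FieldView { name: "payload", bytes: &buf[4..] },
    ]
}
```

Because the returned views alias the input buffer, their pointers are equal to the corresponding offsets in the original allocation; only the small `FieldView` descriptors are allocated.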

Architecture

Roboflow uses a Kubernetes-inspired distributed control plane for fault-tolerant batch processing.

┌─────────────────────────────────────────────────────────────────────┐
│                         Control Plane                               │
├─────────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │
│  │   Scanner    │  │   Reaper     │  │  Finalizer   │              │
│  │  Controller  │  │  Controller  │  │  Controller  │              │
│  │              │  │              │  │              │              │
│  │ • Discover   │  │ • Detect     │  │ • Monitor    │              │
│  │   files      │  │   stale pods │  │   batches    │              │
│  │ • Create     │  │ • Reclaim    │  │ • Trigger    │              │
│  │   jobs       │  │   orphaned   │  │   merge      │              │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘              │
│         │                 │                 │                       │
│         └─────────────────┼─────────────────┘                       │
│                           │                                         │
│                           ▼                                         │
│                    ┌─────────────┐                                  │
│                    │    TiKV     │                                  │
│                    │  (etcd-like │                                  │
│                    │   state)    │                                  │
│                    └─────────────┘                                  │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                          Data Plane                                 │
├─────────────────────────────────────────────────────────────────────┤
│  Worker (pod-abc)    Worker (pod-def)    Worker (pod-xyz)           │
│  • Claim jobs        • Claim jobs        • Claim jobs               │
│  • Send heartbeat    • Send heartbeat    • Send heartbeat           │
│  • Process data      • Process data      • Process data             │
│  • Save checkpoint   • Save checkpoint   • Save checkpoint          │
└─────────────────────────────────────────────────────────────────────┘
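The claim/heartbeat/reap loop in the diagram can be sketched in plain Rust with a toy in-memory map standing in for TiKV (all names and the lease logic below are illustrative assumptions, not Roboflow's actual implementation): each worker refreshes a lease timestamp on its claimed jobs, and the reaper releases any claim whose lease has gone stale.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy stand-in for the TiKV-backed job table (illustration only).
struct JobStore {
    // job_id -> (owner pod_id, last heartbeat)
    claims: HashMap<String, (String, Instant)>,
}

impl JobStore {
    fn new() -> Self {
        Self { claims: HashMap::new() }
    }

    /// A worker claims a job under its pod_id.
    fn claim(&mut self, job: &str, pod: &str) {
        self.claims.insert(job.into(), (pod.into(), Instant::now()));
    }

    /// HeartbeatManager analogue: refresh the lease on a claimed job.
    fn heartbeat(&mut self, job: &str) {
        if let Some(entry) = self.claims.get_mut(job) {
            entry.1 = Instant::now();
        }
    }

    /// ZombieReaper analogue: release claims whose lease expired,
    /// returning the reclaimed job ids so they can be re-queued.
    fn reap(&mut self, timeout: Duration) -> Vec<String> {
        let now = Instant::now();
        let dead: Vec<String> = self
            .claims
            .iter()
            .filter(|(_, (_, seen))| now.duration_since(*seen) > timeout)
            .map(|(job, _)| job.clone())
            .collect();
        for job in &dead {
            self.claims.remove(job); // job becomes claimable again
        }
        dead
    }
}
```

A real deployment would do the claim atomically (e.g. a transactional compare-and-set in TiKV) so two pods cannot own the same job; the sketch only shows the lease bookkeeping.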

Key Patterns

| Kubernetes concept | Roboflow equivalent |
|---|---|
| Pod | Worker with `pod_id` |
| etcd | TiKV distributed store |
| kubelet heartbeat | `HeartbeatManager` |
| node-controller | `ZombieReaper` |
| Finalizers | Finalizer controller |
| Job/CronJob | `JobRecord`, `BatchSpec` |
| State machine | `BatchPhase` (Pending → Discovering → Running → Merging → Complete/Failed) |
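The BatchPhase progression can be sketched as a Rust enum with a transition check. The phase names come from the table above; the exact transition rules are an assumption read off the arrow diagram (any active phase may fail), not Roboflow's actual implementation.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum BatchPhase {
    Pending,
    Discovering,
    Running,
    Merging,
    Complete,
    Failed,
}

impl BatchPhase {
    /// Legal forward transitions; any non-terminal phase may also fail.
    fn can_transition_to(self, next: BatchPhase) -> bool {
        use BatchPhase::*;
        matches!(
            (self, next),
            (Pending, Discovering)
                | (Discovering, Running)
                | (Running, Merging)
                | (Merging, Complete)
                | (Pending | Discovering | Running | Merging, Failed)
        )
    }
}
```

Guarding writes to the shared store with a check like this keeps a slow or restarted controller from moving a batch backwards (e.g. Complete → Running).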

Workspace Structure

| Crate | Purpose |
|---|---|
| `roboflow-core` | Error types, registry, values |
| `roboflow-storage` | S3, OSS, and local storage (always available) |
| `roboflow-dataset` | KPS, LeRobot, streaming converters |
| `roboflow-distributed` | TiKV client, catalog, controllers |
| `roboflow-hdf5` | Optional HDF5 format support |
| `roboflow-pipeline` | Hyper pipeline, compression stages |

Quick Start

Submit a Conversion Job

roboflow submit \
  --input s3://bucket/input.bag \
  --output s3://bucket/output/ \
  --config lerobot_config.toml

Run a Worker

export TIKV_PD_ENDPOINTS="127.0.0.1:2379"
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"

roboflow worker

Run a Scanner

export SCANNER_INPUT_PREFIX="s3://bucket/input/"
export SCANNER_OUTPUT_PREFIX="s3://bucket/jobs/"

roboflow scanner

List Jobs

roboflow jobs list
roboflow jobs get <job-id>
roboflow jobs retry <job-id>

Installation

From Source

git clone https://github.com/archebase/roboflow.git
cd roboflow
cargo build --release

Requirements

  • Rust 1.80+
  • TiKV 4.0+ (for distributed coordination)
  • ffmpeg (for video encoding in LeRobot datasets)

Configuration

LeRobot Dataset Config (lerobot_config.toml)

[dataset]
name = "my_dataset"
fps = 30
robot_type = "stretch"

[[mapping]]
topic = "/camera/image_raw"
name = "observation.images.camera_0"
encoding = "ros1msg"

[[mapping]]
topic = "/joint_states"
name = "observation.joint_state"
encoding = "cdr"

Environment Variables

| Variable | Description | Default |
|---|---|---|
| `TIKV_PD_ENDPOINTS` | TiKV PD endpoints | `127.0.0.1:2379` |
| `AWS_ACCESS_KEY_ID` | AWS access key | - |
| `AWS_SECRET_ACCESS_KEY` | AWS secret key | - |
| `AWS_REGION` | AWS region | - |
| `OSS_ACCESS_KEY_ID` | Alibaba OSS key | - |
| `OSS_ACCESS_KEY_SECRET` | Alibaba OSS secret | - |
| `OSS_ENDPOINT` | Alibaba OSS endpoint | - |
| `WORKER_POLL_INTERVAL_SECS` | Job poll interval (seconds) | `5` |
| `WORKER_MAX_CONCURRENT_JOBS` | Max concurrent jobs per worker | `1` |
| `SCANNER_SCAN_INTERVAL_SECS` | Scan interval (seconds) | `60` |

Development

Build

cargo build

Test

cargo test

Format & Lint

cargo fmt
cargo clippy --all-targets -- -D warnings

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

This project is licensed under MulanPSL-2.0; see the LICENSE file for details.

Related Projects

  • robocodec - I/O, codecs, arena allocation
  • LeRobot - Robotics learning datasets
  • TiKV - Distributed transaction KV store
