ya-dataflow

ya-dataflow is a modern, data-centric AI framework designed to orchestrate complex, large-scale workflows for Large Language Models (LLMs) and Vision-Language Models (VLMs).

It provides a unified, modular, and highly scalable architecture to manage the entire lifecycle of AI data processing: from raw data ingestion and sophisticated multi-modal reasoning to high-throughput trajectory synthesis and large-scale evaluation.


🌟 Core Philosophy: Data-Centric AI

In the era of LLMs, the bottleneck is no longer just model capacity, but the quality and scale of the data pipelines that feed them. ya-dataflow is built to address this by treating data as a first-class citizen, providing a robust engine to transform raw, unstructured data into high-quality, structured datasets for training, fine-tuning, and evaluation.


✨ Key Capabilities

๐Ÿ—๏ธ Modular Operator Architecture

At the heart of ya-dataflow is a highly extensible operator-based design. Every task, whether a simple text-cleaning step, a complex RAG retrieval, or a sophisticated VLM reasoning process, is implemented as a standardized Operator. This allows for seamless composition of complex pipelines.
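The operator contract can be sketched in plain Python. The `Operator` base class, the `run` signature, and the two example operators below are simplified illustrations of the composition idea, not the framework's actual interface:

```python
class Operator:
    """Illustrative stand-in for the standardized Operator contract."""
    def run(self, records, input_key, output_key):
        raise NotImplementedError

class StripOperator(Operator):
    """Removes leading/trailing whitespace from the input field."""
    def run(self, records, input_key, output_key):
        for r in records:
            r[output_key] = r[input_key].strip()
        return records

class LowercaseOperator(Operator):
    """Lowercases the input field."""
    def run(self, records, input_key, output_key):
        for r in records:
            r[output_key] = r[input_key].lower()
        return records

# Because every operator shares the same contract, steps compose freely:
data = [{"raw_text": "  Hello WORLD  "}]
data = StripOperator().run(data, "raw_text", "stripped")
data = LowercaseOperator().run(data, "stripped", "cleaned_text")
```

Each step reads one field and writes another, which is what lets arbitrary operators chain into a pipeline.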

🚀 Scalable Pipeline Execution

Designed for massive workloads, ya-dataflow supports advanced execution strategies like PartitionPipelineParallelRun. It enables high-throughput, parallel processing across distributed environments, making it capable of handling billions of tokens or millions of multi-modal assets.

🧪 High-Fidelity Synthetic Data Generation

Leveraging a powerful binary and structured file generation system, ya-dataflow can synthesize diverse datasets, including JSON, XML, Markdown, HTML, and complex binary formats like PDF, XLSX, and PPTX, to create robust testing and training environments.
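The idea for the text-based formats can be illustrated with the standard library alone; the framework's own generator system, and its binary PDF/XLSX/PPTX support, are not shown here:

```python
import json
import xml.etree.ElementTree as ET

record = {"id": 1, "title": "Quarterly Report", "pages": 12}

# JSON: serialize the record directly
json_doc = json.dumps(record, indent=2)

# XML: one child element per field
root = ET.Element("document")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_doc = ET.tostring(root, encoding="unicode")

# Markdown: a heading plus a bullet list
md_doc = f"# {record['title']}\n\n- id: {record['id']}\n- pages: {record['pages']}\n"
```

The same record fans out into several serializations, which is the essence of synthesizing format-diverse test data.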

🤖 Agentic Ecosystem Integration

Deeply integrated with the OpenClaw and Nanobot ecosystems. Through CLIOpenClawServing and NanobotServing, your data pipelines can be dynamically invoked and orchestrated by autonomous agents.


🚀 Why ya-dataflow? (Advanced Features)

While many frameworks handle simple tasks, ya-dataflow is engineered for production-grade, massive-scale AI data engineering.

💎 Enterprise-Grade Data Management

  • Cloud-Native Storage: Native, seamless integration with S3 and other cloud storage providers.
  • Read-Write Separation: Decoupled DataSource and Storage layers for maximum flexibility and control.
  • Intelligent Caching: Built-in caching mechanisms (CacheStorage) to optimize I/O and accelerate repetitive workloads.
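The caching idea can be sketched as a content-addressed memo table. This toy is only an illustration of the concept; FileCacheStorage's real behavior (on-disk persistence, partition awareness) is richer:

```python
import hashlib
import json

class ToyCache:
    """Toy in-memory cache keyed by a hash of the input record."""
    def __init__(self):
        self._store = {}

    def _key(self, record):
        # Stable key: serialize with sorted keys, then hash
        blob = json.dumps(record, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def get_or_compute(self, record, fn):
        key = self._key(record)
        if key not in self._store:  # expensive call runs only on a miss
            self._store[key] = fn(record)
        return self._store[key]

calls = []
def expensive(rec):
    calls.append(rec["x"])
    return rec["x"] + 1

cache = ToyCache()
first = cache.get_or_compute({"x": 1}, expensive)   # miss: computes
second = cache.get_or_compute({"x": 1}, expensive)  # hit: served from cache
```

Hashing the full record (rather than an ID) means any change to the input automatically invalidates the cached result.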

💎 Production-Ready Reliability

  • Checkpointing & Resumption: Built-in support for checkpointing. If a massive job (running for days) is interrupted, it can resume exactly from where it left off.
  • Fine-Grained Parallelism: Beyond simple task parallelism, our Partition-level parallelism allows you to scale throughput by splitting datasets into granular work units.
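The partition idea, reduced to its essence with the standard library (the real PartitionPipelineParallelRun adds checkpointing and distributed execution on top of this):

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_partitions):
    """Round-robin split into n roughly equal work units."""
    return [rows[i::n_partitions] for i in range(n_partitions)]

def process(part):
    # Stand-in for an operator pass over one partition
    return [r * 2 for r in part]

rows = list(range(100))
parts = partition(rows, 10)

# Each partition is an independent unit of work, so they can run in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, parts))

merged = [item for part in results for item in part]
```

Because partitions are independent, throughput scales by raising the partition count and worker pool size rather than restructuring the pipeline.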

💎 Seamless Agentic Connectivity

  • OpenClaw Powered: Use CLIOpenClawServing to bridge the gap between agentic reasoning and heavy-duty data pipelines.
  • Nanobot Ready: Integrated with NanobotServing for lightweight, high-performance serving within the Nanobot SDK ecosystem.

🚀 Getting Started

1. Installation

Install the core framework:

pip install ya-dataflow

Install with specialized capabilities via extras:

# For RAG workflows
pip install ya-dataflow[rag]

# For Multimodal (VLM) and PDF processing
pip install ya-dataflow[pdf2vqa]

# For LLM serving and high-performance evaluation
pip install ya-dataflow[vllm,eval]

# For Code and Math reasoning tasks
pip install ya-dataflow[code,reasoning]

2. Basic Usage (Python API)

In production, you define a pipeline by inheriting from PartitionPipelineParallelRun and implementing the forward method to orchestrate your operators.

from dataflow.pipeline import PartitionPipelineParallelRun
from dataflow.operators.core_text import TextCleaningOperator
from dataflow.utils.storage import FileDataSource, FileStorage, FileCacheStorage
from dataflow.serving.api_llm_serving_request import APILLMServing_request

class MyProductionPipeline(PartitionPipelineParallelRun):
    def __init__(self, source: FileDataSource, storage: FileStorage, llm_serving: APILLMServing_request):
        # 1. Initialize CacheStorage (Crucial: cannot be None)
        cache_storage = FileCacheStorage(cache_path="./cache")
        
        # 2. Initialize base class with cache_storage and explicit partitions
        super().__init__(cache_storage=cache_storage, partitions=10)
        
        self.storage = storage
        self.llm_serving = llm_serving
        
        # Define Operators
        self.clean_op = TextCleaningOperator(self.llm_serving)
        # SomeRefineOperator is a placeholder; substitute any refine-style operator
        self.refine_op = SomeRefineOperator(self.llm_serving)

    def forward(self):
        # Step 1: Clean the raw text
        # .step() retrieves the current partition's data
        self.clean_op.run(
            self.storage.step(),
            input_key="raw_text",
            output_key="cleaned_text"
        )

        # Step 2: Refine based on cleaned text (Dependency: cleaned_text)
        self.refine_op.run(
            self.storage.step(),
            input_key="raw_text",
            output_key="final_result",
            input_prev_1="cleaned_text" 
        )

# Usage
source = FileDataSource(paths=["./input.jsonl"])
storage = FileStorage(data_source=source, id_key="id", cache_path="./cache")
llm = APILLMServing_request(api_url="...", model_name="...")

pipeline = MyProductionPipeline(source, storage, llm)
pipeline.compile()
pipeline.run()

3. New: Generator Data Sources (v1.0.8+)

Generate data on-the-fly without relying on pre-existing files.

GeneratorDataSource - Base Data + LLM Enhancement

Use when you have base data and want to enhance it with LLM-generated fields:

from dataflow.utils.storage import GeneratorDataSource, FileCacheStorage
from dataflow.serving.agent.cli_openclaw_serving import CLIOpenClawServing

# Define your base data generator
def task_generator():
    """Yield base task data"""
    for i in range(1000):
        yield {
            "index": i,
            "scene": "search" if i % 2 == 0 else "analysis",
            "keywords": "็‰นๆ–ฏๆ‹‰" if i % 2 == 0 else "่ดขๅŠกๆ•ฐๆฎ"
        }

# Create data source with LLM enhancement
source = GeneratorDataSource(
    generator_fn=task_generator,
    total_rows=1000,
    name="enhanced_tasks",
    serving=CLIOpenClawServing(agent_id="main"),
    prompt_templates={
        "question": "Based on scene {scene} and keywords {keywords}, generate a realistic skill-usage question. Return JSON: {{\"question\": \"...\"}}",
        "target_skills": "Based on scene {scene}, choose 2-3 suitable skills. Return JSON: {{\"target_skills\": [...]}}",
    },
    fields_from_base=["index", "scene", "keywords"],
)

# Read data (LLM fields are generated on-the-fly)
for row in source.read(chunk_size=32):
    print(row)  # Contains: index, scene, keywords, question, target_skills

LLMGeneratorDataSource - Pure LLM Generation

Use when you want LLM to generate all data from scratch:

from dataflow.utils.storage import LLMGeneratorDataSource

# Pure LLM generation - no base data needed
source = LLMGeneratorDataSource(
    serving=CLIOpenClawServing(agent_id="main"),
    prompts={
        "question": "Generate a realistic OpenClaw skill-usage question. Return JSON: {{\"question\": \"...\"}}",
        "target_skills": "Choose 2-3 suitable skills for this question. Return JSON: {{\"target_skills\": [...]}}",
        "difficulty": "Rate the question's difficulty (1-5). Return JSON: {{\"difficulty\": 3}}",
    },
    num_rows=10000,
    batch_size=32,
    name="llm_generated_tasks",
)

# Read generated data
for row in source.read(chunk_size=32):
    print(row)  # Contains: question, target_skills, difficulty

Using create_data_source Factory

from dataflow.utils.storage import create_data_source

# Generator data source
source = create_data_source(
    ["enhanced_tasks"],
    source_type="generator",
    generator_fn=task_generator,
    total_rows=1000,
    serving=CLIOpenClawServing(agent_id="main"),
    prompt_templates={
        "question": "Based on scene {scene}, generate a question",
    },
    fields_from_base=["index", "scene"],
)

# LLM generator data source
source = create_data_source(
    ["llm_tasks"],
    source_type="llm_generator",
    serving=CLIOpenClawServing(agent_id="main"),
    prompts={
        "question": "Generate a skill-usage question",
    },
    num_rows=5000,
)

4. Advanced: High-Scale S3 Pipeline with Resumption

from dataflow.pipeline import PartitionPipelineParallelRun
from dataflow.utils.storage import S3DataSource, S3Storage, S3CacheStorage

# Configure massive S3-based workflow with checkpointing
source = S3DataSource(
    endpoint="https://s3.example.com",
    ak="YOUR_AK", sk="YOUR_SK",
    s3_paths=["s3://my-bucket/massive-dataset/"],
)

storage = S3Storage(
    data_source=source,
    id_key="task_id",
    cache_path="./local_cache",
    cache_type="jsonl"
)

# Enable checkpointing via CacheStorage
progress_storage = S3CacheStorage(
    endpoint="https://s3.example.com",
    ak="YOUR_AK", sk="YOUR_SK",
    cache_file="s3://my-bucket/checkpoints/pipeline_v1.json"
)

pipeline = PartitionPipelineParallelRun(
    steps=[...],
    data_source=source,
    storage=storage,
    cache_storage=progress_storage,
    partitions=1000, # Scale to thousands of partitions
    max_parallelism=32
)

# Run with automatic resumption
pipeline.run(resume_from_last=True)

📂 Project Structure

dataflow/
├── core/               # Core engine, registry, and base abstractions
├── operators/          # Extensive library of built-in operators
│   ├── core_text/      # Text processing, cleaning, and extraction
│   ├── core_vision/    # VLM and image reasoning
│   ├── code/           # Code synthesis and execution
│   ├── reasoning/      # Math and logical reasoning
│   └── ...             # Specialized domains (RAG, PDF2VQA, etc.)
├── pipeline/           # Pipeline orchestration and execution logic
├── serving/            # LLM/VLM serving integrations (vLLM, OpenAI, etc.)
├── utils/              # Storage, registry, and utility helpers
└── ...

๐Ÿค Contributing

ya-dataflow is an evolving ecosystem. We welcome contributions from the community to expand its operator library and performance capabilities. Please visit the main repository for contribution guidelines.

GitHub Repository

📄 License

This project is licensed under the Apache-2.0 license.
