ya-dataflow

ya-dataflow is a modern, data-centric AI framework designed to orchestrate complex, large-scale workflows for Large Language Models (LLMs) and Vision-Language Models (VLMs).

It provides a unified, modular, and highly scalable architecture to manage the entire lifecycle of AI data processing: from raw data ingestion and sophisticated multi-modal reasoning to high-throughput trajectory synthesis and large-scale evaluation.


🌟 Core Philosophy: Data-Centric AI

In the era of LLMs, the bottleneck is no longer just model capacity, but the quality and scale of the data pipelines that feed them. ya-dataflow is built to address this by treating data as a first-class citizen, providing a robust engine to transform raw, unstructured data into high-quality, structured datasets for training, fine-tuning, and evaluation.


✨ Key Capabilities

๐Ÿ—๏ธ Modular Operator Architecture

At the heart of ya-dataflow is a highly extensible operator-based design. Every task, whether a simple text-cleaning step, a complex RAG retrieval, or a sophisticated VLM reasoning process, is implemented as a standardized Operator. This allows for seamless composition of complex pipelines.
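The operator contract can be sketched in plain Python. The `Operator` base class, the `run` signature, and the two example operators below are simplified illustrations of the composition idea, not the framework's actual interface:

```python
class Operator:
    """Illustrative stand-in for the standardized Operator contract."""
    def run(self, records, input_key, output_key):
        raise NotImplementedError

class StripOperator(Operator):
    """Removes leading/trailing whitespace from the input field."""
    def run(self, records, input_key, output_key):
        for r in records:
            r[output_key] = r[input_key].strip()
        return records

class LowercaseOperator(Operator):
    """Lowercases the input field."""
    def run(self, records, input_key, output_key):
        for r in records:
            r[output_key] = r[input_key].lower()
        return records

# Because every operator shares the same contract, steps compose freely:
data = [{"raw_text": "  Hello WORLD  "}]
data = StripOperator().run(data, "raw_text", "stripped")
data = LowercaseOperator().run(data, "stripped", "cleaned_text")
```

Each step reads one field and writes another, which is what lets arbitrary operators chain into a pipeline.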

🚀 Scalable Pipeline Execution

Designed for massive workloads, ya-dataflow supports advanced execution strategies like PartitionPipelineParallelRun. It enables high-throughput, parallel processing across distributed environments, making it capable of handling billions of tokens or millions of multi-modal assets.

🧪 High-Fidelity Synthetic Data Generation

Leveraging a powerful binary and structured file generation system, ya-dataflow can synthesize diverse datasets, including JSON, XML, Markdown, HTML, and complex binary formats like PDF, XLSX, and PPTX, to create robust testing and training environments.
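The idea for the text-based formats can be illustrated with the standard library alone; the framework's own generator system, and its binary PDF/XLSX/PPTX support, are not shown here:

```python
import json
import xml.etree.ElementTree as ET

record = {"id": 1, "title": "Quarterly Report", "pages": 12}

# JSON: serialize the record directly
json_doc = json.dumps(record, indent=2)

# XML: one child element per field
root = ET.Element("document")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_doc = ET.tostring(root, encoding="unicode")

# Markdown: a heading plus a bullet list
md_doc = f"# {record['title']}\n\n- id: {record['id']}\n- pages: {record['pages']}\n"
```

The same record fans out into several serializations, which is the essence of synthesizing format-diverse test data.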

🤖 Agentic Ecosystem Integration

Deeply integrated with the OpenClaw and Nanobot ecosystems. Through CLIOpenClawServing and NanobotServing, your data pipelines can be dynamically invoked and orchestrated by autonomous agents.


🚀 Why ya-dataflow? (Advanced Features)

While many frameworks handle simple tasks, ya-dataflow is engineered for production-grade, massive-scale AI data engineering.

💎 Enterprise-Grade Data Management

  • Cloud-Native Storage: Native, seamless integration with S3 and other cloud storage providers.
  • Read-Write Separation: Decoupled DataSource and Storage layers for maximum flexibility and control.
  • Intelligent Caching: Built-in caching mechanisms (CacheStorage) to optimize I/O and accelerate repetitive workloads.
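The caching idea can be sketched as a content-addressed memo table. This toy is only an illustration of the concept; FileCacheStorage's real behavior (on-disk persistence, partition awareness) is richer:

```python
import hashlib
import json

class ToyCache:
    """Toy in-memory cache keyed by a hash of the input record."""
    def __init__(self):
        self._store = {}

    def _key(self, record):
        # Stable key: serialize with sorted keys, then hash
        blob = json.dumps(record, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def get_or_compute(self, record, fn):
        key = self._key(record)
        if key not in self._store:  # expensive call runs only on a miss
            self._store[key] = fn(record)
        return self._store[key]

calls = []
def expensive(rec):
    calls.append(rec["x"])
    return rec["x"] + 1

cache = ToyCache()
first = cache.get_or_compute({"x": 1}, expensive)   # miss: computes
second = cache.get_or_compute({"x": 1}, expensive)  # hit: served from cache
```

Hashing the full record (rather than an ID) means any change to the input automatically invalidates the cached result.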

💎 Production-Ready Reliability

  • Checkpointing & Resumption: Built-in support for checkpointing. If a massive job (running for days) is interrupted, it can resume exactly from where it left off.
  • Fine-Grained Parallelism: Beyond simple task parallelism, our Partition-level parallelism allows you to scale throughput by splitting datasets into granular work units.
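The partition idea, reduced to its essence with the standard library (the real PartitionPipelineParallelRun adds checkpointing and distributed execution on top of this):

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_partitions):
    """Round-robin split into n roughly equal work units."""
    return [rows[i::n_partitions] for i in range(n_partitions)]

def process(part):
    # Stand-in for an operator pass over one partition
    return [r * 2 for r in part]

rows = list(range(100))
parts = partition(rows, 10)

# Each partition is an independent unit of work, so they can run in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, parts))

merged = [item for part in results for item in part]
```

Because partitions are independent, throughput scales by raising the partition count and worker pool size rather than restructuring the pipeline.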

💎 Seamless Agentic Connectivity

  • OpenClaw Powered: Use CLIOpenClawServing to bridge the gap between agentic reasoning and heavy-duty data pipelines.
  • Nanobot Ready: Integrated with NanobotServing for lightweight, high-performance serving within the Nanobot SDK ecosystem.

🚀 Getting Started

1. Installation

Install the core framework:

pip install ya-dataflow

Install with specialized capabilities via extras:

# For RAG workflows
pip install ya-dataflow[rag]

# For Multimodal (VLM) and PDF processing
pip install ya-dataflow[pdf2vqa]

# For LLM serving and high-performance evaluation
pip install ya-dataflow[vllm,eval]

# For Code and Math reasoning tasks
pip install ya-dataflow[code,reasoning]

2. Basic Usage (Python API)

In production, you define a pipeline by inheriting from PartitionPipelineParallelRun and implementing the forward method to orchestrate your operators.

from dataflow.pipeline import PartitionPipelineParallelRun
from dataflow.operators.core_text import TextCleaningOperator
from dataflow.utils.storage import FileDataSource, FileStorage, FileCacheStorage
from dataflow.serving.api_llm_serving_request import APILLMServing_request

class MyProductionPipeline(PartitionPipelineParallelRun):
    def __init__(self, source: FileDataSource, storage: FileStorage, llm_serving: APILLMServing_request):
        # 1. Initialize CacheStorage (Crucial: cannot be None)
        cache_storage = FileCacheStorage(cache_path="./cache")
        
        # 2. Initialize base class with cache_storage and explicit partitions
        super().__init__(cache_storage=cache_storage, partitions=10)
        
        self.storage = storage
        self.llm_serving = llm_serving
        
        # Define Operators
        self.clean_op = TextCleaningOperator(self.llm_serving)
        # SomeRefineOperator is a placeholder; substitute any refine-style operator
        self.refine_op = SomeRefineOperator(self.llm_serving)

    def forward(self):
        # Step 1: Clean the raw text
        # .step() retrieves the current partition's data
        self.clean_op.run(
            self.storage.step(),
            input_key="raw_text",
            output_key="cleaned_text"
        )

        # Step 2: Refine based on cleaned text (Dependency: cleaned_text)
        self.refine_op.run(
            self.storage.step(),
            input_key="raw_text",
            output_key="final_result",
            input_prev_1="cleaned_text" 
        )

# Usage
source = FileDataSource(paths=["./input.jsonl"])
storage = FileStorage(data_source=source, id_key="id", cache_path="./cache")
llm = APILLMServing_request(api_url="...", model_name="...")

pipeline = MyProductionPipeline(source, storage, llm)
pipeline.compile()
pipeline.run()

3. New: Generator Data Sources (v1.0.8+)

Generate data on-the-fly without relying on pre-existing files.

GeneratorDataSource - Base Data + LLM Enhancement

Use when you have base data and want to enhance it with LLM-generated fields:

from dataflow.utils.storage import GeneratorDataSource, FileCacheStorage
from dataflow.serving.agent.cli_openclaw_serving import CLIOpenClawServing

# Define your base data generator
def task_generator():
    """Yield base task data"""
    for i in range(1000):
        yield {
            "index": i,
            "scene": "search" if i % 2 == 0 else "analysis",
            "keywords": "็‰นๆ–ฏๆ‹‰" if i % 2 == 0 else "่ดขๅŠกๆ•ฐๆฎ"
        }

# Create data source with LLM enhancement
source = GeneratorDataSource(
    generator_fn=task_generator,
    total_rows=1000,
    name="enhanced_tasks",
    serving=CLIOpenClawServing(agent_id="main"),
    prompt_templates={
        "question": "Based on scene {scene} and keywords {keywords}, generate a realistic skill-usage question. Return JSON: {{\"question\": \"...\"}}",
        "target_skills": "Based on scene {scene}, choose 2-3 suitable skills. Return JSON: {{\"target_skills\": [...]}}",
    },
    fields_from_base=["index", "scene", "keywords"],
)

# Read data (LLM fields are generated on-the-fly)
for row in source.read(chunk_size=32):
    print(row)  # Contains: index, scene, keywords, question, target_skills

LLMGeneratorDataSource - Pure LLM Generation

Use when you want LLM to generate all data from scratch:

from dataflow.utils.storage import LLMGeneratorDataSource

# Pure LLM generation - no base data needed
source = LLMGeneratorDataSource(
    serving=CLIOpenClawServing(agent_id="main"),
    prompts={
        "question": "Generate a realistic OpenClaw skill-usage question. Return JSON: {{\"question\": \"...\"}}",
        "target_skills": "Choose 2-3 suitable skills for this question. Return JSON: {{\"target_skills\": [...]}}",
        "difficulty": "Rate the question's difficulty (1-5). Return JSON: {{\"difficulty\": 3}}",
    },
    num_rows=10000,
    batch_size=32,
    name="llm_generated_tasks",
)

# Read generated data
for row in source.read(chunk_size=32):
    print(row)  # Contains: question, target_skills, difficulty

Using create_data_source Factory

from dataflow.utils.storage import create_data_source

# Generator data source
source = create_data_source(
    ["enhanced_tasks"],
    source_type="generator",
    generator_fn=task_generator,
    total_rows=1000,
    serving=CLIOpenClawServing(agent_id="main"),
    prompt_templates={
        "question": "Based on scene {scene}, generate a question",
    },
    fields_from_base=["index", "scene"],
)

# LLM generator data source
source = create_data_source(
    ["llm_tasks"],
    source_type="llm_generator",
    serving=CLIOpenClawServing(agent_id="main"),
    prompts={
        "question": "Generate a skill-usage question",
    },
    num_rows=5000,
)

4. Advanced: High-Scale S3 Pipeline with Resumption

from dataflow.pipeline import PartitionPipelineParallelRun
from dataflow.utils.storage import S3DataSource, S3Storage, S3CacheStorage

# Configure massive S3-based workflow with checkpointing
source = S3DataSource(
    endpoint="https://s3.example.com",
    ak="YOUR_AK", sk="YOUR_SK",
    s3_paths=["s3://my-bucket/massive-dataset/"],
)

storage = S3Storage(
    data_source=source,
    id_key="task_id",
    cache_path="./local_cache",
    cache_type="jsonl"
)

# Enable checkpointing via CacheStorage
progress_storage = S3CacheStorage(
    endpoint="https://s3.example.com",
    ak="YOUR_AK", sk="YOUR_SK",
    cache_file="s3://my-bucket/checkpoints/pipeline_v1.json"
)

pipeline = PartitionPipelineParallelRun(
    steps=[...],
    data_source=source,
    storage=storage,
    cache_storage=progress_storage,
    partitions=1000, # Scale to thousands of partitions
    max_parallelism=32
)

# Run with automatic resumption
pipeline.run(resume_from_last=True)

📂 Project Structure

dataflow/
├── core/               # Core engine, registry, and base abstractions
├── operators/          # Extensive library of built-in operators
│   ├── core_text/      # Text processing, cleaning, and extraction
│   ├── core_vision/    # VLM and image reasoning
│   ├── code/           # Code synthesis and execution
│   ├── reasoning/      # Math and logical reasoning
│   └── ...             # Specialized domains (RAG, PDF2VQA, etc.)
├── pipeline/           # Pipeline orchestration and execution logic
├── serving/            # LLM/VLM serving integrations (vLLM, OpenAI, etc.)
├── utils/              # Storage, registry, and utility helpers
└── ...

๐Ÿค Contributing

ya-dataflow is an evolving ecosystem. We welcome contributions from the community to expand its operator library and performance capabilities. Please visit the main repository for contribution guidelines.

GitHub Repository

📄 License

This project is licensed under the Apache-2.0 license.
