paiml/reprorusted-python-cli

LLM Bootstrap to ML Oracle

reprorusted-python-cli

Compiler-in-the-Loop Training Corpus for Python→Rust Transpilation

The Core Idea

Use LLMs to bootstrap traditional ML, not as a runtime dependency.

This corpus captures decision traces from autonomous LLM sessions, persists them to .apr format, and trains a local ML oracle. After N sessions, the oracle handles common cases without API calls.

| Phase | Model | Cost |
|-------|-------|------|
| Bootstrap | LLM (Claude/GPT) | $$/hour |
| Capture | Decision traces → .apr | One-time |
| Steady-state | HNSW + Tarantula | ~$0 |
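
The bootstrap-then-steady-state flow can be sketched as a local lookup with a fallback. This is a minimal illustration, not depyler's implementation: `PatternOracle`, `suggest_fix`, and the `llm_fallback` callable are hypothetical names, and an exact-match dictionary keyed on the rustc error code stands in for the HNSW index and Tarantula ranking.

```python
from typing import Callable, Dict


class PatternOracle:
    """Toy oracle: remembers one fix per error code (hypothetical design)."""

    def __init__(self, llm_fallback: Callable[[str, str], str]) -> None:
        self._patterns: Dict[str, str] = {}   # error code -> known fix
        self._llm_fallback = llm_fallback     # costly path, used only on a miss
        self.llm_calls = 0

    def suggest_fix(self, code: str, message: str) -> str:
        fix = self._patterns.get(code)
        if fix is not None:
            return fix                        # steady state: no API call
        self.llm_calls += 1                   # bootstrap: ask the LLM once
        fix = self._llm_fallback(code, message)
        self._patterns[code] = fix            # capture the decision for reuse
        return fix


def fake_llm(code: str, message: str) -> str:
    # Stand-in for a real LLM call; returns a canned suggestion.
    return f"suggested fix for {code}"


oracle = PatternOracle(fake_llm)
first = oracle.suggest_fix("E0308", "mismatched types")
second = oracle.suggest_fix("E0308", "mismatched types")
print(first, second, oracle.llm_calls)
```

The second call is served from the local store, so the stand-in LLM is invoked only once.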

What This Repository Contains

298 Python CLI examples with 6,745 tests, designed for depyler transpiler training:

| Metric | Value |
|--------|-------|
| Python Examples | 298 |
| Test Coverage | 6,745 passing |
| Transpilation Rate | 78.5% (476/606) |
| Clippy Clean | 0.4% (progressive) |
| Golden Traces | 50 patterns |

The remaining compilation failures are training signal—each error becomes an (error, fix) pattern for the oracle.
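
As a quick check of the transpilation figures, the rate and the number of remaining failures follow directly from the two counts reported above. The variable names are illustrative only.

```python
transpiled, total = 476, 606          # counts from the Transpilation Rate row

rate = transpiled / total             # fraction that transpiled
failures = total - transpiled         # attempts left over as training signal

summary = f"{rate:.1%} transpiled, {failures} failures"
print(summary)
```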

Quick Start

git clone https://github.com/paiml/reprorusted-python-cli.git
cd reprorusted-python-cli

make install    # Install dependencies
make test       # Validate corpus (6745 tests)
make citl-train # Train oracle from diagnostics

Corpus Extraction

Extract additional training data from CPython stdlib doctests:

make extract-cpython-doctests  # Requires alimentar

This extracts ~1,700 doctests to data/corpora/cpython-doctests.parquet. See docs/corpus-extraction.md for full reproducibility details.

CITL Training Loop

Python Corpus → Depyler → Rust Code → rustc → Diagnostics → Oracle
     ↑                                              │
     └──────────── patterns.apr ◄──────────────────┘

Each transpilation attempt generates compiler diagnostics. These accumulate in .apr format, training the oracle to suggest fixes for future errors.
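
One way the diagnostics step could look in Python is sketched below. How depyler actually collects diagnostics is not shown in this README; the sketch assumes rustc is run with `--error-format=json`, which emits one JSON object per line, and `parse_diagnostics` is a hypothetical helper name.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def parse_diagnostics(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (error_code, message) pairs from rustc JSON diagnostic lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue                          # skip anything that is not JSON
        diag = json.loads(line)
        code = diag.get("code")               # null when a diagnostic has no code
        if diag.get("level") == "error" and code:
            pairs.append((code["code"], diag["message"]))
    return pairs


# Two hand-written lines in the shape rustc emits; not real compiler output.
sample = [
    '{"message": "mismatched types", "code": {"code": "E0308"}, "level": "error"}',
    '{"message": "unused variable: `x`", "code": {"code": "unused_variables"}, "level": "warning"}',
]
pairs = parse_diagnostics(sample)
counts = Counter(code for code, _ in pairs)
print(pairs, counts)
```

Warnings are dropped here so that only compile errors feed the tally; a real pipeline might keep them for the clippy gate.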

Example Categories

| Category | Count | Description |
|----------|-------|-------------|
| Core CLI | 20+ | argparse, flags, subcommands |
| String Ops | 15+ | split, join, format, strip |
| Math | 20+ | arithmetic, statistics |
| NumPy | 18 | array ops, linear algebra |
| Sklearn | 10 | regression, clustering |
| PyTorch | 10 | tensors, autograd |
| Async | 5 | async/await patterns |
| File I/O | 10+ | pathlib, csv, json |
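
For orientation, a Core CLI example might resemble the following argparse script. It is written for illustration and is not taken from the repository; the `greet` program and its `--shout` flag are invented.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="greet", description="Print a greeting.")
    parser.add_argument("name", help="who to greet")
    parser.add_argument("--shout", action="store_true", help="uppercase the output")
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    # Passing argv explicitly keeps the function easy to test.
    args = build_parser().parse_args(argv)
    greeting = f"Hello, {args.name}!"
    return greeting.upper() if args.shout else greeting


print(main(["World"]))
print(main(["World", "--shout"]))
```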

Autonomous Session Results

We ran Claude Code unattended for 13 hours:

| Metric | Target | Actual |
|--------|--------|--------|
| Commits | 5 | 12 (240%) |
| Duration | 6 hrs | 13 hrs |
| Tickets | - | DEPYLER-0616→0627 |

The LLM fixed bugs, wrote tests, passed clippy, and committed—no human intervention.

Error Distribution

Top rustc errors from transpilation attempts:

| Code | Count | Issue |
|------|-------|-------|
| E0308 | 1,050 | Type mismatch |
| E0433 | 706 | Failed to resolve |
| E0599 | 543 | Method not found |
| E0425 | 392 | Cannot find value |
| E0277 | 380 | Trait bound not satisfied |

Each error type becomes training data for the oracle.
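
To see how the listed errors compare, the counts can be turned into shares. Note the assumption: the percentages below are relative to these five codes only, since the total across all error codes is not given here.

```python
# Counts copied from the error distribution above.
counts = {"E0308": 1050, "E0433": 706, "E0599": 543, "E0425": 392, "E0277": 380}

total = sum(counts.values())
shares = {code: n / total for code, n in counts.items()}

for code, share in shares.items():
    print(f"{code}: {share:.1%} of the top five")
```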

Quality Assurance (Toyota Way)

The corpus implements quality gates inspired by Lean/Toyota principles:

| Gate | Tool | Purpose |
|------|------|---------|
| Golden Traces | make corpus-golden-export | 50 human-verified fix patterns |
| Clippy Gate | make corpus-clippy-check | Idiomatic Rust verification |
| HITL Review | make corpus-hitl-sample | Quarterly expert review (5% sample) |

See docs/specifications/corpus-quality-review.md for the full Toyota Way design review.
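
A 5% review sample could be drawn along these lines. This is a sketch under assumptions, not the logic behind `make corpus-hitl-sample`: `hitl_sample` is a hypothetical helper, the fixed seed is only there to make the draw reproducible, and the example names are placeholders.

```python
import math
import random
from typing import List, Sequence


def hitl_sample(items: Sequence[str], fraction: float = 0.05, seed: int = 0) -> List[str]:
    """Pick a reproducible random subset for human review."""
    if not items:
        return []
    k = max(1, math.ceil(len(items) * fraction))   # review at least one item
    return random.Random(seed).sample(list(items), k)


examples = [f"example_{i:03d}" for i in range(298)]   # placeholder names
sample = hitl_sample(examples)
print(len(sample))
```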

Quick Commands

# Analyze golden trace candidates
make corpus-golden-analyze

# Run clippy quality gate
make corpus-clippy-check

# Generate HITL review sample
make corpus-hitl-sample

Project Structure

reprorusted-python-cli/
├── examples/           # 298 Python CLI examples
│   └── example_*/      # Individual examples with tests
├── docs/               # Specifications and diagrams
├── scripts/            # Automation
└── Makefile            # CITL commands

Integration

# Transpile single example
depyler transpile examples/example_simple/trivial_cli.py

# Train oracle from corpus
depyler oracle train --corpus ./examples

# Export for downstream ML
depyler oracle export-oip --output ./citl.jsonl
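
The exported file can then be read from Python as JSON Lines. The record schema is not documented in this README, so the sketch below only parses each line into a dict; `read_jsonl` is a hypothetical helper and the demo records are made up.

```python
import json
from typing import Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per non-empty JSON Lines record."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


# With a real export, pass an open file handle for ./citl.jsonl instead.
demo = ['{"id": 1}', "", '{"id": 2}']   # made-up records, not the real schema
records = list(read_jsonl(demo))
print(len(records))
```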

Related Projects

| Project | Role |
|---------|------|
| depyler | Python→Rust transpiler |
| aprender | ML library, .apr format |
| entrenar | CITL pattern storage |
| alimentar | Dataset loading & publishing |
| renacer | Decision trace ingestion |

HuggingFace Dataset

This corpus is available on HuggingFace for ML training:

from datasets import load_dataset

ds = load_dataset("paiml/depyler-citl")

# 606 Python→Rust pairs, 436 with successful transpilation
for row in ds["train"]:
    print(f"{row['python_file']}: {row['python_lines']} → {row['rust_lines']} lines")

📦 Dataset: huggingface.co/datasets/paiml/depyler-citl
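
Once loaded, the rows can be summarized with plain Python. The sketch below uses only the `python_lines` and `rust_lines` columns that appear in the loop above and assumes both hold integers; `expansion_ratio` is a hypothetical helper, and the demo rows are invented so the snippet runs without downloading the dataset.

```python
from typing import Dict, Iterable


def expansion_ratio(rows: Iterable[Dict]) -> float:
    """Total Rust lines divided by total Python lines across rows."""
    py_total = rust_total = 0
    for row in rows:
        py_total += row["python_lines"]
        rust_total += row["rust_lines"]
    return rust_total / py_total if py_total else 0.0


# Made-up rows for illustration; pass ds["train"] to use the real dataset.
demo_rows = [
    {"python_file": "a.py", "python_lines": 10, "rust_lines": 25},
    {"python_file": "b.py", "python_lines": 30, "rust_lines": 35},
]
ratio = expansion_ratio(demo_rows)
print(ratio)
```

Rows where transpilation failed may need filtering first if their `rust_lines` value is missing.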

References

  1. Wang et al. (2022). Compilable Neural Code Generation with Compiler Feedback. ACL.
  2. Yasunaga & Liang (2020). Graph-based, Self-Supervised Program Repair from Diagnostic Feedback. ICML.
  3. Dou et al. (2024). StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback. arXiv.

License

MIT License - see LICENSE

@software{reprorusted_python_cli,
  title = {CITL Training Corpus for Depyler},
  author = {PAIML},
  year = {2025},
  url = {https://github.com/paiml/reprorusted-python-cli}
}

About

Converts Python Argparse CLI scripts to single-shot compiled Rust binary
