Compiler-in-the-Loop Training Corpus for Python→Rust Transpilation
Use LLMs to bootstrap traditional ML, not as a runtime dependency.
This corpus captures decision traces from autonomous LLM sessions, persists them to .apr format, and trains a local ML oracle. After N sessions, the oracle handles common cases without API calls.
| Phase | Model | Cost |
|---|---|---|
| Bootstrap | LLM (Claude/GPT) | $$/hour |
| Capture | Decision traces → .apr | One-time |
| Steady-state | HNSW + Tarantula | ~$0 |
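In steady state, the LLM call is replaced by local retrieval over the captured patterns. Below is a minimal sketch of that lookup, using brute-force cosine similarity as a stand-in for the HNSW index and hypothetical pattern data; this is illustrative only and not the actual aprender/entrenar API:

```python
import numpy as np

# Hypothetical (error, fix) patterns captured during the bootstrap phase.
patterns = [
    {"error": "E0308 mismatched types: expected `i64`, found `&str`",
     "fix": "parse the argument: `value.parse::<i64>()?`"},
    {"error": "E0599 no method named `append` found for struct `Vec`",
     "fix": "use `Vec::push` for single elements"},
]
# Stand-in embeddings; the real oracle would embed diagnostics consistently.
embeddings = np.random.default_rng(0).random((len(patterns), 64))

def suggest_fix(query: np.ndarray) -> str:
    """Return the fix attached to the nearest stored pattern (cosine similarity)."""
    sims = embeddings @ query / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))
    return patterns[int(np.argmax(sims))]["fix"]

print(suggest_fix(np.random.default_rng(1).random(64)))
```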
298 Python CLI examples with 6,745 tests, designed for depyler transpiler training:
| Metric | Value |
|---|---|
| Python Examples | 298 |
| Test Coverage | 6,745 passing |
| Transpilation Rate | 78.5% (476/606) |
| Clippy Clean | 0.4% (progressive) |
| Golden Traces | 50 patterns |
The remaining compilation failures are training signal—each error becomes an (error, fix) pattern for the oracle.
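As an illustration, one such pattern could be a record that pairs the diagnostic with the code delta that resolved it. The schema below is hypothetical, not the actual .apr layout:

```python
from dataclasses import dataclass

@dataclass
class FixPattern:
    """One (error, fix) training record; field names are illustrative."""
    error_code: str      # e.g. "E0308"
    error_message: str   # rendered rustc diagnostic
    before_snippet: str  # Rust emitted by the transpiler
    after_snippet: str   # Rust after the fix was applied
    source_example: str  # originating Python example

pattern = FixPattern(
    error_code="E0308",
    error_message="mismatched types: expected `i64`, found `&str`",
    before_snippet="let n: i64 = args.value;",
    after_snippet="let n: i64 = args.value.parse()?;",
    source_example="examples/example_simple/trivial_cli.py",
)
```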
```bash
git clone https://github.com/paiml/reprorusted-python-cli.git
cd reprorusted-python-cli
make install      # Install dependencies
make test         # Validate corpus (6,745 tests)
make citl-train   # Train oracle from diagnostics
```

Extract additional training data from CPython stdlib doctests:

```bash
make extract-cpython-doctests   # Requires alimentar
```

This extracts ~1,700 doctests to data/corpora/cpython-doctests.parquet. See docs/corpus-extraction.md for full reproducibility details.
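A quick way to inspect the extracted file (assuming pandas with parquet support is installed; the column layout is whatever alimentar emits, so the sketch prints it rather than assuming names):

```python
import pandas as pd

# Path produced by the extraction step above.
df = pd.read_parquet("data/corpora/cpython-doctests.parquet")
print(len(df), "rows")
print(list(df.columns))
print(df.head())
```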
```text
Python Corpus → Depyler → Rust Code → rustc → Diagnostics → Oracle
      ↑                                                        │
      └──────────────────── patterns.apr ◄─────────────────────┘
```
Each transpilation attempt generates compiler diagnostics. These accumulate in .apr format, training the oracle to suggest fixes for future errors.
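The diagnostic-collection half of that loop can be sketched in a few lines: compile each generated Rust file with rustc's `--error-format=json` flag and keep the coded errors. This is a simplified stand-in for the depyler/entrenar pipeline, and the paths are illustrative:

```python
import json
import subprocess
from pathlib import Path

def collect_diagnostics(rust_file: Path) -> list[dict]:
    """Compile one transpiled Rust file and return its coded rustc errors."""
    proc = subprocess.run(
        ["rustc", "--edition", "2021", "--error-format=json",
         str(rust_file), "-o", "/dev/null"],  # discard any binary output
        capture_output=True,
        text=True,
    )
    diagnostics = []
    for line in proc.stderr.splitlines():
        try:
            msg = json.loads(line)  # one JSON diagnostic per stderr line
        except json.JSONDecodeError:
            continue
        if msg.get("code"):  # keep coded errors such as E0308, E0433, ...
            diagnostics.append({"code": msg["code"]["code"], "message": msg["message"]})
    return diagnostics

# Example: accumulate diagnostics for every generated .rs file.
# all_diags = [d for rs in Path("examples").rglob("*.rs") for d in collect_diagnostics(rs)]
```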
| Category | Count | Description |
|---|---|---|
| Core CLI | 20+ | argparse, flags, subcommands |
| String Ops | 15+ | split, join, format, strip |
| Math | 20+ | arithmetic, statistics |
| NumPy | 18 | array ops, linear algebra |
| Sklearn | 10 | regression, clustering |
| PyTorch | 10 | tensors, autograd |
| Async | 5 | async/await patterns |
| File I/O | 10+ | pathlib, csv, json |
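To give a sense of what a corpus entry looks like, here is a minimal argparse CLI in the style of the Core CLI category above. It is an illustrative sketch, not an actual file from examples/:

```python
import argparse

def main() -> None:
    # Flags plus a positional argument, typical of the Core CLI examples.
    parser = argparse.ArgumentParser(description="Count words in a file")
    parser.add_argument("path", help="input file")
    parser.add_argument("--unique", action="store_true", help="count distinct words only")
    args = parser.parse_args()

    with open(args.path) as f:
        words = f.read().split()
    print(len(set(words)) if args.unique else len(words))

if __name__ == "__main__":
    main()
```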
We ran Claude Code unattended for 13 hours:
| Metric | Target | Actual |
|---|---|---|
| Commits | 5 | 12 (240%) |
| Duration | 6 hrs | 13 hrs |
| Tickets | - | DEPYLER-0616→0627 |
The LLM fixed bugs, wrote tests, passed clippy, and committed—no human intervention.
Top rustc errors from transpilation attempts:
| Code | Count | Issue |
|---|---|---|
| E0308 | 1,050 | Type mismatch |
| E0433 | 706 | Failed to resolve |
| E0599 | 543 | Method not found |
| E0425 | 392 | Cannot find value |
| E0277 | 380 | Trait bound not satisfied |
Each error type becomes training data for the oracle.
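A distribution like the table above can be reproduced by tallying the `code` field of collected diagnostics, building on the collection sketch earlier; the counts shown in the comment are illustrative:

```python
from collections import Counter

def error_histogram(diagnostics: list[dict]) -> Counter:
    """Tally rustc error codes across all transpilation attempts."""
    return Counter(d["code"] for d in diagnostics)

# e.g. error_histogram(all_diags).most_common(3)
# -> [('E0308', 1050), ('E0433', 706), ('E0599', 543)]
```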
The corpus implements quality gates inspired by Lean/Toyota principles:
| Gate | Tool | Purpose |
|---|---|---|
| Golden Traces | `make corpus-golden-export` | 50 human-verified fix patterns |
| Clippy Gate | `make corpus-clippy-check` | Idiomatic Rust verification |
| HITL Review | `make corpus-hitl-sample` | Quarterly expert review (5% sample) |
See docs/specifications/corpus-quality-review.md for the full Toyota Way design review.
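The HITL gate's 5% draw can be approximated with a simple deterministic sample over the example directories. This is a sketch of the idea only; the actual `make corpus-hitl-sample` target may work differently:

```python
import random
from pathlib import Path

def hitl_sample(examples_dir: str = "examples", rate: float = 0.05, seed: int = 42) -> list[Path]:
    """Pick a reproducible 5% sample of examples for quarterly expert review."""
    examples = sorted(Path(examples_dir).glob("example_*"))
    rng = random.Random(seed)
    k = max(1, int(len(examples) * rate))
    return rng.sample(examples, k)

for path in hitl_sample():
    print(path)
```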
```bash
# Analyze golden trace candidates
make corpus-golden-analyze

# Run clippy quality gate
make corpus-clippy-check

# Generate HITL review sample
make corpus-hitl-sample
```

```text
reprorusted-python-cli/
├── examples/        # 298 Python CLI examples
│   ├── example_*/   # Individual examples with tests
├── docs/            # Specifications and diagrams
├── scripts/         # Automation
└── Makefile         # CITL commands
```
```bash
# Transpile single example
depyler transpile examples/example_simple/trivial_cli.py

# Train oracle from corpus
depyler oracle train --corpus ./examples

# Export for downstream ML
depyler oracle export-oip --output ./citl.jsonl
```

| Project | Role |
|---|---|
| depyler | Python→Rust transpiler |
| aprender | ML library, .apr format |
| entrenar | CITL pattern storage |
| alimentar | Dataset loading & publishing |
| renacer | Decision trace ingestion |
This corpus is available on HuggingFace for ML training:
```python
from datasets import load_dataset

ds = load_dataset("paiml/depyler-citl")

# 606 Python→Rust pairs, 436 with successful transpilation
for row in ds["train"]:
    print(f"{row['python_file']}: {row['python_lines']} → {row['rust_lines']} lines")
```

📦 Dataset: huggingface.co/datasets/paiml/depyler-citl
- Wang et al. (2022). Compilable Neural Code Generation with Compiler Feedback. ACL.
- Yasunaga & Liang (2020). Graph-based, Self-Supervised Program Repair from Diagnostic Feedback. ICML.
- Dou et al. (2024). StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback. arXiv.
MIT License - see LICENSE
```bibtex
@software{reprorusted_python_cli,
  title  = {CITL Training Corpus for Depyler},
  author = {PAIML},
  year   = {2025},
  url    = {https://github.com/paiml/reprorusted-python-cli}
}
```