Compiler-in-the-Loop Training Corpus for Python→Rust Transpilation
Use LLMs to bootstrap traditional ML, not as a runtime dependency.
This corpus captures decision traces from autonomous LLM sessions, persists them to .apr format, and trains a local ML oracle. After N sessions, the oracle handles common cases without API calls.
| Phase | Model | Cost |
|---|---|---|
| Bootstrap | LLM (Claude/GPT) | $$/hour |
| Capture | Decision traces → .apr | One-time |
| Steady-state | HNSW + Tarantula | ~$0 |
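In steady state, the LLM call is replaced by local retrieval over the captured patterns. Below is a minimal sketch of that lookup, using brute-force cosine similarity as a stand-in for the HNSW index and hypothetical pattern data; this is illustrative only and not the actual aprender/entrenar API:

```python
import numpy as np

# Hypothetical (error, fix) patterns captured during the bootstrap phase.
patterns = [
    {"error": "E0308 mismatched types: expected `i64`, found `&str`",
     "fix": "parse the argument: `value.parse::<i64>()?`"},
    {"error": "E0599 no method named `append` found for struct `Vec`",
     "fix": "use `Vec::push` for single elements"},
]
# Stand-in embeddings; the real oracle would embed diagnostics consistently.
embeddings = np.random.default_rng(0).random((len(patterns), 64))

def suggest_fix(query: np.ndarray) -> str:
    """Return the fix attached to the nearest stored pattern (cosine similarity)."""
    sims = embeddings @ query / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))
    return patterns[int(np.argmax(sims))]["fix"]

print(suggest_fix(np.random.default_rng(1).random(64)))
```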
298 Python CLI examples with 6,745 tests, designed for depyler transpiler training:
| Metric | Value |
|---|---|
| Python Examples | 298 |
| Test Coverage | 6,745 passing |
| Transpilation Rate | 78.5% (476/606) |
| Clippy Clean | 0.4% (progressive) |
| Golden Traces | 50 patterns |
The remaining compilation failures are training signal—each error becomes an (error, fix) pattern for the oracle.
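As an illustration, one such pattern could be a record that pairs the diagnostic with the code delta that resolved it. The schema below is hypothetical, not the actual .apr layout:

```python
from dataclasses import dataclass

@dataclass
class FixPattern:
    """One (error, fix) training record; field names are illustrative."""
    error_code: str      # e.g. "E0308"
    error_message: str   # rendered rustc diagnostic
    before_snippet: str  # Rust emitted by the transpiler
    after_snippet: str   # Rust after the fix was applied
    source_example: str  # originating Python example

pattern = FixPattern(
    error_code="E0308",
    error_message="mismatched types: expected `i64`, found `&str`",
    before_snippet="let n: i64 = args.value;",
    after_snippet="let n: i64 = args.value.parse()?;",
    source_example="examples/example_simple/trivial_cli.py",
)
```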
```bash
git clone https://github.com/paiml/reprorusted-python-cli.git
cd reprorusted-python-cli
make install      # Install dependencies
make test         # Validate corpus (6,745 tests)
make citl-train   # Train oracle from diagnostics
```

Extract additional training data from CPython stdlib doctests:

```bash
make extract-cpython-doctests   # Requires alimentar
```

This extracts ~1,700 doctests to data/corpora/cpython-doctests.parquet. See docs/corpus-extraction.md for full reproducibility details.
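A quick way to inspect the extracted file (assuming pandas with parquet support is installed; the column layout is whatever alimentar emits, so the sketch prints it rather than assuming names):

```python
import pandas as pd

# Path produced by the extraction step above.
df = pd.read_parquet("data/corpora/cpython-doctests.parquet")
print(len(df), "rows")
print(list(df.columns))
print(df.head())
```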
```text
Python Corpus → Depyler → Rust Code → rustc → Diagnostics → Oracle
      ↑                                                        │
      └──────────────────── patterns.apr ◄─────────────────────┘
```
Each transpilation attempt generates compiler diagnostics. These accumulate in .apr format, training the oracle to suggest fixes for future errors.
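The diagnostic-collection half of that loop can be sketched in a few lines: compile each generated Rust file with rustc's `--error-format=json` flag and keep the coded errors. This is a simplified stand-in for the depyler/entrenar pipeline, and the paths are illustrative:

```python
import json
import subprocess
from pathlib import Path

def collect_diagnostics(rust_file: Path) -> list[dict]:
    """Compile one transpiled Rust file and return its coded rustc errors."""
    proc = subprocess.run(
        ["rustc", "--edition", "2021", "--error-format=json",
         str(rust_file), "-o", "/dev/null"],  # discard any binary output
        capture_output=True,
        text=True,
    )
    diagnostics = []
    for line in proc.stderr.splitlines():
        try:
            msg = json.loads(line)  # one JSON diagnostic per stderr line
        except json.JSONDecodeError:
            continue
        if msg.get("code"):  # keep coded errors such as E0308, E0433, ...
            diagnostics.append({"code": msg["code"]["code"], "message": msg["message"]})
    return diagnostics

# Example: accumulate diagnostics for every generated .rs file.
# all_diags = [d for rs in Path("examples").rglob("*.rs") for d in collect_diagnostics(rs)]
```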
| Category | Count | Description |
|---|---|---|
| Core CLI | 20+ | argparse, flags, subcommands |
| String Ops | 15+ | split, join, format, strip |
| Math | 20+ | arithmetic, statistics |
| NumPy | 18 | array ops, linear algebra |
| Sklearn | 10 | regression, clustering |
| PyTorch | 10 | tensors, autograd |
| Async | 5 | async/await patterns |
| File I/O | 10+ | pathlib, csv, json |
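To give a sense of what a corpus entry looks like, here is a minimal argparse CLI in the style of the Core CLI category above. It is an illustrative sketch, not an actual file from examples/:

```python
import argparse

def main() -> None:
    # Flags plus a positional argument, typical of the Core CLI examples.
    parser = argparse.ArgumentParser(description="Count words in a file")
    parser.add_argument("path", help="input file")
    parser.add_argument("--unique", action="store_true", help="count distinct words only")
    args = parser.parse_args()

    with open(args.path) as f:
        words = f.read().split()
    print(len(set(words)) if args.unique else len(words))

if __name__ == "__main__":
    main()
```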
We ran Claude Code unattended for 13 hours:
| Metric | Target | Actual |
|---|---|---|
| Commits | 5 | 12 (240%) |
| Duration | 6 hrs | 13 hrs |
| Tickets | - | DEPYLER-0616→0627 |
The LLM fixed bugs, wrote tests, passed clippy, and committed—no human intervention.
Top rustc errors from transpilation attempts:
| Code | Count | Issue |
|---|---|---|
| E0308 | 1,050 | Type mismatch |
| E0433 | 706 | Failed to resolve |
| E0599 | 543 | Method not found |
| E0425 | 392 | Cannot find value |
| E0277 | 380 | Trait bound not satisfied |
Each error type becomes training data for the oracle.
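A distribution like the table above can be reproduced by tallying the `code` field of collected diagnostics, building on the collection sketch earlier; the counts shown in the comment are illustrative:

```python
from collections import Counter

def error_histogram(diagnostics: list[dict]) -> Counter:
    """Tally rustc error codes across all transpilation attempts."""
    return Counter(d["code"] for d in diagnostics)

# e.g. error_histogram(all_diags).most_common(3)
# -> [('E0308', 1050), ('E0433', 706), ('E0599', 543)]
```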
The corpus implements quality gates inspired by Lean/Toyota principles:
| Gate | Tool | Purpose |
|---|---|---|
| Golden Traces | `make corpus-golden-export` | 50 human-verified fix patterns |
| Clippy Gate | `make corpus-clippy-check` | Idiomatic Rust verification |
| HITL Review | `make corpus-hitl-sample` | Quarterly expert review (5% sample) |
See docs/specifications/corpus-quality-review.md for the full Toyota Way design review.
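The HITL gate's 5% draw can be approximated with a simple deterministic sample over the example directories. This is a sketch of the idea only; the actual `make corpus-hitl-sample` target may work differently:

```python
import random
from pathlib import Path

def hitl_sample(examples_dir: str = "examples", rate: float = 0.05, seed: int = 42) -> list[Path]:
    """Pick a reproducible 5% sample of examples for quarterly expert review."""
    examples = sorted(Path(examples_dir).glob("example_*"))
    rng = random.Random(seed)
    k = max(1, int(len(examples) * rate))
    return rng.sample(examples, k)

for path in hitl_sample():
    print(path)
```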
```bash
# Analyze golden trace candidates
make corpus-golden-analyze

# Run clippy quality gate
make corpus-clippy-check

# Generate HITL review sample
make corpus-hitl-sample
```

```text
reprorusted-python-cli/
├── examples/        # 298 Python CLI examples
│   ├── example_*/   # Individual examples with tests
├── docs/            # Specifications and diagrams
├── scripts/         # Automation
└── Makefile         # CITL commands
```
```bash
# Transpile single example
depyler transpile examples/example_simple/trivial_cli.py

# Train oracle from corpus
depyler oracle train --corpus ./examples

# Export for downstream ML
depyler oracle export-oip --output ./citl.jsonl
```

| Project | Role |
|---|---|
| depyler | Python→Rust transpiler |
| aprender | ML library, .apr format |
| entrenar | CITL pattern storage |
| alimentar | Dataset loading & publishing |
| renacer | Decision trace ingestion |
This corpus is available on HuggingFace for ML training:
```python
from datasets import load_dataset

ds = load_dataset("paiml/depyler-citl")

# 606 Python→Rust pairs, 436 with successful transpilation
for row in ds["train"]:
    print(f"{row['python_file']}: {row['python_lines']} → {row['rust_lines']} lines")
```

📦 Dataset: huggingface.co/datasets/paiml/depyler-citl
- Wang et al. (2022). Compilable Neural Code Generation with Compiler Feedback. ACL.
- Yasunaga & Liang (2020). Graph-based, Self-Supervised Program Repair from Diagnostic Feedback. ICML.
- Dou et al. (2024). StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback. arXiv.
MIT License - see LICENSE
```bibtex
@software{reprorusted_python_cli,
  title  = {CITL Training Corpus for Depyler},
  author = {PAIML},
  year   = {2025},
  url    = {https://github.com/paiml/reprorusted-python-cli}
}
```