An OpenEnv-compliant reinforcement learning environment for evaluating autonomous code refactoring agents. Agents iteratively apply AST transformations to real Python codebases, improving static analysis metrics while keeping the test suite green after every step.
No existing OpenEnv environment evaluates agents on code quality improvement, a task real developers perform daily. CRAiG fills this gap: agents must reason about code structure, apply targeted refactoring transformations, and verify correctness via tests. The result is an immediately useful benchmark for evaluating coding LLMs on more than test-passing correctness.
Each episode:
- Agent receives a Python codebase snapshot with quality metrics and test results
- Agent applies one refactoring transformation per step
- Tests run automatically — failures trigger rollback (environment stays valid)
- Reward = normalised metric improvement + coverage bonus − penalties
- Episode ends when step budget exhausted or target metric threshold reached
Key properties:
- Stateful, multi-step episodes (15–30 steps)
- Deterministic rollback on test failures
- Clear, measurable success criteria per task
- Meaningful partial credit throughout the trajectory
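The step contract above can be pictured with a toy stub. Everything here is invented for illustration (`ToyEnv`, its single "complexity" metric, the stand-in test check, the reward constants); the real RefactoringGym runs pytest and radon instead:

```python
# Minimal stub of the episode contract: commit on green tests, roll back on red.
class ToyEnv:
    def __init__(self, max_steps=3):
        self.max_steps = max_steps

    def reset(self):
        self.state = {"complexity": 10.0}
        self.steps_remaining = self.max_steps
        return dict(self.state)

    def _tests_pass(self, state):
        # Stand-in for running the suite: reject "refactors" that
        # would drive the metric negative.
        return state["complexity"] >= 0

    def step(self, delta):
        candidate = {"complexity": self.state["complexity"] - delta}
        self.steps_remaining -= 1
        if self._tests_pass(candidate):
            self.state = candidate               # commit the transformation
            reward = 0.5 + min(delta, 2.0) * 0.45 / 2.0
        else:
            reward = 0.40                        # rollback: state unchanged
        done = self.steps_remaining == 0 or self.state["complexity"] < 4.0
        return dict(self.state), reward, done

env = ToyEnv()
env.reset()
obs, r, done = env.step(2.0)     # valid refactor: complexity 10 -> 8
obs, r2, done = env.step(100.0)  # "breaks tests": rolled back, complexity stays 8
print(obs, r2)
```

Note that a failed step still consumes budget and returns a usable observation, so the environment never enters an invalid state.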
| Transformation | Parameters | Description |
|---|---|---|
| `extract_function` | `start_line: int, end_line: int, new_name: str` | Extract a line range into a new named function |
| `rename_symbol` | `old_name: str, new_name: str` | Rename a variable/function/class throughout a file |
| `split_class` | `class_name: str, methods: list[str], new_class_name: str` | Move methods to a new class |
| `remove_dead_code` | `symbol_name: str` | Delete an unused function or class (vulture-confirmed) |
| `extract_constant` | `value: str \| int \| float, name: str` | Replace magic literals with a named constant |
| `flatten_condition` | `function_name: str` | Convert nested if/else to early-return guard clauses |
Action JSON schema:

```json
{
  "file_path": "src/calculator.py",
  "transformation": "extract_function",
  "parameters": {"start_line": 3, "end_line": 5, "new_name": "validate_positive"}
}
```

| Field | Type | Description |
|---|---|---|
| `file_tree` | `list[str]` | All `.py` files relative to the workdir |
| `file_contents` | `dict[str, str]` | Source text of each file |
| `metrics.cyclomatic_complexity` | `float` | Average McCabe CC (radon) |
| `metrics.duplication_score` | `float` in [0, 1] | Fraction of similar function-body pairs |
| `metrics.test_coverage` | `float` in [0, 1] | Line coverage (pytest-cov) |
| `metrics.dead_code_ratio` | `float` in [0, 1] | Fraction of unused symbols (vulture) |
| `test_results` | object | `passed` / `failed` / `errored` / `total` / `output` |
| `steps_remaining` | `int` | Step budget left in this episode |
| `last_action_valid` | `bool` | Whether the previous action applied successfully |
| `last_action_error` | `str?` | Error message if the action was rejected or rolled back |
| ID | Name | Difficulty | Primary Metric | Target | Max Steps |
|---|---|---|---|---|---|
| `easy` | Duplicate Code Extraction | 🟢 Easy | `duplication_score` | < 0.10 | 15 |
| `medium` | Cyclomatic Complexity Reduction | 🟡 Medium | `cyclomatic_complexity` | < 4.0 | 20 |
| `hard` | God Class Decomposition | 🔴 Hard | combined | > 0.75 | 30 |
**Easy** — `scenarios/easy/src/calculator.py`

A geometry calculator with 10 functions (areas and perimeters for 5 shapes). Each function independently validates its inputs with near-identical guard code. The agent must extract the shared validation into a helper to reduce `duplication_score` from ~0.55 to < 0.10.
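A hypothetical before/after makes the easy task concrete. Function names and the exact guard logic here are invented for illustration (the real `calculator.py` has 10 such functions), but the shape of the edit is the one the `extract_function` transformation performs:

```python
import math

# Before: every function repeats the same guard, inflating duplication_score.
def circle_area_before(r):
    if not isinstance(r, (int, float)):
        raise TypeError("expected a number")
    if r <= 0:
        raise ValueError("expected a positive number")
    return math.pi * r ** 2

def square_area_before(side):
    if not isinstance(side, (int, float)):
        raise TypeError("expected a number")
    if side <= 0:
        raise ValueError("expected a positive number")
    return side ** 2

# After: the shared validation lives in one helper; each function calls it.
def validate_positive(x):
    if not isinstance(x, (int, float)):
        raise TypeError("expected a number")
    if x <= 0:
        raise ValueError("expected a positive number")

def circle_area(r):
    validate_positive(r)
    return math.pi * r ** 2

def square_area(side):
    validate_positive(side)
    return side ** 2

print(square_area(3))  # 9
```

Because behaviour is unchanged, the scenario's tests stay green while the near-identical function-body pairs disappear.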
**Medium** — `scenarios/medium/src/processor.py`

A request processor with deeply nested if/elif/else chains (CC ≈ 12). The agent must flatten conditions and extract helpers to bring average cyclomatic complexity below 4.0.

**Hard** — `scenarios/hard/src/manager.py`

An `ApplicationManager` god class spanning auth, data management, notifications, and reporting (18 methods, 2 dead-code methods, CC ≈ 11, `dead_code_ratio` ≈ 0.30). The agent must decompose the class while keeping 30 inter-coupled tests green.
```
reward = 0.5                              (base)
       + normalised_metric_delta × 0.45   (improvement signal)
       + coverage_bonus                   (+0.02 maintained, −0.05 dropped)
```

Penalties (applied instead of the improvement reward):

- invalid action: reward = 0.45
- test regression: reward = 0.40 (rollback applied)
- transform error: reward = 0.45

The reward is centred at 0.5 so penalties are visible and distinct from zero.
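The scheme above can be sketched as a single function. This is our reading of the description, not the environment's actual code: the clamped delta normalisation and the keyword names are assumptions:

```python
def compute_reward(metric_delta, max_delta, coverage_maintained,
                   *, invalid=False, test_regression=False, transform_error=False):
    """Sketch of the reward shape described above (assumed, not the env's code)."""
    if test_regression:
        return 0.40                       # rollback applied
    if invalid or transform_error:
        return 0.45
    # Assumed normalisation: clamp the raw delta to [-max_delta, max_delta].
    normalised = max(-1.0, min(1.0, metric_delta / max_delta))
    coverage_bonus = 0.02 if coverage_maintained else -0.05
    return 0.5 + normalised * 0.45 + coverage_bonus

r = compute_reward(1.0, 2.0, True)        # 0.5 base + 0.225 signal + 0.02 bonus
print(round(r, 3))                        # 0.745
print(compute_reward(0.0, 2.0, True, invalid=True))  # 0.45
```

Centring at 0.5 means a penalised step (0.40–0.45) still yields positive reward, but is always distinguishable from any step that made progress.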
```bash
pip install -e .

# Smoke test — all three scenarios should reset cleanly
python -c "
from refactoring_gym.env import RefactoringGym
for s in ['easy','medium','hard']:
    env = RefactoringGym(s)
    obs = env.reset()
    print(s, obs.metrics)
"

# Run gym unit tests
pytest tests/ -v

# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

Docker:

```bash
docker build -t craig:latest .
docker run -p 7860:7860 craig:latest

# Check health
curl http://localhost:7860/health

# Reset the easy scenario
curl -X POST "http://localhost:7860/reset?scenario=easy"
```

LLM inference:

```bash
# Copy and fill in your API key
cp .env.example .env   # then edit .env

python inference.py
# or with a specific model:
MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct python inference.py
```

Supported providers (set the relevant key in `.env`):
- Hugging Face Inference API — `HF_TOKEN=hf_...`
- OpenRouter — set `OPENROUTER_API_KEY` and update `base_url` in `inference.py`
Progress is streamed to stderr; structured JSON logs go to stdout.
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Liveness check |
| POST | `/reset?scenario=easy` | Start a new episode; returns an `Observation` |
| POST | `/step?scenario=easy` | Apply an action; returns `{observation, reward, done, info}` |
| GET | `/state?scenario=easy` | Serialisable environment state |
| GET | `/tasks` | List all tasks with metadata |
| Task | CC | Duplication | Coverage | Dead Code | Tests |
|---|---|---|---|---|---|
| `easy` | 7.00 | 0.53 | 0.91 | 0.00 | 17/17 ✅ |
| `medium` | 21.00 | 0.00 | 0.82 | 0.62 | 15/15 ✅ |
| `hard` | 7.00 | 0.00 | 0.93 | 0.08 | 31/31 ✅ |
Verify locally with no API key:

```python
from refactoring_gym.env import RefactoringGym

for s in ["easy", "medium", "hard"]:
    obs = RefactoringGym(s).reset()
    print(s, obs.metrics)
```

| Task | Grade | Steps | Notes |
|---|---|---|---|
| `easy` | 0.7602 | 7/15 | Best reached at step 4 (duplication 0.53 → 0.1319), then early-stopped by no-improvement patience |
| `medium` | 0.5000 | 4/20 | Medium bootstrap reduced max CC from 21.0 → 12.0 in step 1 |
| `hard` | 0.4964 | 7/30 | `remove_dead_code` actions improved dead-code ratio to 0.0; later `split_class` attempts regressed |
| mean | 0.5855 | — | Mean of easy/medium/hard grades |
Reproduced via per-scenario runs to conserve token budget:

```bash
python inference.py --scenario easy|medium|hard
```
```python
from pathlib import Path

from refactoring_gym.scenarios.base import Scenario, register
from refactoring_gym.models import CodeMetrics


@register("my_task")
class MyScenario(Scenario):
    snapshot_path = Path("scenarios/my_task")
    max_steps = 20
    max_single_step_delta = 2.0

    def is_solved(self, metrics: CodeMetrics) -> bool:
        return metrics.cyclomatic_complexity < 3.0

    def metric_delta(self, prev: CodeMetrics, new: CodeMetrics) -> float:
        return prev.cyclomatic_complexity - new.cyclomatic_complexity

    def grade(self, metrics: CodeMetrics) -> float:
        return max(0.0, min(1.0, (10.0 - metrics.cyclomatic_complexity) / 10.0))
```