Pandemonium-Research/CRAiG
CRAiG — Code Refactoring AI Gym


An OpenEnv-compliant reinforcement learning environment for evaluating autonomous code refactoring agents. Agents iteratively apply AST transformations to real Python codebases, improving static analysis metrics while keeping the test suite green after every step.

Motivation

No existing OpenEnv environment evaluates agents on code quality improvement — a task real developers perform daily. CRAiG fills this gap: agents must reason about code structure, apply targeted refactoring transformations, and verify correctness via tests. This makes it an immediately useful benchmark for evaluating coding LLMs on more than functional correctness.


Environment Description

Each episode:

  1. Agent receives a Python codebase snapshot with quality metrics and test results
  2. Agent applies one refactoring transformation per step
  3. Tests run automatically — failures trigger rollback (environment stays valid)
  4. Reward = normalised metric improvement + coverage bonus − penalties
  5. Episode ends when step budget exhausted or target metric threshold reached

Key properties:

  • Stateful, multi-step episodes (15–30 steps)
  • Deterministic rollback on test failures
  • Clear, measurable success criteria per task
  • Meaningful partial credit throughout the trajectory
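The episode protocol above can be sketched as a generic client-side loop. This is a sketch, not the gym's actual driver: `env` stands in for either `RefactoringGym` or an HTTP client wrapper, and `choose_action` is whatever policy the agent implements (both names are illustrative).

```python
def run_episode(env, choose_action, max_steps=30):
    """Roll out one episode: reset, then step until done or the budget runs out."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # Each step applies one refactoring transformation; failures roll back.
        obs, reward, done, info = env.step(choose_action(obs))
        total_reward += reward
        if done:  # step budget exhausted, or target metric threshold reached
            break
    return total_reward
```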

Action Space

| Transformation | Parameters | Description |
| --- | --- | --- |
| `extract_function` | `start_line: int`, `end_line: int`, `new_name: str` | Extract a line range into a new named function |
| `rename_symbol` | `old_name: str`, `new_name: str` | Rename a variable/function/class throughout the file |
| `split_class` | `class_name: str`, `methods: list[str]`, `new_class_name: str` | Move methods to a new class |
| `remove_dead_code` | `symbol_name: str` | Delete an unused function or class (vulture-confirmed) |
| `extract_constant` | `value: str\|int\|float`, `name: str` | Replace a magic literal with a named constant |
| `flatten_condition` | `function_name: str` | Convert nested if/else to early-return guard clauses |
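As an illustration of the kind of rewrite `flatten_condition` performs (the function and its names are hypothetical, not taken from the scenarios):

```python
def discount_nested(price, is_member):
    # Before: nested if/else inflates cyclomatic complexity
    if price > 0:
        if is_member:
            return price * 0.9
        else:
            return price
    else:
        return 0.0

def discount_flat(price, is_member):
    # After: early-return guard clauses, behaviour unchanged
    if price <= 0:
        return 0.0
    if not is_member:
        return price
    return price * 0.9
```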

Action JSON schema:

{
  "file_path": "src/calculator.py",
  "transformation": "extract_function",
  "parameters": {"start_line": 3, "end_line": 5, "new_name": "validate_positive"}
}
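A minimal client-side sanity check against this schema might look like the following. `REQUIRED_PARAMS` mirrors the transformation table above and is an assumption about what the server validates, not its actual implementation:

```python
# Assumed parameter sets per transformation, transcribed from the action table.
REQUIRED_PARAMS = {
    "extract_function": {"start_line", "end_line", "new_name"},
    "rename_symbol": {"old_name", "new_name"},
    "split_class": {"class_name", "methods", "new_class_name"},
    "remove_dead_code": {"symbol_name"},
    "extract_constant": {"value", "name"},
    "flatten_condition": {"function_name"},
}

def validate_action(action: dict) -> bool:
    """Check the top-level keys and that parameters match the chosen transformation."""
    if {"file_path", "transformation", "parameters"} - action.keys():
        return False
    required = REQUIRED_PARAMS.get(action["transformation"])
    return required is not None and set(action["parameters"]) == required

action = {
    "file_path": "src/calculator.py",
    "transformation": "extract_function",
    "parameters": {"start_line": 3, "end_line": 5, "new_name": "validate_positive"},
}
```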

Observation Space

| Field | Type | Description |
| --- | --- | --- |
| `file_tree` | `list[str]` | All `.py` files relative to the workdir |
| `file_contents` | `dict[str, str]` | Source text of each file |
| `metrics.cyclomatic_complexity` | `float` | Average McCabe CC (radon) |
| `metrics.duplication_score` | `float` | [0, 1] fraction of similar function-body pairs |
| `metrics.test_coverage` | `float` | [0, 1] line coverage (pytest-cov) |
| `metrics.dead_code_ratio` | `float` | [0, 1] fraction of unused symbols (vulture) |
| `test_results` | object | `passed` / `failed` / `errored` / `total` / `output` |
| `steps_remaining` | `int` | Budget left in this episode |
| `last_action_valid` | `bool` | Whether the previous action applied successfully |
| `last_action_error` | `str?` | Error message if the action was rejected or rolled back |

Tasks

| ID | Name | Difficulty | Primary Metric | Target | Max Steps |
| --- | --- | --- | --- | --- | --- |
| easy | Duplicate Code Extraction | 🟢 Easy | duplication_score | < 0.10 | 15 |
| medium | Cyclomatic Complexity Reduction | 🟡 Medium | cyclomatic_complexity | < 4.0 | 20 |
| hard | God Class Decomposition | 🔴 Hard | combined | > 0.75 | 30 |

Task Details

Easy: scenarios/easy/src/calculator.py
A geometry calculator with 10 functions (areas + perimeters for 5 shapes). Each function independently validates its inputs with near-identical guard code. Agent must extract the shared validation into a helper to reduce duplication_score from ~0.55 to < 0.10.

Medium: scenarios/medium/src/processor.py
A request processor with deeply nested if/elif/else chains (CC ≈ 12). Agent must flatten conditions and extract helpers to bring average cyclomatic complexity below 4.0.

Hard: scenarios/hard/src/manager.py
An ApplicationManager god class spanning auth, data management, notifications, and reporting (18 methods, 2 dead-code methods, CC ≈ 11, dead_code_ratio ≈ 0.30). Agent must decompose the class while keeping 30 inter-coupled tests green.


Reward Function

Reward = 0.5 (base)
       + normalised_metric_delta × 0.45   (improvement signal)
       + coverage_bonus                    (+0.02 maintained, -0.05 dropped)

Penalties (applied instead of improvement reward):
  invalid action  : reward = 0.45
  test regression : reward = 0.40  (rollback applied)
  transform error : reward = 0.45

Reward is centred at 0.5 so penalties are visible and distinct from zero.
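The rule above can be sketched as a single function. The constants come directly from the text; the function shape, in particular the `outcome` flag, is an assumption about how the gym combines them:

```python
def compute_reward(normalised_delta: float, coverage_kept: bool, outcome: str = "ok") -> float:
    """outcome is one of 'ok', 'invalid', 'regression', 'transform_error'."""
    if outcome in ("invalid", "transform_error"):
        return 0.45  # penalty replaces the improvement reward
    if outcome == "regression":
        return 0.40  # tests failed; the environment rolled the change back
    coverage_bonus = 0.02 if coverage_kept else -0.05
    return 0.5 + normalised_delta * 0.45 + coverage_bonus
```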


Setup

Local development

pip install -e .

# Smoke test — all three scenarios should reset cleanly
python -c "
from refactoring_gym.env import RefactoringGym
for s in ['easy','medium','hard']:
    env = RefactoringGym(s)
    obs = env.reset()
    print(s, obs.metrics)
"

# Run gym unit tests
pytest tests/ -v

# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 7860

Docker

docker build -t craig:latest .
docker run -p 7860:7860 craig:latest

# Check health
curl http://localhost:7860/health

# Reset easy scenario
curl -X POST "http://localhost:7860/reset?scenario=easy"

Inference baseline

# Copy and fill in your API key
cp .env.example .env   # then edit .env

python inference.py
# or with a specific model:
MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct python inference.py

Supported providers (set HF_TOKEN in .env):

Progress is streamed to stderr; structured JSON logs go to stdout.


API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | `/health` | Liveness check |
| POST | `/reset?scenario=easy` | Start a new episode, returns an Observation |
| POST | `/step?scenario=easy` | Apply an action, returns `{observation, reward, done, info}` |
| GET | `/state?scenario=easy` | Serialisable environment state |
| GET | `/tasks` | List all tasks with metadata |
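A small stdlib-only client for these endpoints could look like this; the base URL matches the uvicorn/Docker setup above, and the helper names are illustrative:

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # server started via uvicorn or Docker as shown above

def endpoint_url(path, scenario=None):
    """Build a full endpoint URL, appending the scenario query parameter when given."""
    query = f"?scenario={scenario}" if scenario else ""
    return f"{BASE_URL}{path}{query}"

def post(path, scenario, body=None):
    """POST an optional JSON body and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        endpoint_url(path, scenario), data=data,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running server):
# obs = post("/reset", "easy")
# result = post("/step", "easy", {"file_path": "src/calculator.py",
#                                 "transformation": "remove_dead_code",
#                                 "parameters": {"symbol_name": "unused_helper"}})
```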

Baseline Scores

Initial environment metrics (reproducible, no API required)

| Task | CC | Duplication | Coverage | Dead Code | Tests |
| --- | --- | --- | --- | --- | --- |
| easy | 7.00 | 0.53 | 0.91 | 0.00 | 17/17 ✅ |
| medium | 21.00 | 0.00 | 0.82 | 0.62 | 15/15 ✅ |
| hard | 7.00 | 0.00 | 0.93 | 0.08 | 31/31 ✅ |

Verify locally with no API key:

from refactoring_gym.env import RefactoringGym
for s in ["easy", "medium", "hard"]:
    obs = RefactoringGym(s).reset()
    print(s, obs.metrics)

Agent baseline — llama-3.3-70b-versatile via Groq (https://api.groq.com/openai/v1)

| Task | Grade | Steps | Notes |
| --- | --- | --- | --- |
| easy | 0.7602 | 7/15 | Best reached at step 4 (duplication 0.53 → 0.1319), then early-stopped by no-improvement patience |
| medium | 0.5000 | 4/20 | Medium bootstrap reduced max CC from 21.0 → 12.0 in step 1 |
| hard | 0.4964 | 7/30 | remove_dead_code actions improved the dead-code ratio to 0.0; later split_class attempts regressed |
| mean | 0.5855 | — | Mean of easy/medium/hard grades |

Reproduced via per-scenario runs to conserve token budget: python inference.py --scenario easy|medium|hard


Extending with Custom Scenarios

from pathlib import Path
from refactoring_gym.scenarios.base import Scenario, register
from refactoring_gym.models import CodeMetrics

@register("my_task")
class MyScenario(Scenario):
    snapshot_path = Path("scenarios/my_task")
    max_steps = 20
    max_single_step_delta = 2.0

    def is_solved(self, metrics: CodeMetrics) -> bool:
        return metrics.cyclomatic_complexity < 3.0

    def metric_delta(self, prev: CodeMetrics, new: CodeMetrics) -> float:
        return prev.cyclomatic_complexity - new.cyclomatic_complexity

    def grade(self, metrics: CodeMetrics) -> float:
        return max(0.0, min(1.0, (10.0 - metrics.cyclomatic_complexity) / 10.0))
