Refact-Bench is a benchmarking tool designed to evaluate AI models on software engineering tasks using the SWE-Bench framework. It provides a standardized environment for testing model performance on real-world programming challenges extracted from GitHub issues.
Before installing Refact-Bench, ensure you have the following:
- Python 3.7 or higher
- Docker installed and running
- Git
- pip package manager
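If you want to confirm the prerequisites before proceeding, a quick check from the shell (standard commands only) looks like this:

```bash
# Verify the toolchain is present and the Docker daemon is reachable.
python3 --version        # should report 3.7 or newer
git --version
pip --version
docker info > /dev/null && echo "Docker daemon is running"
```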
First, install the required Python packages:
```bash
pip install -e .
```
This will install all dependencies listed in setup.py, including the refact package.
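To confirm the editable install succeeded, you can ask pip about the refact dependency mentioned above. The import below is a sketch that assumes the core package is importable under the refact_scenarios name used for its directory in the project layout at the end of this document:

```bash
# Sanity-check the install; the import name refact_scenarios is an assumption
# based on the package directory name.
pip show refact
python -c "import refact_scenarios; print('refact_scenarios importable')"
```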
Clone the Refact repository and build the necessary components. To reproduce SWE evaluation results, you need to use the following branches of refact:
- https://github.com/smallcloudai/refact/tree/swe-boosted-prompt for SWE-lite
- https://github.com/smallcloudai/refact/tree/swe-boosted-prompt-verified for SWE-verified
```bash
git clone https://github.com/smallcloudai/refact.git
pip install -e ./refact/refact-agent/engine/python_binding_and_cmdline
fakeide compile-static-lsp release
```
This step compiles the Language Server Protocol (LSP) implementation needed for code analysis.
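The clone command above fetches refact's default branch. If you want one of the SWE branches listed earlier checked out from the start, git's --branch flag does that in one step (a sketch; pick the branch that matches your task set):

```bash
# Clone the SWE-lite branch directly; use swe-boosted-prompt-verified for SWE-verified.
git clone --branch swe-boosted-prompt https://github.com/smallcloudai/refact.git
```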
```bash
cd ./refact/refact-agent/engine/
cargo build --release
mkdir -p ./python_binding_and_cmdline/refact/bin/
cp ./target/release/refact-lsp ./python_binding_and_cmdline/refact/bin/refact-lsp
```
This builds the Rust-based LSP binary and places it in the correct location for the Python package to use.
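As a quick sanity check (run from ./refact/refact-agent/engine/, where the previous commands left off), confirm the binary landed where the Python package expects it:

```bash
# The path mirrors the cp destination above.
test -x ./python_binding_and_cmdline/refact/bin/refact-lsp \
  && echo "refact-lsp binary in place" \
  || echo "refact-lsp binary missing"
```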
Create a Docker integration configuration file:
```bash
mkdir -p ~/.config/refact/integrations.d/
```
Then set up the Docker integration configuration (this will overwrite any existing Docker integration config):
```bash
cat > ~/.config/refact/integrations.d/docker.yaml << 'EOF'
label: ref
docker_daemon_address: ''
docker_cli_path: docker
remote_docker: false
ssh_host: ''
ssh_user: root
ssh_port: '22'
ssh_identity_file: ''
available:
  on_your_laptop: true
  when_isolated: true
confirmation:
  ask_user: []
  deny: []
EOF
```
This configuration allows Refact-Bench to use Docker for creating isolated environments for each benchmark task.
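The same file also covers the remote case: if your Docker daemon lives on another machine, the ssh_* fields above can point at it. The following is a sketch with placeholder values, an alternative to the block above that likewise overwrites the existing config; the field meanings are inferred from their names, so adjust host, user, and key path to your environment:

```bash
cat > ~/.config/refact/integrations.d/docker.yaml << 'EOF'
label: ref
docker_daemon_address: ''
docker_cli_path: docker
remote_docker: true                   # run docker on the remote host instead of locally
ssh_host: docker-host.example.com     # placeholder hostname
ssh_user: root
ssh_port: '22'
ssh_identity_file: ~/.ssh/id_ed25519  # placeholder key path
available:
  on_your_laptop: true
  when_isolated: true
confirmation:
  ask_user: []
  deny: []
EOF
```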
To run a benchmark task, use the fakeide run command. For example, to run the swe-verified tasks using the claude-3-7-sonnet model:
```bash
fakeide run --api-key <REFACT-API-KEY> --model claude-3-7-sonnet --docker tasks/swe/verified --experiment my-experiment
```
Replace <REFACT-API-KEY> with your Refact API key and my-experiment with a name to group your benchmark runs.
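The same command shape works for the other task sets. For example, assuming the SWE-Bench Lite tasks have been generated under tasks/swe/lite (see the project layout below), a Lite run looks like:

```bash
fakeide run --api-key <REFACT-API-KEY> --model claude-3-7-sonnet --docker tasks/swe/lite --experiment my-lite-experiment
```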
To collect the results after running tasks:
```bash
fakeide collect --experiment my-experiment
```
The results of the benchmark will be stored in ./results/.
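The layout under ./results/ is not documented here, so the simplest way to see what a run produced is to list the directory:

```bash
ls -l ./results/
```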
If you want to test models on your self-hosted server, specify the --address-url parameter with your local address:
```bash
fakeide run --address-url http://localhost:8080 --api-key <API-KEY> --model refact/claude-3-7-sonnet --docker tasks/swe/verified
```
Note: your server should be started on 0.0.0.0. A common use case with a remote node is:
```bash
ssh <node-name> -L 0.0.0.0:8008:0.0.0.0:8008
```
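Putting the two together, assuming your self-hosted server listens on 0.0.0.0 on the node and you forward its port (8008 in the example above) to your machine, a possible workflow is:

```bash
# Keep the tunnel open in the background (-N: no remote command, -f: go to background),
# then point fakeide at the forwarded port. Node name and port are placeholders.
ssh <node-name> -N -f -L 0.0.0.0:8008:0.0.0.0:8008
fakeide run --address-url http://localhost:8008 --api-key <API-KEY> --model refact/claude-3-7-sonnet --docker tasks/swe/verified
```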
The translation.py script in the tasks/swe directory is used to prepare SWE-Bench tasks for evaluation. It converts SWE-Bench datasets into the format required by Refact-Bench:
```bash
cd tasks/swe
python translation.py
```
This script processes the SWE-Bench datasets (Lite, Lite-dev, and Verified) and generates the necessary task files in the respective directories.
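Once translation.py finishes, the task directories named in the project layout below should be populated; a quick look (the exact file format is not documented here):

```bash
# Still inside tasks/swe from the previous step.
ls lite/ lite-dev/ verified/
```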
The main components of Refact-Bench are:
- refact_scenarios/: Core Python package with the implementation of the benchmarking framework
- tasks/: Contains the benchmark tasks
  - swe/: SWE-Bench related tasks
    - verified/: Verified SWE-Bench tasks
    - lite/: SWE-Bench Lite tasks
    - lite-dev/: Development subset of SWE-Bench Lite
    - translation.py: Script to prepare SWE-Bench tasks
- fakeide-logs/: Contains logs from benchmark runs
If something goes wrong during a run, check the logs in the fakeide-logs/ directory.
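The log layout under fakeide-logs/ is not documented here, so a reasonable starting point is the most recently modified entries:

```bash
ls -lt fakeide-logs/ | head
```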