Regret-Guided Search Control for Efficient Learning in AlphaZero

This is the official repository of the ICLR 2026 paper Regret-Guided Search Control for Efficient Learning in AlphaZero.

If you use this work for research, please consider citing our paper as follows:

@inproceedings{
    tsai2026rgsc,
    title={Regret-Guided Search Control for Efficient Learning in AlphaZero},
    author={Yun-Jui Tsai and Wei-Yu Chen and Yan-Ru Ju and Yu-Hung Chang and Ti-Rong Wu},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=Eoiu5iJD71}
}

This repository is built upon MiniZero. The following sections provide the training results, trained models, and instructions to reproduce the experiments in the main text.

Overview

RGSC extends AlphaZero by identifying high-regret states and prioritizing them as search control openings for self-play in board games. RGSC guides self-play to begin from states with higher regret, where regret highlights positions that the current agent has not yet mastered.
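As a rough illustration of the idea only (not the repository's implementation; the buffer layout, the `1/temperature` exponent, and all names below are assumptions), selecting a self-play opening from a regret-prioritized buffer might look like:

```python
import random

def sample_opening(buffer, temperature=0.1, opening_ratio=0.5):
    """Pick a self-play opening from a regret-prioritized buffer (sketch).

    With probability `opening_ratio`, sample a stored state with probability
    proportional to regret ** (1 / temperature); otherwise return None,
    meaning self-play starts from the empty board.
    `buffer` holds (state, regret) pairs.
    """
    if not buffer or random.random() >= opening_ratio:
        return None  # start from the empty board
    weights = [max(regret, 1e-8) ** (1.0 / temperature) for _, regret in buffer]
    threshold = random.uniform(0.0, sum(weights))
    for (state, _), weight in zip(buffer, weights):
        threshold -= weight
        if threshold <= 0:
            return state
    return buffer[-1][0]  # numerical-safety fallback
```

With a low temperature such as 0.1, the sampling distribution concentrates sharply on the highest-regret openings.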

Training

Models in the RGSC paper fall into two categories: those trained from random weights and those trained from a well-trained model. Section 4.2 compares models trained from random weights, and Section 4.3 covers models trained from a well-trained model.

Prerequisites

The RGSC program requires a Linux platform with at least one NVIDIA GPU. First, clone this repository:

git clone git@github.com:rlglab/rgsc
cd rgsc

Training from random weights

Train models with the script tools/quick-run.sh:

tools/quick-run.sh train GAME_TYPE az END_ITER -conf_file CONFIG_FILE -conf_str CONF_STR
  • GAME_TYPE sets the target game, e.g., go, othello, hex.
  • END_ITER sets the total number of iterations for training, e.g., 300.
  • CONFIG_FILE specifies a configuration file; example configuration files are provided in cfg_example:
    • cfg_example/go_9x9_az.cfg for 9x9 Go
    • cfg_example/othello_10x10_az.cfg for 10x10 Othello
    • cfg_example/hex_11x11_az.cfg for 11x11 Hex

Details of the configuration settings are listed in Other Tips.

Other Tips
  • CONF_STR sets additional configurations on top of the configuration file, e.g.,
    • zero_num_training_step_per_iteration: the number of game steps per iteration; 160000 for 9x9 Go, 120000 for 11x11 Hex and 10x10 Othello
    • learner_use_regrethead: whether to train the regret ranking head and regret value head; the default is True
    • env_buf_opening_ratio: the ratio of self-play games that sample an opening from the Prioritized Replay Buffer; the default is 0.5
    • env_buf_opening_decay_update_ratio: the EMA ratio for buffer updates; the default is 0.5
    • env_buf_size: the PRB size per worker; the default is 100
    • env_buf_sampling_temperature: the temperature for buffer sampling; the default is 0.1
    • env_buf_prioritization_method: the buffer sampling method, proportional or rank; the default is proportional
    • env_enable_mcts_node_buffer: whether to store MCTS nodes; the default is True
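For intuition, the EMA buffer update and the two prioritization methods can be sketched as below. This is an illustrative sketch under assumed formulas, not the repository's code: the `1/temperature` exponent, the `1/rank` weighting, and the function names are all assumptions.

```python
def ema_update(old_regret, new_regret, decay=0.5):
    """Blend a stored opening's regret with a newly observed one,
    illustrating env_buf_opening_decay_update_ratio (exact formula assumed)."""
    return decay * old_regret + (1.0 - decay) * new_regret

def sampling_probabilities(regrets, method="proportional", temperature=0.1):
    """Turn regret values into buffer-sampling probabilities (sketch).

    proportional: weight ~ regret ** (1 / temperature)
    rank:         weight ~ (1 / rank) ** (1 / temperature), rank 1 = highest regret
    """
    if method == "proportional":
        weights = [max(r, 1e-8) ** (1.0 / temperature) for r in regrets]
    else:  # rank-based
        order = sorted(range(len(regrets)), key=lambda i: -regrets[i])
        rank_of = {i: rank for rank, i in enumerate(order, start=1)}
        weights = [(1.0 / rank_of[i]) ** (1.0 / temperature) for i in range(len(regrets))]
    total = sum(weights)
    return [w / total for w in weights]
```

Rank-based prioritization depends only on the ordering of regrets, which makes it less sensitive to outlier regret values than the proportional method.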

Commands for reproducing experiments:

# Section 4.2: training go model without RGSC (purely AlphaZero)
tools/quick-run.sh train go az 300 -conf_file cfg_example/go_9x9_az.cfg -conf_str learner_use_rankinghead=false:env_buf_opening_ratio=0:env_enable_mcts_node_buffer=false:env_buf_size=0
# Section 4.2: training go model 
tools/quick-run.sh train go az 300 -conf_file cfg_example/go_9x9_az.cfg
# Section 4.2: training othello model 
tools/quick-run.sh train othello az 300 -conf_file cfg_example/othello_10x10_az.cfg
# Section 4.2: training hex model 
tools/quick-run.sh train hex az 300 -conf_file cfg_example/hex_11x11_az.cfg
# Section 4.4: training go model with the regret head to select openings
tools/quick-run.sh train go az 300 -conf_file cfg_example/go_9x9_az.cfg -conf_str learner_use_regrethead=true:learner_use_rankinghead=false

Training from a well-trained model

First, download the well-trained model and place it as follows:

rgsc/
├── well-trained_model/                # <--place it here and rename
│   ├── model/                         # well-trained_model model
│   └── sgf/                           # well-trained_model sgf

Train models by the script, tools/quick-run.sh:

tools/quick-run.sh train GAME_TYPE az END_ITER -conf_file CONFIG_FILE -conf_str CONF_STR -continue true

Commands for reproducing experiments:

# Section 4.3: continue training go model with RGSC
tools/quick-run.sh train go az 1840 -conf_file cfg_example/go_9x9_az_continue.cfg -continue true
# Section 4.3: continue training go model without RGSC (purely AlphaZero)
tools/quick-run.sh train go az 1840 -conf_file cfg_example/go_9x9_az_continue.cfg -conf_str learner_use_rankinghead=false:env_buf_opening_ratio=0:env_enable_mcts_node_buffer=false:env_buf_size=0 -continue true

Training Results

After training, the output folder is structured as follows:

# Format of folder name:
# "go_9x9": game name
# "az": alphazero algorithm
# "3bx256": network architecture, 3b represents 3 residual blocks and 256 represents 256 filters
# "n200": number of simulations used in MCTS
# "e3b8a0": git commit hash
go_9x9_az_3bx256_n200-e3b8a0
├── analysis/                        # figures of the training process
│   ├── accuracy_policy.png          # accuracy for policy network
│   ├── Lengths.png                  # self-play game lengths
│   ├── loss_policy.png              # loss for policy network
│   ├── loss_ranking.png             # loss for regret ranking network
│   ├── loss_regret.png              # loss for regret value network
│   ├── loss_value.png               # loss for value network
│   ├── Returns.png                  # self-play game returns
│   └── Time.png                     # elapsed training time
├── buffer/                          # openings in the buffer at the end of each iteration
│   └── *.sgf                        # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration, respectively
├── model/                           # all network models produced by each optimization step
│   ├── *.pkl                        # include training step, parameters, optimizer, scheduler
│   └── *.pt                         # model parameters only (use for testing)
├── opening/                         # sampled openings for each self-play game; empty when games start from the empty board
│   └── *.sgf                        # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration, respectively
├── sgf/                             # self-play games of each iteration
│   └── *.sgf                        # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration, respectively
├── go_9x9_az_3bx256_n200-e3b8a0.cfg # configuration file
├── op.log                           # the optimization worker log
├── Training.log                     # the main training log
└── Worker.log                       # the worker connection log

Evaluation

After training a model, you can evaluate it:

# Run evaluation matches between two models
tools/quick-run.sh fight_eval FOLDER1 FOLDER2 [CONF_FILE1] [CONF_FILE2] [INTERVAL] [GAME_NUM]
# To follow the setting of our experiment (e.g. in 9x9 Go)
tools/quick-run.sh fight_eval FOLDER1 FOLDER2 cfg_example/go_9x9_eval.cfg 15 200
  • FOLDER1, FOLDER2: the two model folders to be evaluated. e.g., go_9x9_az_3bx256_n200-e3b8a0.
  • CONF_FILE1, CONF_FILE2: the configuration files (*.cfg) to use; if CONF_FILE2 is unspecified, CONF_FILE1 is used for both models.
  • INTERVAL: the iteration interval between each evaluated model pair, e.g., 15.
  • GAME_NUM: the number of games to play for each model pair, e.g., 200.
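The Elo curves in Figures 4-6 are derived from pairwise win rates. The standard logistic conversion from a win rate to an Elo difference is shown below as a sketch; the repository's plotting scripts may compute it differently, and the function name is an assumption.

```python
import math

def elo_difference(win_rate):
    """Estimated Elo gap of a model over its opponent, given its win rate.

    Uses the standard logistic Elo model; clamps the rate to avoid log(0).
    """
    win_rate = min(max(win_rate, 1e-6), 1.0 - 1e-6)
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))
```

For example, winning 150 of 200 games (a 75% win rate) corresponds to roughly +191 Elo.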

After running the matches between models, the match records will be generated in FOLDER1, and you can plot the results:

# Launch a container
scripts/start-container.sh
# Section 4.2: RGSC in board games (Figure 4)
python tools/plot_together_eval.py -in_dir plot_destination -cfg plotinits/elo.ini
# Section 4.3: RGSC on well-trained models (Figure 5)
python tools/plot_together_eval.py -in_dir plot_destination -cfg plotinits/elo_well_trained.ini
# Section 4.4: Comparison between ranking and regret in RGSC (Figure 6)
python tools/plot_together_eval.py -in_dir plot_destination -cfg plotinits/rank_mse.ini
  • You can then update the [directories] section in plotinits/elo.ini by adding an entry in the format dir_NUM=FOLDER1 to plot the Elo curves shown in Figures 4-6.

If you need the models from the final iteration for Section 4.2 and Section 4.3, you can get them from the trained models in the paper.
For the evaluation of Table 1, you can find the opponent used in the paper from the following links:

Analysis

Plot the regret change chart of a training run:

# Launch a container
scripts/start-container.sh
# Install the required package
pip install seaborn
# Section 4.5: Regret change in prioritized regret buffer (Figure 8)
python3 tools/regret_plot.py

After running the command, the chart will be saved to regret_buffer_chart/regret_distribution.png.

About

[ICLR 2026] A regret-guided search control method that extends AlphaZero with regret-guided restarts for more efficient and robust learning in board games
