This is the official repository of the ICLR 2026 paper Regret-Guided Search Control for Efficient Learning in AlphaZero.
If you use this work for research, please consider citing our paper as follows:
@inproceedings{
tsai2026rgsc,
title={Regret-Guided Search Control for Efficient Learning in AlphaZero},
author={Yun-Jui Tsai and Wei-Yu Chen and Yan-Ru Ju and Yu-Hung Chang and Ti-Rong Wu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=Eoiu5iJD71}
}
This repository is built upon MiniZero. The following sections provide the training results, trained models, and instructions to reproduce the experiments in the main text.
Outline
RGSC extends AlphaZero by identifying and prioritizing high-regret states as search control openings for self-play in board games.
It guides self-play to begin from states with higher regret, where regret reflects positions that the current agent has not yet mastered.
Models in the RGSC paper fall into two types: those trained from random weights and those trained from the weights of a well-trained model. Section 4.2 compares models trained from random weights, and Section 4.3 covers models trained from a well-trained model's weights.
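The opening-selection idea above can be sketched as follows. This is a minimal illustration: `choose_opening` and its arguments are hypothetical names, and the default ratio mirrors the `env_buf_opening_ratio` option described under Other Tips, but this is not the repository's actual implementation.

```python
import random

def choose_opening(buffer_sample, opening_ratio=0.5):
    """Decide where a self-play game starts.

    With probability `opening_ratio` (cf. `env_buf_opening_ratio`), the game
    starts from a high-regret opening sampled from the prioritized buffer;
    otherwise it starts from the empty board. Illustrative sketch only, not
    the repository's implementation.
    """
    if buffer_sample is not None and random.random() < opening_ratio:
        return buffer_sample
    return "empty board"
```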
The RGSC program requires a Linux platform with at least one NVIDIA GPU to operate. First, clone this repository:
```bash
git clone git@github.com:rlglab/rgsc
cd rgsc
```
Train models with the script `tools/quick-run.sh`:
```bash
tools/quick-run.sh train GAME_TYPE az END_ITER -conf_file CONFIG_FILE -conf_str CONF_STR
```
- `GAME_TYPE` sets the target game, e.g., `go`, `othello`, `hex`.
- `END_ITER` sets the total number of iterations for training, e.g., `300`.
- `CONFIG_FILE` specifies a configuration file; we have provided configuration files in `cfg_example`:
  - `cfg_example/go_9x9_az.cfg` for 9x9 Go
  - `cfg_example/othello_10x10_az.cfg` for 10x10 Othello
  - `cfg_example/hex_11x11_az.cfg` for 11x11 Hex
Details of the configuration settings are listed in Other Tips.
Other Tips
- `CONF_STR` sets additional configurations on top of the configuration file, e.g.:
  - `zero_num_training_step_per_iteration`: number of game steps per iteration; 160000 for 9x9 Go, 120000 for 11x11 Hex and 10x10 Othello.
  - `learner_use_regrethead`: whether to train the regret ranking head and regret value head; the default is `true`.
  - `env_buf_opening_ratio`: ratio for sampling an opening from the Prioritized Replay Buffer; the default is `0.5`.
  - `env_buf_opening_decay_update_ratio`: EMA ratio for buffer updates; the default is `0.5`.
  - `env_buf_size`: PRB size per worker; the default is `100`.
  - `env_buf_sampling_temperature`: temperature for buffer sampling; the default is `0.1`.
  - `env_buf_prioritization_method`: buffer sampling method, `proportional` or `rank`; the default is `proportional`.
  - `env_enable_mcts_node_buffer`: whether to store MCTS nodes; the default is `true`.
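The buffer-related options above can be illustrated with a small sketch of prioritized sampling and the EMA priority update. The function names and the `(opening, priority)` buffer layout are hypothetical, not the repository's API; the sketch only shows how `env_buf_prioritization_method`, `env_buf_sampling_temperature`, and `env_buf_opening_decay_update_ratio` interact conceptually.

```python
import random

def sample_opening(buffer, method="proportional", temperature=0.1):
    """Sample one opening from a prioritized buffer.

    `buffer` is a list of (opening, priority) pairs. Illustrative sketch of
    proportional vs. rank-based prioritization; not the repository's code.
    """
    if method == "proportional":
        scores = [p for _, p in buffer]
    else:  # "rank": higher priority -> rank 1 -> larger score
        order = sorted(range(len(buffer)), key=lambda i: -buffer[i][1])
        rank = {i: r + 1 for r, i in enumerate(order)}
        scores = [1.0 / rank[i] for i in range(len(buffer))]
    # A low temperature (e.g., the default 0.1) sharpens the distribution
    # toward the highest-priority openings; a high temperature flattens it.
    weights = [s ** (1.0 / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices([o for o, _ in buffer], weights=probs, k=1)[0]

def ema_update(old_priority, new_priority, decay=0.5):
    """EMA priority update, mirroring `env_buf_opening_decay_update_ratio`."""
    return decay * old_priority + (1.0 - decay) * new_priority
```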
Commands for reproducing experiments:
```bash
# Section 4.2: training go model without RGSC (purely AlphaZero)
tools/quick-run.sh train go az 300 -conf_file cfg_example/go_9x9_az.cfg -conf_str learner_use_rankinghead=false:env_buf_opening_ratio=0:env_enable_mcts_node_buffer=false:env_buf_size=0
# Section 4.2: training go model
tools/quick-run.sh train go az 300 -conf_file cfg_example/go_9x9_az.cfg
# Section 4.2: training othello model
tools/quick-run.sh train othello az 300 -conf_file cfg_example/othello_10x10_az.cfg
# Section 4.2: training hex model
tools/quick-run.sh train hex az 300 -conf_file cfg_example/hex_11x11_az.cfg
# Section 4.4: training go model with regret head to select
tools/quick-run.sh train go az 300 -conf_file cfg_example/go_9x9_az.cfg -conf_str learner_use_regrethead=true:learner_use_rankinghead=false
```
First, download the `well-trained_model` and place it here:
```
rgsc/
├── well-trained_model/   # <-- place it here and rename
│   ├── model/            # well-trained model weights
│   └── sgf/              # well-trained model self-play games
```
Train models with the script `tools/quick-run.sh`:
```bash
tools/quick-run.sh train GAME_TYPE az END_ITER -conf_file CONFIG_FILE -conf_str CONF_STR -continue true
```
Commands for reproducing experiments:
```bash
# Section 4.3: continue training go model with RGSC
tools/quick-run.sh train go az 1840 -conf_file cfg_example/go_9x9_az_continue.cfg -continue true
# Section 4.3: continue training go model without RGSC (purely AlphaZero)
tools/quick-run.sh train go az 1840 -conf_file cfg_example/go_9x9_az_continue.cfg -conf_str learner_use_rankinghead=false:env_buf_opening_ratio=0:env_enable_mcts_node_buffer=false:env_buf_size=0 -continue true
```
After training, the folder will look like:
```
# Format of folder name:
# "go_9x9": game name
# "az": AlphaZero algorithm
# "3bx256": network architecture, 3b represents 3 residual blocks and 256 represents 256 filters
# "n200": number of simulations used in MCTS
# "e3b8a0": git commit hash
go_9x9_az_3bx256_n200-e3b8a0
├── analysis/                        # figures of the training process
│   ├── accuracy_policy.png          # accuracy of the policy network
│   ├── Lengths.png                  # self-play game lengths
│   ├── loss_policy.png              # loss of the policy network
│   ├── loss_ranking.png             # loss of the regret ranking network
│   ├── loss_regret.png              # loss of the regret value network
│   ├── loss_value.png               # loss of the value network
│   ├── Returns.png                  # self-play game returns
│   └── Time.png                     # elapsed training time
├── buffer/                          # openings in the buffer when each iteration finishes
│   └── *.sgf                        # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration
├── model/                           # all network models produced by each optimization step
│   ├── *.pkl                        # include training step, parameters, optimizer, scheduler
│   └── *.pt                         # model parameters only (used for testing)
├── opening/                         # sampled opening for each self-play game; empty for games starting from the empty board
│   └── *.sgf                        # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration
├── sgf/                             # self-play games of each iteration
│   └── *.sgf                        # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration, respectively
├── go_9x9_az_3bx256_n200-e3b8a0.cfg # configuration file
├── op.log                           # the optimization worker log
├── Training.log                     # the main training log
└── Worker.log                       # the worker connection log
```
After training a model, you can run evaluation:
```bash
# Fight between the two models
tools/quick-run.sh fight_eval FOLDER1 FOLDER2 [CONF_FILE1] [CONF_FILE2] [INTERVAL] [GAME_NUM]
# To follow the settings of our experiments (e.g., in 9x9 Go)
tools/quick-run.sh fight_eval FOLDER1 FOLDER2 cfg_example/go_9x9_eval.cfg 15 200
```
- `FOLDER1`, `FOLDER2`: the two model folders to be evaluated, e.g., `go_9x9_az_3bx256_n200-e3b8a0`.
- `CONF_FILE1`, `CONF_FILE2`: the configuration files (`*.cfg`) to use; if `CONF_FILE2` is unspecified, `CONF_FILE1` is used.
- `INTERVAL`: the iteration interval between each evaluated model pair, e.g., `15`.
- `GAME_NUM`: the number of games to play for each model pair, e.g., `200`.
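The evaluation results are plotted as Elo curves. As background, a head-to-head win rate from match records can be converted into a relative Elo difference with the standard logistic Elo model; this is a generic sketch, and the paper's exact Elo computation and anchoring may differ.

```python
import math

def elo_difference(wins, games):
    """Elo difference implied by a head-to-head win rate.

    Standard logistic Elo model: a 0.5 win rate means 0 Elo difference,
    and roughly a 0.64 win rate means about +100 Elo. Generic sketch, not
    the repository's evaluation code.
    """
    w = wins / games
    w = min(max(w, 1e-6), 1 - 1e-6)  # clip to avoid log of 0
    return 400.0 * math.log10(w / (1.0 - w))
```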
After running the matches between models, the match records will be generated in FOLDER1, and you can plot the results:
```bash
# Launch a container
scripts/start-container.sh
# Section 4.2: RGSC in board games (Figure 4)
python tools/plot_together_eval.py -in_dir plot_destination -cfg plotinits/elo.ini
# Section 4.3: RGSC on well-trained models (Figure 5)
python tools/plot_together_eval.py -in_dir plot_destination -cfg plotinits/elo_well_trained.ini
# Section 4.4: Comparison between ranking and regret in RGSC (Figure 6)
python tools/plot_together_eval.py -in_dir plot_destination -cfg plotinits/rank_mse.ini
```
You can update the `[directories]` section in `plotinits/elo.ini` by adding an entry in the format `dir_NUM=FOLDER1` to plot the Elo curves shown in Figures 4-6.
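For example, a hypothetical `[directories]` entry using the example folder name from this README:

```ini
[directories]
dir_1=go_9x9_az_3bx256_n200-e3b8a0
```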
If you need the models from the final iteration for Section 4.2 and Section 4.3, you can get them from the trained models in the paper.
For the evaluation in Table 1, you can find the opponents used in the paper at the following links:
Plot the regret change chart of a training model:
```bash
# Launch a container
scripts/start-container.sh
# Install the required package
pip install seaborn
# Section 4.5: Regret change in prioritized regret buffer (Figure 8)
python3 tools/regret_plot.py
```
After running the command, the chart will be saved to `regret_buffer_chart/regret_distribution.png`.