Regret-Guided Search Control for Efficient Learning in AlphaZero

This is the official repository of the ICLR 2026 paper Regret-Guided Search Control for Efficient Learning in AlphaZero.

If you use this work for research, please consider citing our paper as follows:

@inproceedings{
    tsai2026rgsc,
    title={Regret-Guided Search Control for Efficient Learning in AlphaZero},
    author={Yun-Jui Tsai and Wei-Yu Chen and Yan-Ru Ju and Yu-Hung Chang and Ti-Rong Wu},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=Eoiu5iJD71}
}

This repository is built upon MiniZero. The following sections provide the training results, trained models, and instructions to reproduce the experiments in the main text.

Overview

RGSC extends AlphaZero by identifying high-regret states and prioritizing them as search control openings for self-play in board games. RGSC guides self-play to begin from states with higher regret, where regret highlights positions that the current agent has not yet mastered.
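As a rough illustration of the idea only (not the repository's implementation; the buffer layout, the `1/temperature` exponent, and all names below are assumptions), selecting a self-play opening from a regret-prioritized buffer might look like:

```python
import random

def sample_opening(buffer, temperature=0.1, opening_ratio=0.5):
    """Pick a self-play opening from a regret-prioritized buffer (sketch).

    With probability `opening_ratio`, sample a stored state with probability
    proportional to regret ** (1 / temperature); otherwise return None,
    meaning self-play starts from the empty board.
    `buffer` holds (state, regret) pairs.
    """
    if not buffer or random.random() >= opening_ratio:
        return None  # start from the empty board
    weights = [max(regret, 1e-8) ** (1.0 / temperature) for _, regret in buffer]
    threshold = random.uniform(0.0, sum(weights))
    for (state, _), weight in zip(buffer, weights):
        threshold -= weight
        if threshold <= 0:
            return state
    return buffer[-1][0]  # numerical-safety fallback
```

With a low temperature such as 0.1, the sampling distribution concentrates sharply on the highest-regret openings.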

Training

Models in the RGSC paper fall into two categories: those trained from random weights and those trained from a well-trained model. Section 4.2 compares models trained from random weights, and Section 4.3 covers models trained from a well-trained model.

Prerequisites

The RGSC program requires a Linux platform with at least one NVIDIA GPU. First, clone this repository:

git clone git@github.com:rlglab/rgsc
cd rgsc

Training from random weights

Train models with the script tools/quick-run.sh:

tools/quick-run.sh train GAME_TYPE az END_ITER -conf_file CONFIG_FILE -conf_str CONF_STR
  • GAME_TYPE sets the target game, e.g., go, othello, hex.
  • END_ITER sets the total number of iterations for training, e.g., 300.
  • CONFIG_FILE specifies a configuration file; example configuration files are provided in cfg_example:
    • cfg_example/go_9x9_az.cfg for 9x9 Go
    • cfg_example/othello_10x10_az.cfg for 10x10 Othello
    • cfg_example/hex_11x11_az.cfg for 11x11 Hex

Details of the configuration settings are listed in Other Tips.

Other Tips
  • CONF_STR sets additional configurations on top of the configuration file, e.g.,
    • zero_num_training_step_per_iteration: the number of game steps per iteration; 160000 for 9x9 Go, 120000 for 11x11 Hex and 10x10 Othello
    • learner_use_regrethead: whether to train the regret ranking head and regret value head; the default is True
    • env_buf_opening_ratio: the ratio of self-play games that sample an opening from the Prioritized Replay Buffer; the default is 0.5
    • env_buf_opening_decay_update_ratio: the EMA ratio for buffer updates; the default is 0.5
    • env_buf_size: the PRB size per worker; the default is 100
    • env_buf_sampling_temperature: the temperature for buffer sampling; the default is 0.1
    • env_buf_prioritization_method: the buffer sampling method, proportional or rank; the default is proportional
    • env_enable_mcts_node_buffer: whether to store MCTS nodes; the default is True
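For intuition, the EMA buffer update and the two prioritization methods can be sketched as below. This is an illustrative sketch under assumed formulas, not the repository's code: the `1/temperature` exponent, the `1/rank` weighting, and the function names are all assumptions.

```python
def ema_update(old_regret, new_regret, decay=0.5):
    """Blend a stored opening's regret with a newly observed one,
    illustrating env_buf_opening_decay_update_ratio (exact formula assumed)."""
    return decay * old_regret + (1.0 - decay) * new_regret

def sampling_probabilities(regrets, method="proportional", temperature=0.1):
    """Turn regret values into buffer-sampling probabilities (sketch).

    proportional: weight ~ regret ** (1 / temperature)
    rank:         weight ~ (1 / rank) ** (1 / temperature), rank 1 = highest regret
    """
    if method == "proportional":
        weights = [max(r, 1e-8) ** (1.0 / temperature) for r in regrets]
    else:  # rank-based
        order = sorted(range(len(regrets)), key=lambda i: -regrets[i])
        rank_of = {i: rank for rank, i in enumerate(order, start=1)}
        weights = [(1.0 / rank_of[i]) ** (1.0 / temperature) for i in range(len(regrets))]
    total = sum(weights)
    return [w / total for w in weights]
```

Rank-based prioritization depends only on the ordering of regrets, which makes it less sensitive to outlier regret values than the proportional method.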

Commands for reproducing experiments:

# Section 4.2: training go model without RGSC (purely AlphaZero)
tools/quick-run.sh train go az 300 -conf_file cfg_example/go_9x9_az.cfg -conf_str learner_use_rankinghead=false:env_buf_opening_ratio=0:env_enable_mcts_node_buffer=false:env_buf_size=0
# Section 4.2: training go model 
tools/quick-run.sh train go az 300 -conf_file cfg_example/go_9x9_az.cfg
# Section 4.2: training othello model 
tools/quick-run.sh train othello az 300 -conf_file cfg_example/othello_10x10_az.cfg
# Section 4.2: training hex model 
tools/quick-run.sh train hex az 300 -conf_file cfg_example/hex_11x11_az.cfg
# Section 4.4: training go model with the regret head to select openings
tools/quick-run.sh train go az 300 -conf_file cfg_example/go_9x9_az.cfg -conf_str learner_use_regrethead=true:learner_use_rankinghead=false

Training from a well-trained model

First, download the well-trained model and place it as follows:

rgsc/
├── well-trained_model/                # <--place it here and rename
│   ├── model/                         # well-trained_model model
│   └── sgf/                           # well-trained_model sgf

Train models by the script, tools/quick-run.sh:

tools/quick-run.sh train GAME_TYPE az END_ITER -conf_file CONFIG_FILE -conf_str CONF_STR -continue true

Commands for reproducing experiments:

# Section 4.3: continue training go model with RGSC
tools/quick-run.sh train go az 1840 -conf_file cfg_example/go_9x9_az_continue.cfg -continue true
# Section 4.3: continue training go model without RGSC (purely AlphaZero)
tools/quick-run.sh train go az 1840 -conf_file cfg_example/go_9x9_az_continue.cfg -conf_str learner_use_rankinghead=false:env_buf_opening_ratio=0:env_enable_mcts_node_buffer=false:env_buf_size=0 -continue true

Training Results

After training, the output folder is structured as follows:

# Format of folder name:
# "go_9x9": game name
# "az": alphazero algorithm
# "3bx256": network architecture, 3b represents 3 residual blocks and 256 represents 256 filters
# "n200": number of simulations used in MCTS
# "e3b8a0": git commit hash
go_9x9_az_3bx256_n200-e3b8a0
├── analysis/                        # figures of the training process
│   ├── accuracy_policy.png          # accuracy for policy network
│   ├── Lengths.png                  # self-play game lengths
│   ├── loss_policy.png              # loss for policy network
│   ├── loss_ranking.png             # loss for regret ranking network
│   ├── loss_regret.png              # loss for regret value network
│   ├── loss_value.png               # loss for value network
│   ├── Returns.png                  # self-play game returns
│   └── Time.png                     # elapsed training time
├── buffer/                          # openings in the buffer at the end of each iteration
│   └── *.sgf                        # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration, respectively
├── model/                           # all network models produced by each optimization step
│   ├── *.pkl                        # include training step, parameters, optimizer, scheduler
│   └── *.pt                         # model parameters only (use for testing)
├── opening/                         # sampled openings for each self-play game; empty when games start from the empty board
│   └── *.sgf                        # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration, respectively
├── sgf/                             # self-play games of each iteration
│   └── *.sgf                        # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration, respectively
├── go_9x9_az_3bx256_n200-e3b8a0.cfg # configuration file
├── op.log                           # the optimization worker log
├── Training.log                     # the main training log
└── Worker.log                       # the worker connection log

Evaluation

After training a model, you can evaluate it:

# Run evaluation matches between two models
tools/quick-run.sh fight_eval FOLDER1 FOLDER2 [CONF_FILE1] [CONF_FILE2] [INTERVAL] [GAME_NUM]
# To follow the setting of our experiment (e.g. in 9x9 Go)
tools/quick-run.sh fight_eval FOLDER1 FOLDER2 cfg_example/go_9x9_eval.cfg 15 200
  • FOLDER1, FOLDER2: the two model folders to be evaluated. e.g., go_9x9_az_3bx256_n200-e3b8a0.
  • CONF_FILE1, CONF_FILE2: the configuration files (*.cfg) to use; if CONF_FILE2 is unspecified, CONF_FILE1 is used for both models.
  • INTERVAL: the iteration interval between each evaluated model pair, e.g., 15.
  • GAME_NUM: the number of games to play for each model pair, e.g., 200.
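The Elo curves in Figures 4-6 are derived from pairwise win rates. The standard logistic conversion from a win rate to an Elo difference is shown below as a sketch; the repository's plotting scripts may compute it differently, and the function name is an assumption.

```python
import math

def elo_difference(win_rate):
    """Estimated Elo gap of a model over its opponent, given its win rate.

    Uses the standard logistic Elo model; clamps the rate to avoid log(0).
    """
    win_rate = min(max(win_rate, 1e-6), 1.0 - 1e-6)
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))
```

For example, winning 150 of 200 games (a 75% win rate) corresponds to roughly +191 Elo.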

After running the matches between models, the match records will be generated in FOLDER1, and you can plot the results:

# Launch a container
scripts/start-container.sh
# Section 4.2: RGSC in board games (Figure 4)
python tools/plot_together_eval.py -in_dir plot_destination -cfg plotinits/elo.ini
# Section 4.3: RGSC on well-trained models (Figure 5)
python tools/plot_together_eval.py -in_dir plot_destination -cfg plotinits/elo_well_trained.ini
# Section 4.4: Comparison between ranking and regret in RGSC (Figure 6)
python tools/plot_together_eval.py -in_dir plot_destination -cfg plotinits/rank_mse.ini
  • You can then update the [directories] section in plotinits/elo.ini by adding an entry in the format dir_NUM=FOLDER1 to plot the Elo curves shown in Figures 4-6.

If you need the models from the final iteration for Section 4.2 and Section 4.3, you can get them from the trained models in the paper.
For the evaluation of Table 1, you can find the opponent used in the paper from the following links:

Analysis

Plot the regret change chart of a training run:

# Launch a container
scripts/start-container.sh
# Install the required package
pip install seaborn
# Section 4.5: Regret change in prioritized regret buffer (Figure 8)
python3 tools/regret_plot.py

After running the command, the chart will be saved to regret_buffer_chart/regret_distribution.png.

About

[ICLR 2026] A regret-guided search control method that extends AlphaZero with regret-guided restarts for more efficient and robust learning in board games
