Hi, thanks for releasing the Tree-GRPO code and experiments.
I noticed a possible late-stage training instability in the provided
`multihopqa-grpo-group4-qwen2.5_7b` run. Specifically, the logged metric
`critic/score/mean` first increases normally and plateaus around 0.6 for a long
stretch, but then drops sharply near the end of training.
In addition, I ran a similar experiment with Qwen2.5-3B-instruct and observed
the same phenomenon: the reward/score collapses, and in my run the final
scores are all zeros.
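
To narrow this down on my side, I dumped the last-step rollouts and checked whether the responses still contain a parseable answer span. This is only a sketch of what I ran: the `<answer>...</answer>` tag format, the JSONL field name, and the file path are my assumptions about how the repo formats rollouts, not something I verified against your reward code.

```python
import json
import re

# Hypothetical sanity check: if the policy stops emitting well-formed
# <answer>...</answer> tags, an exact-match-style reward returns 0 for
# every sample, which would explain all-zero scores without any change
# in the evaluation or logging code itself.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_rate(rollout_path: str) -> float:
    """Fraction of rollouts whose response contains a parseable answer tag."""
    total = parsed = 0
    with open(rollout_path) as f:
        for line in f:
            response = json.loads(line)["response"]  # assumed JSONL field name
            total += 1
            if ANSWER_RE.search(response):
                parsed += 1
    return parsed / max(total, 1)

print(parse_rate("rollouts_last_step.jsonl"))  # assumed dump path
```

In my 3B run the parse rate fell together with the score, which is why I suspect format collapse rather than a logging bug, but I may be misreading it.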
Could you help clarify whether this behavior is expected? In particular:
- Is the drop in `critic/score/mean` caused by policy degradation, reward
  parsing failures, or an evaluation/logging issue?
- Are there recommended hyperparameters to avoid this collapse, e.g., a smaller
  learning rate, stronger KL regularization, early stopping, or a smaller
  rollout/tree expansion budget? (For early stopping, a sketch of what I have
  in mind follows this list.)
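
For the early-stopping option, this is the kind of guard I was imagining on top of the logged metric. The JSONL metrics path and key layout are assumptions on my part, not your actual logging format:

```python
import json

# Hypothetical collapse guard: signal a stop (or a restore of the last
# good checkpoint) once critic/score/mean sits far below its running best
# for several consecutive logged steps. Adapt the path/key to the logger.
def should_stop(metrics_path: str, key: str = "critic/score/mean",
                drop_ratio: float = 0.5, patience: int = 5) -> bool:
    best, below = 0.0, 0
    with open(metrics_path) as f:
        for line in f:
            score = json.loads(line).get(key)
            if score is None:
                continue
            best = max(best, score)
            # count consecutive steps far below the best score seen so far
            below = below + 1 if score < drop_ratio * best else 0
            if below >= patience:
                return True
    return False

print(should_stop("metrics.jsonl"))  # assumed metrics dump
```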
Thanks!