Hi, thanks for releasing the Tree-GRPO code and experiments.
I noticed a possible late-stage training instability in the provided
`multihopqa-grpo-group4-qwen2.5_7b` run. Specifically, the logged metric
`critic/score/mean` first increases normally and plateaus around 0.6 for a long
stretch, but then drops sharply near the end of training.
In addition, I ran a similar experiment with Qwen2.5-3B-instruct and observed
the same phenomenon: the reward/score collapses, and in my run the final
scores are all zeros.
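
To narrow this down on my side, I dumped the last-step rollouts and checked whether the responses still contain a parseable answer span. This is only a sketch of what I ran: the `<answer>...</answer>` tag format, the JSONL field name, and the file path are my assumptions about how the repo formats rollouts, not something I verified against your reward code.

```python
import json
import re

# Hypothetical sanity check: if the policy stops emitting well-formed
# <answer>...</answer> tags, an exact-match-style reward returns 0 for
# every sample, which would explain all-zero scores without any change
# in the evaluation or logging code itself.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_rate(rollout_path: str) -> float:
    """Fraction of rollouts whose response contains a parseable answer tag."""
    total = parsed = 0
    with open(rollout_path) as f:
        for line in f:
            response = json.loads(line)["response"]  # assumed JSONL field name
            total += 1
            if ANSWER_RE.search(response):
                parsed += 1
    return parsed / max(total, 1)

print(parse_rate("rollouts_last_step.jsonl"))  # assumed dump path
```

In my 3B run the parse rate fell together with the score, which is why I suspect format collapse rather than a logging bug, but I may be misreading it.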
Could you help clarify whether this behavior is expected? In particular:
- Is the drop in `critic/score/mean` caused by policy degradation, reward
  parsing failures, or an evaluation/logging issue?
- Are there recommended hyperparameters to avoid this collapse, e.g., a smaller
  learning rate, stronger KL regularization, early stopping, or a smaller
  rollout/tree expansion budget? (For early stopping, a sketch of what I have
  in mind follows this list.)
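
For the early-stopping option, this is the kind of guard I was imagining on top of the logged metric. The JSONL metrics path and key layout are assumptions on my part, not your actual logging format:

```python
import json

# Hypothetical collapse guard: signal a stop (or a restore of the last
# good checkpoint) once critic/score/mean sits far below its running best
# for several consecutive logged steps. Adapt the path/key to the logger.
def should_stop(metrics_path: str, key: str = "critic/score/mean",
                drop_ratio: float = 0.5, patience: int = 5) -> bool:
    best, below = 0.0, 0
    with open(metrics_path) as f:
        for line in f:
            score = json.loads(line).get(key)
            if score is None:
                continue
            best = max(best, score)
            # count consecutive steps far below the best score seen so far
            below = below + 1 if score < drop_ratio * best else 0
            if below >= patience:
                return True
    return False

print(should_stop("metrics.jsonl"))  # assumed metrics dump
```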
Thanks!