Late-stage collapse of critic/score/mean in multihopqa-grpo-group4-qwen2.5_7b #5

@e3trange

Hi, thanks for releasing the Tree-GRPO code and experiments.

I noticed a possible late-stage training instability in the provided
multihopqa-grpo-group4-qwen2.5_7b run. Specifically, the logged metric
critic/score/mean first increases normally and stays around 0.6 for a long
period, but then suddenly decreases sharply near the end of training.

I also ran a similar experiment with Qwen2.5-3B-Instruct and observed the
same pattern: the reward/score collapses, and in my run the final scores
become all zeros.

Could you help clarify whether this behavior is expected? In particular:

  1. Is the drop in critic/score/mean caused by policy degradation,
    reward parsing failures, or some evaluation/logging issue?
  2. Are there recommended hyperparameters to avoid this collapse, e.g.,
    smaller learning rate, stronger KL regularization, early stopping, or a
    smaller rollout/tree expansion budget?
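
To make the early-stopping option concrete, here is a minimal sketch of the kind of check I have in mind. It is not taken from the Tree-GRPO code: the function name `find_collapse_step`, the `window` and `drop_ratio` parameters, and the example values are all my own assumptions. It only needs the logged critic/score/mean values as a plain list.

```python
from statistics import mean


def find_collapse_step(scores, window=5, drop_ratio=0.5):
    """Return the first index where the trailing-window mean of `scores`
    falls below `drop_ratio` times the best trailing-window mean seen so
    far, or None if that never happens.

    Hypothetical helper for illustration; `window` and `drop_ratio` are
    assumed defaults, not values recommended by the repository.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    best = None
    for i in range(window - 1, len(scores)):
        current = mean(scores[i - window + 1 : i + 1])
        # Flag a collapse once the smoothed score drops well below its peak.
        if best is not None and best > 0 and current < drop_ratio * best:
            return i
        if best is None or current > best:
            best = current
    return None


# Synthetic values shaped like the curve described above (not real logs).
example_scores = [0.2, 0.4, 0.6, 0.6, 0.6, 0.6, 0.6, 0.1, 0.0, 0.0, 0.0]
collapse_index = find_collapse_step(example_scores, window=3)
```

A check like this could be used to stop training, or to keep the last checkpoint saved before the flagged step, but it does not say anything about the cause of the drop.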

Thanks!
