
T-GRPO training from official SFT checkpoint: loss always 0, reward always 0, all_wrong=1.0 #101

@Quibbler6

Description

Hi, I started a T-GRPO training run from the official SFT checkpoint and ran into this: after a few steps, the loss is always 0, the reward is always 0, and all_wrong=1.0.
I'm wondering what the problem is. I'm using the dataset you provided,
and because flash-attention doesn't support torch.float32, I added this line:
`model_init_kwargs.setdefault("dtype", torch.bfloat16)`
Besides this, there are no major changes.
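
For context, here is a minimal sketch of roughly where that line sits in my setup. This is only an illustration, assuming `model_init_kwargs` is the kwargs dict that eventually gets forwarded to `from_pretrained()` (as in TRL-style trainers); the model name and surrounding code are placeholders, not the repo's actual code:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Assumed: the trainer builds this dict and forwards it to from_pretrained().
model_init_kwargs = {"attn_implementation": "flash_attention_2"}

# flash-attention kernels don't support torch.float32, so default to bfloat16
# unless a dtype was already specified.
# Note: older transformers versions expect the key "torch_dtype" instead of "dtype".
model_init_kwargs.setdefault("dtype", torch.bfloat16)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder checkpoint path
    **model_init_kwargs,
)
```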

```
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
Caching is incompatible with gradient checkpointing in Qwen2_5_VLDecoderLayer. Setting past_key_values=None.
UserWarning: None of the inputs have requires_grad=True. Gradients will be None
_problem_id: 73498
prompt_length: 227
Invalidate trace cache @ step 0 and module 3048: cache has only 0 modules
tensor([0., 0., 0., 0.], device='cuda:2')
tensor([222, 768, 768, 768], device='cuda:2')
tensor([0., 0., 0., 0.], device='cuda:0')
tensor([222, 768, 768, 768], device='cuda:0')
tensor([0., 0., 0., 0.], device='cuda:1')
tensor([222, 768, 768, 768], device='cuda:1')
tensor([0., 0., 0., 0.], device='cuda:3')
tensor([768, 768, 250, 332], device='cuda:3')
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 9.999999977182372e-07, 'completion_length': 606.0, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.0, 'all_wrong': 1.0, 'all_correct': 0.0, 'temporal_rewards': 0.625, 'reward': 0.0, 'reward_std': 0.0, 'kl': 0.0, 'epoch': 0.0}
  0%|          | 3/65768 [03:31<1288:44:45, 70.55s/it]
```

Is this a known issue, or have I misconfigured something?
