
T-GRPO training from official SFT checkpoint: loss always 0, reward always 0, all_wrong=1.0 #101

@Quibbler6

Description

Hi, I started a T-GRPO training run from the official SFT checkpoint and ran into this: after a few steps, the loss is always 0, the reward is always 0, and all_wrong=1.0.
I'm wondering what the problem is. I'm using the dataset you provided,
and because flash-attention doesn't support torch.float32, I added this line:
`model_init_kwargs.setdefault("dtype", torch.bfloat16)`
Besides this, there are no major changes.
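
For context, here is a minimal sketch of roughly where that line sits in my setup. This is only an illustration, assuming `model_init_kwargs` is the kwargs dict that eventually gets forwarded to `from_pretrained()` (as in TRL-style trainers); the model name and surrounding code are placeholders, not the repo's actual code:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Assumed: the trainer builds this dict and forwards it to from_pretrained().
model_init_kwargs = {"attn_implementation": "flash_attention_2"}

# flash-attention kernels don't support torch.float32, so default to bfloat16
# unless a dtype was already specified.
# Note: older transformers versions expect the key "torch_dtype" instead of "dtype".
model_init_kwargs.setdefault("dtype", torch.bfloat16)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder checkpoint path
    **model_init_kwargs,
)
```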

```
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
Caching is incompatible with gradient checkpointing in Qwen2_5_VLDecoderLayer. Setting past_key_values=None.
UserWarning: None of the inputs have requires_grad=True. Gradients will be None
_problem_id: 73498
prompt_length: 227
Invalidate trace cache @ step 0 and module 3048: cache has only 0 modules
tensor([0., 0., 0., 0.], device='cuda:2')
tensor([222, 768, 768, 768], device='cuda:2')
tensor([0., 0., 0., 0.], device='cuda:0')
tensor([222, 768, 768, 768], device='cuda:0')
tensor([0., 0., 0., 0.], device='cuda:1')
tensor([222, 768, 768, 768], device='cuda:1')
tensor([0., 0., 0., 0.], device='cuda:3')
tensor([768, 768, 250, 332], device='cuda:3')
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 9.999999977182372e-07, 'completion_length': 606.0, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.0, 'all_wrong': 1.0, 'all_correct': 0.0, 'temporal_rewards': 0.625, 'reward': 0.0, 'reward_std': 0.0, 'kl': 0.0, 'epoch': 0.0}
  0%|          | 3/65768 [03:31<1288:44:45, 70.55s/it]
```

Is this a known issue, or have I misconfigured something?
