Hi, I started a T-GRPO training run and ran into the issue below: after several rounds, the loss is always 0 and all_wrong=1.0. I wonder what the problem is. I am using the dataset you provided, and because flash-attention does not support torch.float32, I added this:

```python
model_init_kwargs.setdefault("dtype", torch.bfloat16)
```

Besides this one change, I made no major modifications.
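For reference, the flash-attention kernels only support fp16/bf16 inputs, so casting to bfloat16 is the usual workaround. A minimal sketch of what the equivalent explicit model load would look like; the model id and keyword arguments here are illustrative assumptions, not this repo's actual code:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Sketch only: load the policy model in bfloat16 so flash-attention can be used.
# "Qwen/Qwen2.5-VL-7B-Instruct" is an assumed model id for illustration.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,             # flash-attention requires fp16/bf16
    attn_implementation="flash_attention_2",
)
```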
The relevant log output:

```
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
Caching is incompatible with gradient checkpointing in Qwen2_5_VLDecoderLayer. Setting past_key_values=None.
UserWarning: None of the inputs have requires_grad=True. Gradients will be None
_problem_id: 73498
prompt_length: 227
Invalidate trace cache @ step 0 and module 3048: cache has only 0 modules
tensor([0., 0., 0., 0.], device='cuda:2')
tensor([0., 0., 0., 0.], device='cuda:0')
tensor([222, 768, 768, 768], device='cuda:2')
tensor([222, 768, 768, 768], device='cuda:0')
tensor([0., 0., 0., 0.], device='cuda:1')
tensor([222, 768, 768, 768], device='cuda:1')
tensor([0., 0., 0., 0.], device='cuda:3')
tensor([768, 768, 250, 332], device='cuda:3')
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 9.999999977182372e-07, 'completion_length': 606.0, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.0, 'all_wrong': 1.0, 'all_correct': 0.0, 'temporal_rewards': 0.625, 'reward': 0.0, 'reward_std': 0.0, 'kl': 0.0, 'epoch': 0.0}
  0%|          | 3/65768 [03:31<1288:44:45, 70.55s/it]
```
Is
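One observation, as a guess rather than a confirmed diagnosis: reward_std is 0.0 because every completion in a group receives the same reward (all wrong, so reward 0), and GRPO normalizes rewards within each group, so the advantages and therefore the policy loss come out exactly zero. A minimal sketch of that group normalization (plain PyTorch, not the trainer's actual code):

```python
import torch

# Group-relative advantage as in GRPO: center each group's rewards on the
# group mean and scale by the group std. If all rewards in the group are
# identical (here: every completion wrong -> reward 0), the advantages are
# zero and the policy-gradient loss collapses to 0.
rewards = torch.tensor([0., 0., 0., 0.])  # matches the logged per-GPU rewards
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
print(advantages)  # tensor([0., 0., 0., 0.]) -> loss 0.0, grad_norm 0.0
```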