GRPO multi-GPU reproduction issue #9

Hello! Thanks a lot for your awesome work.

I'm trying to reproduce the GRPO results using a multi-GPU setup. I did not change any of your code and just ran the training using the provided notebook. However, the model doesn't seem to learn anything: the reward stays near zero and never improves.

Here is the wandb run: https://wandb.ai/tarasovd/GRPO-Qwen-1.5-Instruct-Multi-GPU/runs/ukdas6bn

Do you have any idea why this might happen? Were there any changes to the code after the run you reported (sorry, it's hard to track changes in notebooks with git diff)?

My best guess is that library versions might be the issue, and I couldn't find them in your repository. Could you please share them or say where they can be found?
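
In case it helps to compare environments, something like the following could dump the installed versions on each side. This is only a sketch: the package list is my guess at what the notebook depends on, not something taken from the repository, so it would need adjusting to the actual imports.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical package list (an assumption, not taken from the repository);
# replace it with whatever the notebook actually imports.
PACKAGES = ["torch", "transformers", "accelerate", "trl", "peft", "wandb"]


def collect_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    # Print one line per package so two environments can be diffed easily.
    for name, ver in collect_versions(PACKAGES).items():
        print(f"{name}: {ver}")
```

Running this in both environments and diffing the output should show whether a version mismatch is a plausible cause.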


