
How do you handle timeout for broadcasting? #48

Open
zfj1998 opened this issue Apr 27, 2025 · 1 comment

Comments

@zfj1998

zfj1998 commented Apr 27, 2025

I noticed that you use broadcast here to synchronize data across the ranks of a group, but I have a few questions:

  1. src=0 is passed directly here; how do you guarantee that rank 0 of group 1 does not mistakenly broadcast to the ranks of other groups? (A sketch of the group-scoped semantics follows the quoted line below.)
  2. How do you avoid the broadcast timing out when a tool call takes too long?
  3. Why not implement the tool-call logic in ray_trainer instead? That way no synchronization across vLLM's model-parallel ranks would be needed.

broadcast_data = vllm_ps._TP.broadcast_object(broadcast_data, src=0)
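
For context on point 1, here is a minimal, illustrative sketch (not the repository's code) of why a group-scoped collective cannot leak across groups: the broadcast is issued on a process group that contains only the ranks of one TP group, so only those ranks participate. The group layout and payload below are assumptions.

```python
import torch.distributed as dist

# Assumes dist.init_process_group() has already been called on 4 ranks.
# Illustrative layout: two TP groups, [0, 1] and [2, 3]. new_group() must
# be called by every rank for every group; each rank keeps a handle only
# to the group it belongs to.
tp_group = None
for ranks in ([0, 1], [2, 3]):
    group = dist.new_group(ranks=ranks)
    if dist.get_rank() in ranks:
        tp_group = group

# The source is the first global rank of *our own* group, and the
# collective runs on `tp_group`, so ranks [0, 1] can never broadcast
# into [2, 3] even though both groups use their local "rank 0" as source.
group_src = dist.get_process_group_ranks(tp_group)[0]
payload = [{"tool_result": "..."} if dist.get_rank() == group_src else None]
dist.broadcast_object_list(payload, src=group_src, group=tp_group)
broadcast_data = payload[0]
```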

@AnselCmy
Collaborator

  1. vllm_ps._TP.broadcast_object guarantees that the broadcast only reaches a single TP group.
  2. This line does the timeout handling (a sketch of one way to bound the tool call is given below).
  3. That is indeed another possible approach. Do you have a concrete implementation in mind? If so, feel free to share it here for discussion.
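
To make point 2 concrete, below is a minimal sketch (an assumption on my part, not the repository's actual timeout code) of bounding the tool call on the source rank so that it always reaches the broadcast well before the process-group timeout expires on the waiting ranks; `run_tool_call` and the 30-second budget are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

TOOL_CALL_TIMEOUT_S = 30  # illustrative budget, well below the NCCL/Gloo group timeout


def run_tool_call(request):
    """Hypothetical stand-in for the actual tool execution."""
    raise NotImplementedError


def tool_call_with_timeout(request):
    # Run the tool call in a worker thread and wait at most
    # TOOL_CALL_TIMEOUT_S, so the source rank never stalls the broadcast
    # long enough for the other ranks in the TP group to time out.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_tool_call, request)
    try:
        return {"ok": True, "result": future.result(timeout=TOOL_CALL_TIMEOUT_S)}
    except FuturesTimeout:
        # Fall back to an error payload; the runaway call keeps running in
        # the abandoned worker thread, but we do not wait for it.
        return {"ok": False, "error": "tool call timed out"}
    finally:
        pool.shutdown(wait=False)


# On the TP-group source rank, the bounded result is what gets broadcast:
# broadcast_data = tool_call_with_timeout(request)
# broadcast_data = vllm_ps._TP.broadcast_object(broadcast_data, src=0)
```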
