
How do you handle timeout for broadcasting? #48

Open
zfj1998 opened this issue Apr 27, 2025 · 1 comment

Comments

@zfj1998

zfj1998 commented Apr 27, 2025

I noticed that you use broadcast here to synchronize data across the ranks of a group, but I have a few questions:

  1. src=0 is passed directly here; how do you guarantee that rank 0 of group 1 does not mistakenly broadcast to the ranks of other groups? (A sketch of the group-scoped semantics follows the quoted line below.)
  2. How do you avoid the broadcast timing out when a tool call takes too long?
  3. Why not implement the tool-call logic in ray_trainer instead? That way no synchronization across vLLM's model-parallel ranks would be needed.

broadcast_data = vllm_ps._TP.broadcast_object(broadcast_data, src=0)
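
For context on point 1, here is a minimal, illustrative sketch (not the repository's code) of why a group-scoped collective cannot leak across groups: the broadcast is issued on a process group that contains only the ranks of one TP group, so only those ranks participate. The group layout and payload below are assumptions.

```python
import torch.distributed as dist

# Assumes dist.init_process_group() has already been called on 4 ranks.
# Illustrative layout: two TP groups, [0, 1] and [2, 3]. new_group() must
# be called by every rank for every group; each rank keeps a handle only
# to the group it belongs to.
tp_group = None
for ranks in ([0, 1], [2, 3]):
    group = dist.new_group(ranks=ranks)
    if dist.get_rank() in ranks:
        tp_group = group

# The source is the first global rank of *our own* group, and the
# collective runs on `tp_group`, so ranks [0, 1] can never broadcast
# into [2, 3] even though both groups use their local "rank 0" as source.
group_src = dist.get_process_group_ranks(tp_group)[0]
payload = [{"tool_result": "..."} if dist.get_rank() == group_src else None]
dist.broadcast_object_list(payload, src=group_src, group=tp_group)
broadcast_data = payload[0]
```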

@AnselCmy
Collaborator

  1. vllm_ps._TP.broadcast_object guarantees that the broadcast only reaches a single TP group.
  2. This line does the timeout handling (a sketch of one way to bound the tool call is given below).
  3. That is indeed another possible approach. Do you have a concrete implementation in mind? If so, feel free to share it here for discussion.
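
To make point 2 concrete, below is a minimal sketch (an assumption on my part, not the repository's actual timeout code) of bounding the tool call on the source rank so that it always reaches the broadcast well before the process-group timeout expires on the waiting ranks; `run_tool_call` and the 30-second budget are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

TOOL_CALL_TIMEOUT_S = 30  # illustrative budget, well below the NCCL/Gloo group timeout


def run_tool_call(request):
    """Hypothetical stand-in for the actual tool execution."""
    raise NotImplementedError


def tool_call_with_timeout(request):
    # Run the tool call in a worker thread and wait at most
    # TOOL_CALL_TIMEOUT_S, so the source rank never stalls the broadcast
    # long enough for the other ranks in the TP group to time out.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_tool_call, request)
    try:
        return {"ok": True, "result": future.result(timeout=TOOL_CALL_TIMEOUT_S)}
    except FuturesTimeout:
        # Fall back to an error payload; the runaway call keeps running in
        # the abandoned worker thread, but we do not wait for it.
        return {"ok": False, "error": "tool call timed out"}
    finally:
        pool.shutdown(wait=False)


# On the TP-group source rank, the bounded result is what gets broadcast:
# broadcast_data = tool_call_with_timeout(request)
# broadcast_data = vllm_ps._TP.broadcast_object(broadcast_data, src=0)
```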
