Fix high CPU usage from busy-wait in TransferEngineOperationState::wait_for_completion #1036
Conversation
Fix high CPU usage from busy-wait in TransferEngineOperationState::wait_for_completion

Replace busy-wait polling with adaptive exponential backoff using a condition variable:
- Start with a 1ms backoff, increasing to a 100ms maximum with a 1.5x multiplier
- Use cv_.wait_for() to sleep between polls, reducing CPU usage
- Wake up early on completion via notify_all() from set_result_internal()
- Maintain the existing 60-second timeout behavior
- Eliminates sustained high CPU usage during long transfers

Co-authored-by: stmatengss <11641725+stmatengss@users.noreply.github.com>
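For reference, a minimal sketch of the pattern the commit message describes. The member names cv_, mutex_, and completed_ follow the commit text, but the surrounding class is simplified and the real fields and signatures in Mooncake are omitted:

```cpp
#include <algorithm>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Simplified sketch: only the waiting logic is shown; the real class in
// Mooncake also carries transfer results and error codes.
class TransferEngineOperationState {
public:
    // Returns true once the operation completes, false on the 60s timeout.
    bool wait_for_completion() {
        using namespace std::chrono;
        constexpr auto kTimeout = seconds(60);             // existing timeout
        constexpr auto kMaxBackoff = microseconds(100000); // 100ms cap
        auto backoff = microseconds(1000);                 // 1ms initial backoff

        const auto deadline = steady_clock::now() + kTimeout;
        std::unique_lock<std::mutex> lock(mutex_);
        while (!completed_) {
            if (steady_clock::now() >= deadline) return false;
            // Sleep between polls; notify_all() from set_result_internal()
            // wakes the waiter early on completion.
            cv_.wait_for(lock, backoff);
            backoff = std::min(kMaxBackoff, backoff * 3 / 2);  // 1.5x growth
        }
        return true;
    }

    void set_result_internal(/* result fields omitted */) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            completed_ = true;
        }
        cv_.notify_all();  // early wake-up for any waiter
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool completed_ = false;
};
```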
Would milliseconds be too coarse for RDMA transfers? In my tests with a plain sleep, 1us and 10us intervals showed little latency difference from concurrency 1 to 64 for a transfer with value_size=1MB, while CPU usage dropped significantly; with 100us, however, latency increased significantly under low concurrency (such as 1). In my test environment, the preferred RDMA NICs are 2*400Gbps (MC_MS_AUTO_DISC=1). Alternatively, could initialBackoff and MaxBackoff be made configurable through environment variables?
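As a rough illustration of the env-var suggestion, assuming names chosen here for the example only (MC_STORE_WAIT_INITIAL_BACKOFF_US and MC_STORE_WAIT_MAX_BACKOFF_US are hypothetical, not existing Mooncake settings):

```cpp
#include <chrono>
#include <cstdlib>
#include <exception>
#include <string>

// Hypothetical helper: read a microsecond value from the environment,
// falling back to a default when unset or unparsable.
static std::chrono::microseconds backoff_from_env(const char* name,
                                                  long default_us) {
    if (const char* value = std::getenv(name)) {
        try {
            return std::chrono::microseconds(std::stol(value));
        } catch (const std::exception&) {
            // fall through to the default on bad input
        }
    }
    return std::chrono::microseconds(default_us);
}

// Example defaults of 10us initial / 100us max, in the range discussed above.
const auto kInitialBackoff =
    backoff_from_env("MC_STORE_WAIT_INITIAL_BACKOFF_US", 10);
const auto kMaxBackoff =
    backoff_from_env("MC_STORE_WAIT_MAX_BACKOFF_US", 100);
```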
/gemini review it
Sure, microseconds should be better. QQ: for 2*400Gbps environments, are those NVIDIA ConnectX NICs or another type?
Four NVIDIA ConnectX NICs: two preferred and the other two as backups (with MC_MS_AUTO_DISC=1). I simply compared the latency and CPU usage under different sleep intervals (without any backoff mechanism). My test environment:
[Benchmark screenshots for concurrency 1, 16, and 64 omitted]
'Average Used Cores' is calculated from [formula omitted]. Overall, when the CPU and RDMA network bandwidth are not heavily loaded, busy-polling indeed provides lower latency. Of course, a more elegant solution would be event-driven completion: when a batch or task finishes, the waiter is notified via a condition variable (see the sketch below). I noticed the V3 roadmap ([Draft] Mooncake Store V3 Roadmap). Perhaps this topic could also be considered as part of that plan?
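Event-driven completion could look roughly like the following; the class name BatchCompletionEvent and its methods are illustrative, not Mooncake API. The waiter blocks with no polling interval at all and is woken only by the completion path:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative sketch of event-driven completion: no backoff or polling.
// The waiter blocks until notified or until the overall deadline passes.
class BatchCompletionEvent {
public:
    // Called from the completion path, e.g. when the transfer engine
    // reports that a batch or task has finished.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
    }

    // Returns true if signaled before the deadline; the predicate form of
    // wait_for() also handles spurious wakeups.
    bool wait(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return done_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
};
```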
That's a good idea; I will add it to the roadmap. If you're interested, how about taking on this task?
Sure, I’d be happy to give it a try.
TransferEngineOperationState::wait_for_completion() performs tight busy-wait polling without sleeping, pegging CPU cores during long transfers under RDMA latency or bandwidth saturation.

Changes

- Replace the busy-wait with cv_.wait_for() using exponential backoff: 1ms → 100ms (1.5x multiplier)
- Add cv_.notify_all() in set_result_internal() for early wake-up