Fix high CPU usage from busy-wait in TransferEngineOperationState::wait_for_completion #1036

Copilot · 2025-11-09T08:03:01Z

TransferEngineOperationState::wait_for_completion() performs tight busy-wait polling without sleep, pegging CPU cores during long transfers under RDMA latency or bandwidth saturation.

Changes

Replace busy-wait with adaptive exponential backoff
- Use cv_.wait_for() with exponential backoff: 1ms → 100ms (1.5x multiplier)
- Leverages existing cv_.notify_all() in set_result_internal() for early wake-up
- Maintains 60s timeout behavior

// Before: tight loop spinning ~4M times per 100ms
while (true) {
    std::unique_lock<std::mutex> lock(mutex_);
    check_task_status();
    if (result_.has_value()) break;
    // No sleep - continuous polling
}

// After: exponential backoff with condition variable
auto current_backoff = std::chrono::milliseconds(1);
while (true) {
    {
        std::unique_lock<std::mutex> lock(mutex_);
        check_task_status();
        if (result_.has_value()) return;
        
        if (cv_.wait_for(lock, current_backoff, 
                         [this] { return result_.has_value(); })) {
            return;
        }
    }
    current_backoff = std::min(
        std::ceil(current_backoff.count() * 1.5), 100ms);
}

Impact

457,416× reduction in polling operations (4.1M → 9 polls per 100ms)
Eliminates CPU contention for throughput-first workloads
Fast initial response (1ms) with graceful backoff for long operations

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

esm.ubuntu.com
- Triggering command: /usr/lib/apt/methods/https (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Original prompt

This section details on the original issue you should resolve

<issue_title>[Performance]: High CPU usage due to busy-wait in TransferEngineOperationState::wait_for_completion</issue_title>
<issue_description>### Describe your performance question

Describe

TransferEngineOperationState::wait_for_completion() performs busy-waiting: it loops indefinitely and repeatedly calls check_task_status() without any wait/yield/backoff. Under high RDMA latency or bandwidth saturation, this can peg a CPU core and hurt overall throughput when CPU is constrained.

Impact

Sustained high CPU usage by the waiting thread(s) during large/long transfers.

Lower system throughput under CPU contention;

Real-world scenario: Offline inference, throughput-first

Network: Large data streaming easily saturates the RDMA NIC; under saturation or transient congestion, transfer completion latency increases, so the current tight polling keeps spinning for long periods.

CPU contention: Each waiting thread can peg a CPU core. This competes with CPU-heavy preprocessing stages (tokenization, chunking/sharding, mmap reads, decoding).

Resource efficiency: Spinning wastes CPU cycles that could be used for data preparation, further affecting subsequent GPU inference and leading to throughput degradation.

Before submitting a new issue...

Make sure you already searched for relevant issues and read the documentation</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes [Performance]: High CPU usage due to busy-wait in TransferEngineOperationState::wait_for_completion #1033

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

…completion Replace busy-wait polling with adaptive exponential backoff using condition variable: - Start with 1ms backoff, increase to max 100ms with 1.5x multiplier - Use cv_.wait_for() to sleep between polls, reducing CPU usage - Early wake-up on completion via notify_all() from set_result_internal() - Maintain existing 60-second timeout behavior - Fix eliminates sustained high CPU usage during long transfers Co-authored-by: stmatengss <11641725+stmatengss@users.noreply.github.com>

Co-authored-by: stmatengss <11641725+stmatengss@users.noreply.github.com>

wwq2333 · 2025-11-10T03:09:15Z

    constexpr auto kInitialBackoffMs = std::chrono::milliseconds(1);
    constexpr auto kMaxBackoffMs = std::chrono::milliseconds(100);

Would using milliseconds be too high for RDMA transmission?

In my tests, with direct sleep, 1us and 10us, for a transmission with value_size=1MB, the difference in concurrency from 1 to 64 is not very noticeable, but CPU usage drops significantly; however, using 100us, latency increases significantly under low concurrency (such as 1) .

In my test environment, the preferred RDMA network card is 2*400Gbps (MC_MS_AUTO_DISC=1).

Or could initialBackoff and MaxBackoff be made readable through environment variables?

stmatengss · 2025-11-10T03:22:27Z

/gemini review it

stmatengss · 2025-11-10T05:43:06Z

    constexpr auto kInitialBackoffMs = std::chrono::milliseconds(1);
    constexpr auto kMaxBackoffMs = std::chrono::milliseconds(100);
Would using milliseconds be too high for RDMA transmission?

In my tests, with direct sleep, 1us and 10us, for a transmission with value_size=1MB, the difference in concurrency from 1 to 64 is not very noticeable, but CPU usage drops significantly; however, using 100us, latency increases significantly under low concurrency (such as 1) .

In my test environment, the preferred RDMA network card is 2*400Gbps (MC_MS_AUTO_DISC=1).

Or could initialBackoff and MaxBackoff be made readable through environment variables?

Sure. microseconds should be better. QQ: for 2*400Gbps environements, there are NV connectX NIC or other types?

wwq2333 · 2025-11-10T08:25:51Z

Sure. microseconds should be better. QQ: for 2*400Gbps environements, there are NV connectX NIC or other types?

four NV connectX NICs, two as preferred and the other two as backups (with MC_MS_AUTO_DISC=1)

I simply compared the latency and CPU usage under different sleep intervals (without any backoff mechanism).

My test environment:

Client pod: 32 cores, 128 GB RAM, 4 RDMA NICs; running only the mooncake-store client with multiple threads performing get operations on 1 MB values.
Worker pod: 32 cores, 128 GB RAM, 4 RDMA NICs; deployed with both master and worker components (a client that only performing sleep operations).

concurrency:1

Sleep Interval	Average Used Cores	Throughput (GB/s)	Average Latency (µs)
0 ns	4.69	8.55	113.65
100 ns	4.32	7.08	137.49
1,000 ns	4.29	6.54	148.87
10,000 ns	4.27	6.48	150.21

concurrency:16

Sleep Interval	Average Used Cores	Throughput (GB/s)	Average Latency (µs)
0 ns	10.23	70.08	222.10
100 ns	6.35	61.23	253.34
1,000 ns	6.13	60.49	256.74
10,000 ns	6.05	58.00	268.00

concurrency:64

Sleep Interval	Average Used Cores	Throughput (GB/s)	Average Latency (µs)
0 ns	25.02	45.43	1374.30
100 ns	8.04	84.62	737.03
1,000 ns	7.97	86.54	720.73
10,000 ns	7.90	86.66	719.69

Note: Under 64‑concurrency testing, the results for 1,000 ns and 10,000 ns sleep intervals are practically the same.
Each run shows slight variance, with either one occasionally performing slightly better.

'Average Used Cores' is calculated from utime and stime in /proc/self/stat, (user + system CPU time difference) / wall clock time.
Peak CPU usage observed in top is slightly higher than these averages; for the 64‑concurrency case without sleep, the peak usage reached over 30 cores.

Overall, when the CPU and RDMA network bandwidth are not heavily loaded, busy‑polling indeed provides lower latency.
So it seems better to keep busy‑polling as the default behavior, which keeps the user experience the same as before.
Meanwhile, the configuration could be made adjustable through environment variables, allowing users to choose a short sleep approach in certain scenarios.

Of course, a more elegant solution would be event‑driven completion: when a batch or task finishes, the waiter is notified via a condition variable.
That said, having the transfer engine call into the store directly feels a bit awkward and introduces unnecessary coupling between layers.

I noticed the V3 roadmap ([Draft] Mooncake Store V3 Roadmap). Perhaps this topic could also be considered as part of that plan?

stmatengss · 2025-11-10T09:04:59Z

Sure. microseconds should be better. QQ: for 2*400Gbps environements, there are NV connectX NIC or other types?

four NV connectX NICs, two as preferred and the other two as backups (with MC_MS_AUTO_DISC=1)

I simply compared the latency and CPU usage under different sleep intervals (without any backoff mechanism).

My test environment:

Client pod: 32 cores, 128 GB RAM, 4 RDMA NICs; running only the mooncake-store client with multiple threads performing get operations on 1 MB values.

Worker pod: 32 cores, 128 GB RAM, 4 RDMA NICs; deployed with both master and worker components (a client that only performing sleep operations).

concurrency:1

Sleep Interval Average Used Cores Throughput (GB/s) Average Latency (µs)
0 ns 4.69 8.55 113.65
100 ns 4.32 7.08 137.49
1,000 ns 4.29 6.54 148.87
10,000 ns 4.27 6.48 150.21
concurrency:16

Sleep Interval Average Used Cores Throughput (GB/s) Average Latency (µs)
0 ns 10.23 70.08 222.10
100 ns 6.35 61.23 253.34
1,000 ns 6.13 60.49 256.74
10,000 ns 6.05 58.00 268.00
concurrency:64

Sleep Interval Average Used Cores Throughput (GB/s) Average Latency (µs)
0 ns 25.02 45.43 1374.30
100 ns 8.04 84.62 737.03
1,000 ns 7.97 86.54 720.73
10,000 ns 7.90 86.66 719.69

Note: Under 64‑concurrency testing, the results for 1,000 ns and 10,000 ns sleep intervals are practically the same.
Each run shows slight variance, with either one occasionally performing slightly better.

'Average Used Cores' is calculated from utime and stime in /proc/self/stat, (user + system CPU time difference) / wall clock time. Peak CPU usage observed in top is slightly higher than these averages; for the 64‑concurrency case without sleep, the peak usage reached over 30 cores.

Overall, when the CPU and RDMA network bandwidth are not heavily loaded, busy‑polling indeed provides lower latency. So it seems better to keep busy‑polling as the default behavior, which keeps the user experience the same as before. Meanwhile, the configuration could be made adjustable through environment variables, allowing users to choose a short sleep approach in certain scenarios.

Of course, a more elegant solution would be event‑driven completion: when a batch or task finishes, the waiter is notified via a condition variable. That said, having the transfer engine call into the store directly feels a bit awkward and introduces unnecessary coupling between layers.

I noticed the V3 roadmap ([Draft] Mooncake Store V3 Roadmap). Perhaps this topic could also be considered as part of that plan?

That's a good idea, I will add it to the Roadmap. If you have interest, how about taking this task?

wwq2333 · 2025-11-11T01:30:48Z

That's a good idea, I will add it to the Roadmap. If you have interest, how about taking this task?

Sure, I’d be happy to give it a try.

Initial plan

d554a0d

Copilot AI assigned Copilot and stmatengss Nov 9, 2025

Copilot started work on behalf of stmatengss November 9, 2025 08:03 View session

Copilot AI and others added 3 commits November 9, 2025 08:12

Add demonstration validating the fix reduces CPU usage

9932749

Co-authored-by: stmatengss <11641725+stmatengss@users.noreply.github.com>

Remove CodeQL artifact and add to gitignore

266d696

Co-authored-by: stmatengss <11641725+stmatengss@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix high CPU usage due to busy-wait in TransferEngineOperationState~~ Fix high CPU usage from busy-wait in TransferEngineOperationState::wait_for_completion Nov 9, 2025

Copilot AI requested a review from stmatengss November 9, 2025 08:17

Copilot finished work on behalf of stmatengss November 9, 2025 08:17

stmatengss marked this pull request as ready for review November 10, 2025 02:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix high CPU usage from busy-wait in TransferEngineOperationState::wait_for_completion #1036

Fix high CPU usage from busy-wait in TransferEngineOperationState::wait_for_completion #1036

Uh oh!

Copilot AI commented Nov 9, 2025 •

edited

Loading

Uh oh!

wwq2333 commented Nov 10, 2025

Uh oh!

stmatengss commented Nov 10, 2025 •

edited

Loading

Uh oh!

stmatengss commented Nov 10, 2025

Uh oh!

wwq2333 commented Nov 10, 2025 •

edited

Loading

Uh oh!

stmatengss commented Nov 10, 2025

Uh oh!

wwq2333 commented Nov 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix high CPU usage from busy-wait in TransferEngineOperationState::wait_for_completion #1036

Are you sure you want to change the base?

Fix high CPU usage from busy-wait in TransferEngineOperationState::wait_for_completion #1036

Uh oh!

Conversation

Copilot AI commented Nov 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Impact

I tried to connect to the following addresses, but was blocked by firewall rules:

Describe

Impact

Real-world scenario: Offline inference, throughput-first

Before submitting a new issue...

Comments on the Issue (you are @copilot in this section)

Uh oh!

wwq2333 commented Nov 10, 2025

Uh oh!

stmatengss commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stmatengss commented Nov 10, 2025

Uh oh!

wwq2333 commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stmatengss commented Nov 10, 2025

Uh oh!

wwq2333 commented Nov 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Nov 9, 2025 •

edited

Loading

stmatengss commented Nov 10, 2025 •

edited

Loading

wwq2333 commented Nov 10, 2025 •

edited

Loading