Skip to content

[Performance]: High CPU usage due to busy-wait in TransferEngineOperationState::wait_for_completion #1033

@wwq2333

Description

@wwq2333

Describe your performance question

Describe

TransferEngineOperationState::wait_for_completion() performs busy-waiting: it loops indefinitely and repeatedly calls check_task_status() without any wait/yield/backoff. Under high RDMA latency or bandwidth saturation, this can peg a CPU core and hurt overall throughput when CPU is constrained.

Impact

  • Sustained high CPU usage by the waiting thread(s) during large/long transfers.
  • Lower system throughput under CPU contention;

Real-world scenario: Offline inference, throughput-first

Network: Large data streaming easily saturates the RDMA NIC; under saturation or transient congestion, transfer completion latency increases, so the current tight polling keeps spinning for long periods.

CPU contention: Each waiting thread can peg a CPU core. This competes with CPU-heavy preprocessing stages (tokenization, chunking/sharding, mmap reads, decoding).

Resource efficiency: Spinning wastes CPU cycles that could be used for data preparation, further affecting subsequent GPU inference and leading to throughput degradation.

Before submitting a new issue...

  • Make sure you already searched for relevant issues and read the documentation

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions