Describe your performance question
Description
TransferEngineOperationState::wait_for_completion() busy-waits: it loops until the transfer finishes, calling check_task_status() on every iteration with no wait, yield, or backoff. Under high RDMA latency or bandwidth saturation, this can peg a CPU core and reduce overall throughput when CPU is constrained.
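One possible mitigation is a staged wait: spin briefly for low latency, then yield, then sleep with exponential backoff. The sketch below is only an illustration of that idea, not the project's code. `TaskStatus`, `BackoffPolicy`, and `wait_with_backoff` are hypothetical names, the `poll` callable stands in for check_task_status() (whose real signature is not shown here), and the iteration counts and sleep bounds are arbitrary placeholders that would need tuning.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>

// Hypothetical status type; the real check_task_status() may report differently.
enum class TaskStatus { Pending, Done, Failed };

// Hypothetical tuning knobs; all defaults are placeholders, not measured values.
struct BackoffPolicy {
    std::size_t spin_iterations = 1000;   // polls with no pause at all
    std::size_t yield_iterations = 100;   // polls followed by a yield
    std::chrono::microseconds initial_sleep{10};
    std::chrono::microseconds max_sleep{1000};
};

// Polls until the task leaves the Pending state.
// Phase 1: tight spin. Phase 2: yield after each poll.
// Phase 3: sleep after each poll, doubling the sleep up to max_sleep.
// If polls_out is non-null, it receives the total number of polls made.
inline TaskStatus wait_with_backoff(const std::function<TaskStatus()>& poll,
                                    const BackoffPolicy& policy = BackoffPolicy{},
                                    std::size_t* polls_out = nullptr) {
    assert(poll && "poll callable must not be empty");
    std::size_t polls = 0;
    auto sleep = policy.initial_sleep;
    for (;;) {
        const TaskStatus status = poll();
        ++polls;
        if (status != TaskStatus::Pending) {
            if (polls_out != nullptr) *polls_out = polls;
            return status;
        }
        if (polls <= policy.spin_iterations) {
            continue;  // stay hot while completion is likely imminent
        }
        if (polls <= policy.spin_iterations + policy.yield_iterations) {
            std::this_thread::yield();  // let other runnable threads in
            continue;
        }
        std::this_thread::sleep_for(sleep);  // release the core entirely
        sleep = std::min(sleep * 2, policy.max_sleep);
    }
}
```

The trade-off is added completion latency of up to `max_sleep` for long transfers, in exchange for freeing the core once a transfer has clearly outlasted the spin phase. A timeout or cancellation check could be added to the loop in the same way.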
Impact
- Sustained high CPU usage by the waiting thread(s) during large/long transfers.
- Lower system throughput under CPU contention.
Real-world scenario: Offline inference, throughput-first
Network: Large data streaming easily saturates the RDMA NIC. Under saturation or transient congestion, transfer completion latency increases, so the current tight polling loop spins for long periods.
CPU contention: Each waiting thread can peg a CPU core. This competes with CPU-heavy preprocessing stages (tokenization, chunking/sharding, mmap reads, decoding).
Resource efficiency: Spinning wastes CPU cycles that could be used for data preparation, which in turn delays the downstream GPU inference and degrades throughput.