Conversation

@YangZhou1997
Member

Description

Please include a summary of the changes and the related issue.

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@YangZhou1997
Member Author

YangZhou1997 commented Dec 21, 2025

To build:

# Ignore the libuccl_p2p.so install error (caused by permission issues)
make -f Makefile.rocm -j USE_TCP=1 install

To test:

# On node 0 (rank 0):
UCCL_P2P_LOG_LEVEL=INFO UCCL_P2P_TCP_THREADS=20 UCCL_P2P_TCP_IFNAME=enp49s0f0np0,enp49s0f1np1 \
torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0 --master_addr=10.162.224.129 \
benchmarks/benchmark_uccl_readwrite.py --mode=write --sizes=1048576

# On node 1 (rank 1):
UCCL_P2P_LOG_LEVEL=INFO UCCL_P2P_TCP_THREADS=20 UCCL_P2P_TCP_IFNAME=enp49s0f0np0,enp49s0f1np1 \
torchrun --nnodes=2 --nproc_per_node=1 --node-rank=1 --master_addr=10.162.224.129 \
benchmarks/benchmark_uccl_readwrite.py --mode=write --sizes=1048576

@praveingk
Collaborator

Nice 👍🏽 This is great.

@praveingk
Collaborator

@YangZhou1997 TCP/EFA support will need integration into uccl_engine.cc as well, similar to TCP-X.

@YangZhou1997
Member Author

So far we keep the engine.h interface unchanged, so uccl_engine.cc should hopefully remain compatible.

@YangZhou1997
Member Author

YangZhou1997 commented Jan 5, 2026

NCCL performance over TCP is actually extremely high:

NCCL_IB_DISABLE=1 NCCL_SOCKET_IFNAME=enp49s0f0np0,enp49s0f1np1 \
NCCL_NCHANNELS_PER_NET_PEER=4 NCCL_DEBUG=INFO \
torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0 \
--master_addr=10.162.224.129 benchmarks/benchmark_nccl.py

[Client]     256 B :    0.05 Gbps |    0.01 GB/s
[Client]    1.0 KB :    0.23 Gbps |    0.03 GB/s
[Client]    4.0 KB :    0.98 Gbps |    0.12 GB/s
[Client]   16.0 KB :    4.04 Gbps |    0.51 GB/s
[Client]   64.0 KB :   13.63 Gbps |    1.70 GB/s
[Client]  256.0 KB :   31.55 Gbps |    3.94 GB/s
[Client]    1.0 MB :   51.77 Gbps |    6.47 GB/s
[Client]   10.0 MB :   49.13 Gbps |    6.14 GB/s
[Client]   16.0 MB :   40.15 Gbps |    5.02 GB/s
[Client]  100.0 MB :   40.76 Gbps |    5.10 GB/s

@YangZhou1997
Member Author

YangZhou1997 commented Jan 9, 2026

I have been playing with TCP support over the break, and my take now is that it is going to be hard to beat NCCL, for two reasons:

  1. NCCL appears to support multi-NIC on Ethernet: see NVIDIA/nccl#601 ("Do NCCL support multi-NIC on ethernet?") and my test results above.
  2. NCCL implements semi-persistent GPU kernels to pipeline the GPU<->CPU data copy with the CPU<->network data transfer, which is already optimal; a minimal version of this pattern is sketched below.

Because of this, I think we should support TCP in a similar way to TCPX, by just layering on top of NCCL. That may need help from @DanielDanyang @derekwin
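
For reference, here is a minimal sketch of the pipelining pattern from point 2, written with plain CUDA streams and cudaMemcpyAsync rather than NCCL's persistent kernels; send_to_peer() is a hypothetical placeholder for the socket send, not an existing UCCL or NCCL function.

#include <cuda_runtime.h>

#include <algorithm>
#include <cstddef>

void send_to_peer(const void* buf, size_t len);  // hypothetical socket send

// Double-buffered pipeline: while chunk c is being sent over the network,
// the device-to-host copy of chunk c+1 is already in flight on the other stream.
void pipelined_send(const char* d_src, size_t total, size_t chunk) {
  char* h_buf[2];
  cudaStream_t stream[2];
  cudaEvent_t done[2];
  for (int i = 0; i < 2; ++i) {
    cudaMallocHost((void**)&h_buf[i], chunk);  // pinned staging buffer
    cudaStreamCreate(&stream[i]);
    cudaEventCreate(&done[i]);
  }
  size_t nchunks = (total + chunk - 1) / chunk;
  auto chunk_len = [&](size_t c) { return std::min(chunk, total - c * chunk); };
  auto issue_copy = [&](size_t c) {
    int s = c & 1;
    cudaMemcpyAsync(h_buf[s], d_src + c * chunk, chunk_len(c),
                    cudaMemcpyDeviceToHost, stream[s]);
    cudaEventRecord(done[s], stream[s]);
  };
  if (nchunks > 0) issue_copy(0);            // prime the pipeline
  for (size_t c = 0; c < nchunks; ++c) {
    if (c + 1 < nchunks) issue_copy(c + 1);  // overlaps with the send below
    cudaEventSynchronize(done[c & 1]);       // wait for chunk c to reach host memory
    send_to_peer(h_buf[c & 1], chunk_len(c));
  }
  for (int i = 0; i < 2; ++i) {
    cudaFreeHost(h_buf[i]);
    cudaStreamDestroy(stream[i]);
    cudaEventDestroy(done[i]);
  }
}

With two pinned staging buffers, the D2H copy of chunk c+1 overlaps with the network send of chunk c, which is the overlap NCCL achieves (more efficiently) inside its kernels.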

Also cc @praveingk @zhongjiechen @MaoZiming

@praveingk
Collaborator

@YangZhou1997 Broadly I agree.
A few questions:

  1. Do we plan to support multi-NIC in the future? It could be useful, and could be done at a higher level in engine.cc (by maintaining multiple endpoints and choosing one via round-robin, etc.); see the sketch after this list.
  2. https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html has similar observations w.r.t. GPU-CPU transfers. We also compared this offloading connector to UCX's GPU<->CPU transfer and found that UCX is 4-5x slower (we have yet to dig deeper into the reason).
    @YangZhou1997 Are we using cudaMemcpyAsync? The vLLM offloading connector observes that it performs better than a custom kernel for larger blocks.
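
On question 1, a minimal sketch of what endpoint selection could look like at the engine.cc level; the class name and the Endpoint type here are hypothetical, not the existing UCCL API:

#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

struct Endpoint;  // stands in for a per-NIC transport endpoint

// Keeps one endpoint per NIC and hands them out round-robin per transfer.
class MultiNicSelector {
 public:
  explicit MultiNicSelector(std::vector<Endpoint*> endpoints)
      : endpoints_(std::move(endpoints)) {}

  Endpoint* next() {
    size_t i = next_idx_.fetch_add(1, std::memory_order_relaxed);
    return endpoints_[i % endpoints_.size()];
  }

 private:
  std::vector<Endpoint*> endpoints_;
  std::atomic<size_t> next_idx_{0};
};

Pure round-robin is just the simplest policy; other policies (e.g. weighted by message size or per-endpoint queue depth) could be plugged in behind the same next() call.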
