Conversation

@YangZhou1997
Member

Description

Please include a summary of the changes and the related issue.

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@YangZhou1997
Member Author

YangZhou1997 commented Dec 21, 2025

To build:

# Ignore the libuccl_p2p.so install error (caused by permission issues)
make -f Makefile.rocm -j USE_TCP=1 install

To test:

# On node 0 (rank 0):
UCCL_P2P_LOG_LEVEL=INFO UCCL_P2P_TCP_THREADS=20 UCCL_P2P_TCP_IFNAME=enp49s0f0np0,enp49s0f1np1 \
torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0 --master_addr=10.162.224.129 \
benchmarks/benchmark_uccl_readwrite.py --mode=write --sizes=1048576

# On node 1 (rank 1):
UCCL_P2P_LOG_LEVEL=INFO UCCL_P2P_TCP_THREADS=20 UCCL_P2P_TCP_IFNAME=enp49s0f0np0,enp49s0f1np1 \
torchrun --nnodes=2 --nproc_per_node=1 --node-rank=1 --master_addr=10.162.224.129 \
benchmarks/benchmark_uccl_readwrite.py --mode=write --sizes=1048576

@praveingk
Collaborator

Nice 👍🏽 This is great.

@praveingk
Collaborator

@YangZhou1997 TCP/EFA support will need integration into uccl_engine.cc as well, similar to TCP-X.

@YangZhou1997
Member Author

So far we keep the engine.h interface unchanged, so uccl_engine.cc should hopefully remain compatible.

@YangZhou1997
Member Author

YangZhou1997 commented Jan 5, 2026

NCCL performance over TCP is actually extremely high:

NCCL_IB_DISABLE=1 NCCL_SOCKET_IFNAME=enp49s0f0np0,enp49s0f1np1 \
NCCL_NCHANNELS_PER_NET_PEER=4 NCCL_DEBUG=INFO \
torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0 \
--master_addr=10.162.224.129 benchmarks/benchmark_nccl.py

[Client]     256 B :    0.05 Gbps |    0.01 GB/s
[Client]    1.0 KB :    0.23 Gbps |    0.03 GB/s
[Client]    4.0 KB :    0.98 Gbps |    0.12 GB/s
[Client]   16.0 KB :    4.04 Gbps |    0.51 GB/s
[Client]   64.0 KB :   13.63 Gbps |    1.70 GB/s
[Client]  256.0 KB :   31.55 Gbps |    3.94 GB/s
[Client]    1.0 MB :   51.77 Gbps |    6.47 GB/s
[Client]   10.0 MB :   49.13 Gbps |    6.14 GB/s
[Client]   16.0 MB :   40.15 Gbps |    5.02 GB/s
[Client]  100.0 MB :   40.76 Gbps |    5.10 GB/s

@YangZhou1997
Member Author

YangZhou1997 commented Jan 9, 2026

I have been playing with TCP support over the break, and my take now is that it is going to be hard to beat NCCL, for two reasons:

  1. NCCL appears to support multi-NIC on Ethernet: see NVIDIA/nccl#601 ("Do NCCL support multi-NIC on ethernet?") and my test results above.
  2. NCCL implements semi-persistent GPU kernels to pipeline the GPU<->CPU data copy with the CPU<->network data transfer, which is already optimal; a minimal version of this pattern is sketched below.

Because of this, I think we should support TCP in a similar way to TCPX, by just layering on top of NCCL. That may need help from @DanielDanyang @derekwin
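
For reference, here is a minimal sketch of the pipelining pattern from point 2, written with plain CUDA streams and cudaMemcpyAsync rather than NCCL's persistent kernels; send_to_peer() is a hypothetical placeholder for the socket send, not an existing UCCL or NCCL function.

#include <cuda_runtime.h>

#include <algorithm>
#include <cstddef>

void send_to_peer(const void* buf, size_t len);  // hypothetical socket send

// Double-buffered pipeline: while chunk c is being sent over the network,
// the device-to-host copy of chunk c+1 is already in flight on the other stream.
void pipelined_send(const char* d_src, size_t total, size_t chunk) {
  char* h_buf[2];
  cudaStream_t stream[2];
  cudaEvent_t done[2];
  for (int i = 0; i < 2; ++i) {
    cudaMallocHost((void**)&h_buf[i], chunk);  // pinned staging buffer
    cudaStreamCreate(&stream[i]);
    cudaEventCreate(&done[i]);
  }
  size_t nchunks = (total + chunk - 1) / chunk;
  auto chunk_len = [&](size_t c) { return std::min(chunk, total - c * chunk); };
  auto issue_copy = [&](size_t c) {
    int s = c & 1;
    cudaMemcpyAsync(h_buf[s], d_src + c * chunk, chunk_len(c),
                    cudaMemcpyDeviceToHost, stream[s]);
    cudaEventRecord(done[s], stream[s]);
  };
  if (nchunks > 0) issue_copy(0);            // prime the pipeline
  for (size_t c = 0; c < nchunks; ++c) {
    if (c + 1 < nchunks) issue_copy(c + 1);  // overlaps with the send below
    cudaEventSynchronize(done[c & 1]);       // wait for chunk c to reach host memory
    send_to_peer(h_buf[c & 1], chunk_len(c));
  }
  for (int i = 0; i < 2; ++i) {
    cudaFreeHost(h_buf[i]);
    cudaStreamDestroy(stream[i]);
    cudaEventDestroy(done[i]);
  }
}

With two pinned staging buffers, the D2H copy of chunk c+1 overlaps with the network send of chunk c, which is the overlap NCCL achieves (more efficiently) inside its kernels.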

Also cc @praveingk @zhongjiechen @MaoZiming

@praveingk
Collaborator

@YangZhou1997 Broadly I agree.
A few questions:

  1. Do we plan to support multi-NIC in the future? It could be useful, and could be done at a higher level in engine.cc (by maintaining multiple endpoints and choosing one via round-robin, etc.); see the sketch after this list.
  2. https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html has similar observations w.r.t. GPU-CPU transfers. We also compared this offloading connector to UCX's GPU<->CPU transfer and found that UCX is 4-5x slower (we have yet to dig deeper into the reason).
    @YangZhou1997 Are we using cudaMemcpyAsync? The vLLM offloading connector observes that it performs better than a custom kernel for larger blocks.
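
On question 1, a minimal sketch of what endpoint selection could look like at the engine.cc level; the class name and the Endpoint type here are hypothetical, not the existing UCCL API:

#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

struct Endpoint;  // stands in for a per-NIC transport endpoint

// Keeps one endpoint per NIC and hands them out round-robin per transfer.
class MultiNicSelector {
 public:
  explicit MultiNicSelector(std::vector<Endpoint*> endpoints)
      : endpoints_(std::move(endpoints)) {}

  Endpoint* next() {
    size_t i = next_idx_.fetch_add(1, std::memory_order_relaxed);
    return endpoints_[i % endpoints_.size()];
  }

 private:
  std::vector<Endpoint*> endpoints_;
  std::atomic<size_t> next_idx_{0};
};

Pure round-robin is just the simplest policy; other policies (e.g. weighted by message size or per-endpoint queue depth) could be plugged in behind the same next() call.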
