[P2P] Remove collective backend from P2P, only use P2P native RDMA implementation #624

zhongjiechen · 2026-01-07T15:57:55Z

Description

Please include a summary of the changes and the related issue.

Fixes # (issue)

Type of Change

Bug fix
New feature
Documentation update

How Has This Been Tested?

Include any tests here.

Unit tests
Integration tests
Manual testing

Checklist

My code follows the style guidelines, e.g. format.sh.
I have run build_and_install.sh to verify compilation.
I have removed redundant variables and comments.
I have updated the documentation.
I have added tests.

YangZhou1997 · 2026-01-07T18:37:22Z

Hi @zhongjie, we also have some lazy-init functionality used by uccl_engine.h (from NIXL). We would want IB and EFA to support that as well.

zhongjiechen · 2026-01-07T19:05:14Z

> Hi @zhongjie, we also have some lazy-init functionality used by uccl_engine.h (from NIXL). We would want IB and EFA to support that as well.

Yeah! I have added support for that with a minor modification, but I still need to run NIXL once to test it.

praveingk · 2026-01-08T03:45:17Z

p2p/engine.cc

      new NICEndpoint(local_gpu_idx_, INVALID_RANK_ID, 0, false));
-  numa_node_ = 0;
-#else
-  ep_ = new uccl::RDMAEndpoint(num_cpus_);


@zhongjiechen Does the new NIC endpoint support initializing without gpu_idx for RDMA? This change was made to support NIXL, where local_gpu_idx is not passed in during initialization but is discovered during memory registration, after which the NIC is initialized accordingly.

zhongjiechen · 2026-01-08

Yeah! I pass INVALID_GPU to its constructor here:

uccl/p2p/engine.cc

Lines 108 to 110 in f814432

    // Initialize the RDMA endpoint with lazy creation.
    ep_ = std::shared_ptr<NICEndpoint>(
        new NICEndpoint(INVALID_GPU, INVALID_RANK_ID, 0, false));

uccl/p2p/rdma/rdma_endpoint.h

Lines 24 to 35 in f814432

    if (gpu_index != INVALID_GPU) {
      std::vector<size_t> actual_device_ids;
      if (device_ids.size() == 0) {
        actual_device_ids =
            RdmaDeviceManager::instance().get_best_dev_idx(gpu_index);
      } else {
        actual_device_ids = device_ids;
      }
      initializeContexts(actual_device_ids);
      LOG(INFO) << "NICEndpoint initialized with " << contexts_.size()
                << " context(s) for GPU " << gpu_index;
    }

Then initialize_engine_by_dev is called here, during memory registration:

uccl/p2p/engine.cc

Line 167 in f814432

unified::initialize_engine_by_dev(ep_, local_gpu_idx_, false);

With P2P native RDMA's initialize_engine_by_dev:

uccl/p2p/rdma/rdma_endpoint.h

Lines 383 to 396 in f814432

    bool initialize_engine_by_dev(int gpu_index, bool enable_p2p_listen) {
      (void)enable_p2p_listen;

      gpu_index_ = gpu_index;

      std::vector<size_t> device_ids =
          RdmaDeviceManager::instance().get_best_dev_idx(gpu_index_);

      initializeContexts(device_ids);
      LOG(INFO) << "NICEndpoint initialized with " << contexts_.size()
                << " context(s) for GPU " << gpu_index_;

      return true;
    }
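
For readers following the thread, the lazy-creation flow described above can be sketched in isolation. This is only an illustration under stated assumptions, not the actual UCCL code: `LazyEndpoint`, the free function `get_best_dev_idx`, and the plain vector standing in for RDMA contexts are all hypothetical stand-ins, and the guard against a second initialization is an addition that the excerpted `initialize_engine_by_dev` does not have.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sentinel; the real INVALID_GPU is defined in the UCCL sources.
constexpr int INVALID_GPU = -1;

// Toy stand-in for RdmaDeviceManager::instance().get_best_dev_idx():
// pretends each GPU maps to exactly one NIC with the same index.
std::vector<std::size_t> get_best_dev_idx(int gpu_index) {
  return {static_cast<std::size_t>(gpu_index)};
}

class LazyEndpoint {
 public:
  // Eager path when the GPU is known; deferred path when it is INVALID_GPU.
  explicit LazyEndpoint(int gpu_index,
                        std::vector<std::size_t> device_ids = {})
      : gpu_index_(gpu_index) {
    if (gpu_index != INVALID_GPU) {
      initialize_contexts(device_ids.empty() ? get_best_dev_idx(gpu_index)
                                             : device_ids);
    }
  }

  // Deferred path: called once the GPU index has been discovered,
  // for example during memory registration.
  bool initialize_engine_by_dev(int gpu_index) {
    // Assumption (not in the excerpt): a repeated call succeeds only if it
    // names the GPU that was already used, and never re-opens contexts.
    if (initialized()) return gpu_index == gpu_index_;
    gpu_index_ = gpu_index;
    initialize_contexts(get_best_dev_idx(gpu_index_));
    return true;
  }

  bool initialized() const { return !contexts_.empty(); }
  std::size_t num_contexts() const { return contexts_.size(); }
  int gpu_index() const { return gpu_index_; }

 private:
  void initialize_contexts(const std::vector<std::size_t>& device_ids) {
    assert(!device_ids.empty());
    contexts_ = device_ids;  // a real endpoint would open RDMA contexts here
  }

  int gpu_index_;
  std::vector<std::size_t> contexts_;  // one entry per opened device
};
```

In the NIXL-style flow, the caller would construct `LazyEndpoint ep(INVALID_GPU);`, learn the GPU index when the first buffer is registered, and only then call `ep.initialize_engine_by_dev(gpu)`. Until that call, `ep.initialized()` stays false and no device is touched.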

praveingk · 2026-01-08

Got it 👍🏽 Thanks for clarifying.
Also, should we simplify the directory structure? For example, move rdma, tcp, and tcp-x inside transport/?

zhongjiechen · 2026-01-08

Seconded! It looks a bit messy right now. Let me reorganize them.

YangZhou1997 · 2026-01-08T05:11:27Z

@zhongjiechen, quick thought: it would be great to get rid of the extra chunk in engine.cc and engine.h, as I think that work should be done just once, in the underlying transport providers.

zhongjiechen added 2 commits January 7, 2026 15:57

Support initialize_engine_by_dev for P2P native RDMA.

62b236c

Fix pinning thread to numa.

7fbabb1

zhongjiechen force-pushed the p2p-opt branch from 7c8ec7c to 7fbabb1 Compare January 7, 2026 16:29

zhongjiechen added 3 commits January 7, 2026 17:02

Remove UCCL_P2P_USE_NATIVE_RDMA.

4b2f054

Fix build.sh to use USE_IB=1.

3c0cef3

Switch USE_IB to USE_EFA to pass build-on-gpu-host.

ae79694

zhongjiechen force-pushed the p2p-opt branch from b63e6f2 to ae79694 Compare January 7, 2026 17:27

Remove the compiler dependency on collective.

f814432

praveingk reviewed Jan 8, 2026
