@zhongjiechen
Member

Description

Please include a summary of the changes and the related issue.

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@YangZhou1997
Member

hi @zhongjie, we also have some lazy init functionality used by uccl_engine.h (from Nixl). Our IB and EFA would also want to support that.

@zhongjiechen
Member Author

zhongjiechen commented Jan 7, 2026

> hi @zhongjie, we also have some lazy init functionality used by uccl_engine.h (from Nixl). Our IB and EFA would also want to support that.

Yeah! I have added support for that with a minor modification, but I still need to run Nixl once to test it.

new NICEndpoint(local_gpu_idx_, INVALID_RANK_ID, 0, false));
numa_node_ = 0;
#else
ep_ = new uccl::RDMAEndpoint(num_cpus_);
Collaborator

@zhongjiechen Does the new NIC endpoint support initializing without gpu_idx for RDMA? This change was made to support NIXL, where local_gpu_idx is not passed in during initialization but is discovered during memory registration, after which the NIC is initialized accordingly.

Member Author

@zhongjiechen zhongjiechen Jan 8, 2026

Yeah! I pass INVALID_GPU to its constructor here:

uccl/p2p/engine.cc

Lines 108 to 110 in f814432

// Initialize the RDMA endpoint with lazy creation.
ep_ = std::shared_ptr<NICEndpoint>(
    new NICEndpoint(INVALID_GPU, INVALID_RANK_ID, 0, false));

if (gpu_index != INVALID_GPU) {
  std::vector<size_t> actual_device_ids;
  if (device_ids.size() == 0) {
    actual_device_ids =
        RdmaDeviceManager::instance().get_best_dev_idx(gpu_index);
  } else {
    actual_device_ids = device_ids;
  }
  initializeContexts(actual_device_ids);
  LOG(INFO) << "NICEndpoint initialized with " << contexts_.size()
            << " context(s) for GPU " << gpu_index;
}

Then initialize_engine_by_dev is called here during memory registration:

unified::initialize_engine_by_dev(ep_, local_gpu_idx_, false);

With P2P native RDMA's initialize_engine_by_dev:

bool initialize_engine_by_dev(int gpu_index, bool enable_p2p_listen) {
  (void)enable_p2p_listen;
  gpu_index_ = gpu_index;
  std::vector<size_t> device_ids =
      RdmaDeviceManager::instance().get_best_dev_idx(gpu_index_);
  initializeContexts(device_ids);
  LOG(INFO) << "NICEndpoint initialized with " << contexts_.size()
            << " context(s) for GPU " << gpu_index_;
  return true;
}
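
For anyone following the thread, the lazy-creation flow described above can be sketched roughly as below. This is a minimal standalone illustration, not the actual UCCL code: LazyEndpoint, kInvalidGpu, and pick_devices_for_gpu are hypothetical stand-ins for NICEndpoint, INVALID_GPU, and RdmaDeviceManager::get_best_dev_idx, and the "already initialized" guard is an assumption added for the sketch rather than something shown in the snippets above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sentinel; the real INVALID_GPU value in UCCL may differ.
constexpr int kInvalidGpu = -1;

// Hypothetical stand-in for RdmaDeviceManager::get_best_dev_idx().
// In this sketch every GPU simply maps to one NIC with the same index.
inline std::vector<std::size_t> pick_devices_for_gpu(int gpu_index) {
  return {static_cast<std::size_t>(gpu_index)};
}

class LazyEndpoint {
 public:
  // Eager path: a valid gpu_index creates the contexts right away.
  // Lazy path: kInvalidGpu defers context creation until later.
  explicit LazyEndpoint(int gpu_index) {
    if (gpu_index != kInvalidGpu) {
      initialize_by_dev(gpu_index);
    }
  }

  // Called once the GPU is known, e.g. during memory registration.
  // Returns false if contexts already exist (guard assumed for this sketch).
  bool initialize_by_dev(int gpu_index) {
    assert(gpu_index != kInvalidGpu);
    if (initialized()) return false;
    gpu_index_ = gpu_index;
    contexts_ = pick_devices_for_gpu(gpu_index);
    return true;
  }

  bool initialized() const { return gpu_index_ != kInvalidGpu; }
  int gpu_index() const { return gpu_index_; }
  std::size_t num_contexts() const { return contexts_.size(); }

 private:
  int gpu_index_ = kInvalidGpu;
  std::vector<std::size_t> contexts_;  // one entry per NIC context
};
```

In a NIXL-style flow, the endpoint would be constructed with kInvalidGpu and initialize_by_dev would be called once the GPU index has been discovered from the registered memory; callers that already know the GPU can pass it to the constructor directly.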

Collaborator

Got it 👍🏽 Thanks for clarifying.
Also, should we simplify the directory structure? For example, move rdma, tcp, and tcp-x inside transport/?

Member Author

Seconded! It looks a bit messy right now. Let me reorganize them.

@YangZhou1997
Member

@zhongjiechen, quick thought: it would be great to get rid of the extra chunk in engine.cc and engine.h, as I think that work should be done just once in the underlying transport providers.

4 participants