-
Notifications
You must be signed in to change notification settings - Fork 110
[P2P] Remove collective backend from P2P, only use P2P native RDMA implementation #624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
hi @zhongjie, we also have some lazy init functionality used by uccl_engine.h (from Nixl). Our IB and EFA would also want to support that. |
yeah! i have supported that with minor modification but need to run Nixl once to test |
| new NICEndpoint(local_gpu_idx_, INVALID_RANK_ID, 0, false)); | ||
| numa_node_ = 0; | ||
| #else | ||
| ep_ = new uccl::RDMAEndpoint(num_cpus_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhongjiechen Does the new NIC endpoint support initializing without gpu_idx for RDMA. This change was made to support NIXL, where the local_gpu_idx not passed in during initialization, but is discovered during memory registration, and then NIC is initialized accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah! I deliver INVALID_GPU to its constructor here:
Lines 108 to 110 in f814432
| // Initialize the RDMA endpoint with lazy creation. | |
| ep_ = std::shared_ptr<NICEndpoint>( | |
| new NICEndpoint(INVALID_GPU, INVALID_RANK_ID, 0, false)); |
Lines 24 to 35 in f814432
| if (gpu_index != INVALID_GPU) { | |
| std::vector<size_t> actual_device_ids; | |
| if (device_ids.size() == 0) { | |
| actual_device_ids = | |
| RdmaDeviceManager::instance().get_best_dev_idx(gpu_index); | |
| } else { | |
| actual_device_ids = device_ids; | |
| } | |
| initializeContexts(actual_device_ids); | |
| LOG(INFO) << "NICEndpoint initialized with " << contexts_.size() | |
| << " context(s) for GPU " << gpu_index; | |
| } |
And call initialize_engine here during memory registration:
Line 167 in f814432
| unified::initialize_engine_by_dev(ep_, local_gpu_idx_, false); |
With P2P native RDMA's
initialize_engine_by_dev:Lines 383 to 396 in f814432
| bool initialize_engine_by_dev(int gpu_index, bool enable_p2p_listen) { | |
| (void)enable_p2p_listen; | |
| gpu_index_ = gpu_index; | |
| std::vector<size_t> device_ids = | |
| RdmaDeviceManager::instance().get_best_dev_idx(gpu_index_); | |
| initializeContexts(device_ids); | |
| LOG(INFO) << "NICEndpoint initialized with " << contexts_.size() | |
| << " context(s) for GPU " << gpu_index_; | |
| return true; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it 👍🏽 Thanks for clarifying.
Also, should we simplify the directory structure? for e.g move rdma, tcp, tcp-x inside transport\?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second you! It looks a bit messy right now... Let me reorganize them.
|
@zhongjiechen , quick thought: it would be great to get rid of the extra chunk in the engine.cc and engine.h, as I think they should be just done once in the underlying transport providers. |
Description
Please include a summary of the changes and the related issue.
Fixes # (issue)
Type of Change
How Has This Been Tested?
Include any tests here.
Checklist
format.sh.build_and_install.shto verify compilation.