docs: Add parallel processing feature proposal for issue #8 (#308)
krrish175-byte wants to merge 3 commits into kubeedge:main from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: krrish175-byte. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Summary of Changes
Hello @krrish175-byte, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the Ianvs benchmarking system by introducing parallel processing for test case execution. Previously, test cases ran serially, which could lead to excessive execution times for complex or numerous tests. The changes enable concurrent execution across multiple CPU cores, substantially reducing overall benchmarking duration. This is achieved through new command-line interface options, corresponding configuration file parameters, and a refactored internal mechanism that leverages Python's multiprocessing to distribute workloads efficiently and robustly. Highlights
Code Review
This pull request introduces parallel processing for test case execution, which is a great feature for improving performance on multi-core systems. The implementation using concurrent.futures.ProcessPoolExecutor is solid. I've made a few suggestions to improve robustness, efficiency, and maintainability. Specifically, I've pointed out a potential crash when determining the number of worker processes, suggested an optimization to reduce data transfer from worker processes, and recommended using the project's standard logger for consistency. Overall, this is a valuable addition.
    if parallel:
        # Determine number of workers
        if workers is None:
            workers = max(1, os.cpu_count() - 1)
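The crash risk flagged in the review comes from `os.cpu_count()`, which is documented to return `None` when the core count cannot be determined, making the subtraction raise a `TypeError`. A defensive sketch (the `resolve_workers` helper name is hypothetical, not the PR's actual code):

```python
import os

def resolve_workers(workers=None):
    # os.cpu_count() may return None on platforms where the core
    # count cannot be determined; fall back to 1 before reserving
    # one core for the main process.
    if workers is None:
        cpu_count = os.cpu_count() or 1
        workers = max(1, cpu_count - 1)
    return workers
```

An explicitly requested worker count passes through unchanged; only the default path needs the `or 1` guard.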
core/cmd/obj/benchmarkingjob.py
Outdated
    if k == "parallel_execution":
        self.parallel = v
    if k == "num_workers":
        self.workers = v
    # Run training with config parameters
    # Evaluate and collect metrics
    res = testcase.run(workspace)
    return {"status": "success", "config": testcase.algorithm.name, "results": res, "testcase": testcase}
Returning the full testcase object from the worker process is unnecessary and can be inefficient. The testcase object is already available in the main process through the future_to_testcase map. Removing it from the returned dictionary will reduce serialization/deserialization overhead.
Suggested change:
-     return {"status": "success", "config": testcase.algorithm.name, "results": res, "testcase": testcase}
+     return {"status": "success", "config": testcase.algorithm.name, "results": res}
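To illustrate the suggestion, here is a minimal sketch of a main-process loop that recovers each submitted object from a local `future_to_testcase`-style mapping instead of shipping it back from the worker. All names (`run_one`, `run_all`) are illustrative stand-ins, not the PR's actual code:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_one(value):
    # Hypothetical worker: return only lightweight results, not the
    # (potentially large) submitted object itself.
    return {"status": "success", "results": value * value}

def run_all(values, workers=2):
    results = {}
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # Map each future back to the object submitted in the main
        # process; there is no need to serialize it back from the worker.
        future_to_value = {executor.submit(run_one, v): v for v in values}
        for future in as_completed(future_to_value):
            value = future_to_value[future]  # recovered locally
            results[value] = future.result()["results"]
    return results
```

Because the dict built at submit time already associates each future with its input, the worker's return payload can stay small, cutting pickling overhead on the way back.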
        res = testcase.run(workspace)
        return {"status": "success", "config": testcase.algorithm.name, "results": res, "testcase": testcase}
    except Exception as e:
        return {"status": "failed", "config": testcase.algorithm.name, "error": str(e), "testcase": testcase}
Returning the full testcase object from the worker process is unnecessary and can be inefficient. The testcase object is already available in the main process through the future_to_testcase map. Removing it from the returned dictionary will reduce serialization/deserialization overhead.
Suggested change:
-     return {"status": "failed", "config": testcase.algorithm.name, "error": str(e), "testcase": testcase}
+     return {"status": "failed", "config": testcase.algorithm.name, "error": str(e)}
    if workers is None:
        workers = max(1, os.cpu_count() - 1)

    print(f"Running {len(self.test_cases)} test cases on {workers} workers")
This uses print for logging, but the project seems to have a configured LOGGER (e.g., in core/cmd/benchmarking.py). For consistent and manageable logging, it's better to use the logger instance. This also applies to the print statements on lines 82 and 84. Consider using LOGGER.info, LOGGER.warning, or LOGGER.error as appropriate. You will need to import LOGGER from core.common.log.
Suggested change:
-     print(f"Running {len(self.test_cases)} test cases on {workers} workers")
+     LOGGER.info(f"Running {len(self.test_cases)} test cases on {workers} workers")
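For readers unfamiliar with the pattern, here is a minimal standalone sketch of swapping `print` for a module-level logger. The `LOGGER` name mirrors the one the review says lives in `core.common.log`; this self-contained version is an assumption for illustration, not the project's actual module:

```python
import logging

# Stand-in for `from core.common.log import LOGGER`.
LOGGER = logging.getLogger("ianvs.benchmark")
LOGGER.setLevel(logging.INFO)

def announce(n_cases, workers):
    # %-style lazy formatting defers string building until the
    # record is actually emitted by a handler.
    LOGGER.info("Running %d test cases on %d workers", n_cases, workers)
```

Unlike `print`, messages routed through the logger respect the configured level and handlers, so they can be silenced, redirected to files, or timestamped without touching call sites.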
Hi @MooreZheng, can you please provide updates on this PR?
Welcome, Krrish. The work will be appreciated. This is related to issue #8. Please note that parallel processing is an important feature in the ianvs core code that will impact all ianvs developers' work, past, present, and future. For such important features, you need to provide a proposal for community reviewers to show how it would affect all currently running examples. Then a formal presentation is needed to launch a review of the architecture design in the KubeEdge SIG AI before getting to any implementation. See a proposal example in https://github.com/kubeedge/ianvs/blob/main/docs/proposals/scenarios/GovDoc2Poster/GovDoc2Poster.md
As discussed in the routine meeting, the parallel processing feature is critical for every developer of ianvs and could be the most important and challenging feature in 2026. We want to take this opportunity to thank @krrish175-byte for the hard work, comprehensive proposal, great presentation, and valuable discussion.
I have several comments in hand
- Code changes.
  The changed code analysis could be moved from the appendix to the main content. Note that all code is appended instead of being modified.
- Analyse the possible worker OOM problem.
  We can provide default settings to users. But default settings also need to be researched: by default, how many workers and how much memory per worker would be best? Too few workers yields only a slight improvement over the serial setting, while too many workers can produce extra overhead.
  In the future, we might also need a dynamic worker setting.
- Analyse parallel processing for the AI schemes.
  Joint inference would be the simplest one, by using data partition and the map-reduce technique - we can just combine the test results.
  But training is a different story - how to combine the training results (model parameters?). Parallel processing of lifelong learning can be considered via its multi-model nature. Parallel processing of federated learning can be considered via its distributed-learning nature with local training and global aggregation (e.g., FedAvg). But incremental learning would be difficult: it needs to parallelize one single model's training - how would the parameters be divided and combined during training in this proposal?
  We might need to discuss on a case-by-case basis.
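The data-partition idea for joint inference can be sketched as a tiny map-reduce, assuming per-sample inference results are independent. `infer_chunk` is a hypothetical stand-in for a real inference call, not an ianvs API:

```python
def partition(data, n):
    # Split the dataset into n roughly equal, order-preserving chunks.
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n)]

def infer_chunk(chunk):
    # "Map" step: hypothetical per-partition inference.
    return [x * 2 for x in chunk]

def joint_inference(data, n_workers=2):
    # Each chunk could run in its own worker process; since results
    # are independent, the "reduce" step is plain concatenation.
    mapped = [infer_chunk(c) for c in partition(data, n_workers)]
    return [y for part in mapped for y in part]
```

This is exactly why training is harder: its "reduce" step would have to merge model parameters rather than concatenate independent outputs.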
Thank you so much @MooreZheng for the positive feedback and constructive comments! I am thrilled that the proposal was well received in the routine meeting. I will address each of your points:

1. Code Changes - Move from Appendix to Content
Action: I'll restructure the proposal to move the code analysis from Appendix A into the main Implementation Details section with more context and explanation. Updated Structure:
I will update this in the next revision.

2. Worker OOM Problem & Default Settings Research
I completely agree that default settings need empirical research. Here's my proposed approach: Phase 1: Conservative Defaults
Phase 1.5: Systematic Research (Weeks 5-7)
I propose conducting systematic experiments to determine optimal defaults. Experiment Design:
Deliverables:
Timeline:
Dynamic Worker Setting (Future Enhancement)
As you suggested, we could add dynamic worker adjustment:

    # Monitor memory during execution
    if memory_usage > threshold:
        reduce_workers()
        LOGGER.warning(f"Memory pressure detected, reducing workers to {new_count}")

One question: should this empirical research be completed before the Phase 1 merge, or can we release with conservative defaults and conduct the research immediately after?

3. Parallel Processing for AI Schemes
This is an excellent point about the different learning paradigms. Let me address each:
Joint Inference (Simple - Covered)
It's already supported.
Lifelong Learning (Multi-Model - Covered)
It is supported by its multi-model nature.
Federated Learning (Distributed Training - Covered)
It is supported by its distributed-learning nature.
Incremental Learning (Complex - Needs Discussion)
This is the challenging case. The Problem:
Current Proposal Scope: Example:
Discussion Needed: We need to discuss this case-by-case. Here are the options: Option 1: Keep Out of Scope for Phase 1
Option 2: Add Intra-Training Parallelism (Major Feature)
Option 3: Hybrid Approach
My Recommendation: For Phase 1, I suggest:
I have some questions
Next Steps
Thank you again for the thorough review and excellent feedback! This discussion is making the proposal much stronger. Looking forward to your thoughts on the questions above, particularly around:
I hope to see you all in the next meeting. Thanks a lot for the time and effort. I am excited to work on this issue :) Best regards,
Thanks for putting together this proposal.
- As for the out-of-memory issue, the research proposal seems fine to me. The worker-count research findings will be helpful, and we still need to do the research to find a default worker setting for at least one example. An initial study needs to be done before merging the first example. We might also need a dynamic worker setting in the future.
- However, the current research on parallel processing for the different paradigms is not sufficient even for a proposal. The content shows a lack of knowledge of the current ianvs implementation. This research may be left to future work.
  - For joint inference, considering its one-model nature, tensor partition and data partition could work, even with the map-reduce technique.
  - For lifelong learning, a deeper dive into ianvs is needed. Its multi-module and multi-model nature might suit pipeline partition and model partition well.
- Our reviewer is not yet convinced of its impact on current examples. The proposals need to be combined into one file, and the combined proposal shall show the code revision considerations and justification for the compatibility design. Besides, during implementation, demos for the examples are needed to make the impact validation convincing for all members.
- Renew the Lay-0 architecture by referring to the examples below:
- https://github.com/kubeedge/ianvs/blob/main/docs/proposals/scenarios/GovDoc2Poster/GovDoc2Poster.md
- https://github.com/kubeedge/ianvs/blob/main/docs/proposals/scenarios/phys-scene-gen/phys_scene_gen_proposal.md
- https://github.com/kubeedge/ianvs/blob/main/docs/proposals/scenarios/PIPL-Compliant%20Cloud-Edge%20Collaborative%20Privacy-Preserving%20Prompt%20Processing%20Framework/optimizing-privacy-performance-equilibrium-cloud-edge-llm-systems_en_PR.md
- We also see a DCO error. It usually means that you forgot to sign off your commits. See the report in https://github.com/kubeedge/ianvs/pull/308/checks?check_run_id=64329516901
What type of PR is this?
/kind documentation
/kind proposal
What this PR does / why we need it:
This PR adds a comprehensive technical proposal for implementing parallel processing of test cases in Ianvs core, as requested in issue #8 and discussed in PR #308.
The proposal addresses the need to reduce benchmarking time when testing multiple parameter configurations or algorithms. Currently, Ianvs executes test cases serially, which can lead to excessive execution times (e.g., 5 test cases × 2 hours each = 10 hours total). This feature will enable concurrent execution across multiple CPU cores, significantly reducing total benchmarking time.
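The arithmetic above generalizes under an idealized model: with independent, equally long test cases, parallel wall-clock time is the number of scheduling "waves" times the per-case duration. This sketch ignores real-world overhead such as process startup and result serialization:

```python
import math

def serial_hours(n_cases, hours_each):
    # Serial execution: cases run back to back.
    return n_cases * hours_each

def parallel_hours(n_cases, hours_each, workers):
    # Idealized parallel execution: cases are independent and equally
    # long, so wall-clock time is ceil(n / workers) waves of hours_each.
    return math.ceil(n_cases / workers) * hours_each
```

For the example in the text, 5 cases of 2 hours each take 10 hours serially but only 4 hours on 4 workers (two waves), a 2.5x speedup before overhead.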
Proposal Contents:
Key Design Principles:
Which issue(s) this PR fixes:
Related to #8
Related to #308
Special notes for reviewers:
This proposal is intended for community review and architectural discussion before proceeding with implementation, as requested by @MooreZheng in PR #308.
The proposal will be presented at the next KubeEdge SIG AI meeting for formal architectural review. Feedback on the following aspects is particularly welcome:
Next Steps:
Note: This is a documentation-only PR. The actual implementation will be submitted separately after proposal approval.
/cc @MooreZheng @hsj576 @Poorunga
/assign @MooreZheng