docs: Add parallel processing feature proposal for issue #8 #308

Open
krrish175-byte wants to merge 3 commits into kubeedge:main from krrish175-byte:fix/issue-8

Conversation

@krrish175-byte

@krrish175-byte krrish175-byte commented Jan 29, 2026

What type of PR is this?

/kind documentation
/kind proposal

What this PR does / why we need it:

This PR adds a comprehensive technical proposal for implementing parallel processing of test cases in Ianvs core, as requested in issue #8 and discussed in PR #308.

The proposal addresses the need to reduce benchmarking time when testing multiple parameter configurations or algorithms. Currently, Ianvs executes test cases serially, which can lead to excessive execution times (e.g., 5 test cases × 2 hours each = 10 hours total). This feature will enable concurrent execution across multiple CPU cores, significantly reducing total benchmarking time.

Proposal Contents:

  • Motivation & Problem Statement: Why parallel processing is essential for all Ianvs developers
  • Architecture Design: Detailed technical design showing integration with Ianvs core components
  • Backward Compatibility Analysis: Comprehensive demonstration that all existing examples continue to work without modification
  • Impact Assessment: Analysis of how this affects current running examples across all scenarios
  • Testing & Validation Strategy: Plan for validating all existing examples in both serial and parallel modes
  • Implementation Roadmap: Phased 4-week approach with clear milestones
  • Risk Assessment: Identified risks and mitigation strategies
  • Performance Analysis: Expected speedup metrics (3-6x for typical workloads)

Key Design Principles:

  1. Backward Compatible: Parallel execution is opt-in; default behavior remains serial
  2. Zero Breaking Changes: All existing examples and workflows continue to function unchanged
  3. Robust Error Handling: Failures in one test case don't crash the entire benchmarking job
  4. User Control: Flexible configuration via CLI flags and YAML configuration
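As a concrete sketch of the opt-in behavior described above, the proposed `-p/--parallel` and `-w/--workers` flags could be wired up roughly as follows. This is a minimal illustration, not the actual Ianvs CLI code: `build_parser` and `resolve_workers` are hypothetical names, and the fallback for `os.cpu_count()` returning `None` is an assumption.

```python
import argparse
import os

def build_parser():
    # Hypothetical sketch of the proposed opt-in flags; serial remains the default.
    parser = argparse.ArgumentParser(prog="ianvs")
    parser.add_argument("-p", "--parallel", action="store_true",
                        help="enable parallel test-case execution (default: serial)")
    parser.add_argument("-w", "--workers", type=int, default=None,
                        help="number of worker processes (default: cpu_count - 1)")
    return parser

def resolve_workers(args):
    # Serial mode is the default: without --parallel, exactly one worker is used.
    if not args.parallel:
        return 1
    if args.workers is not None:
        return max(1, args.workers)
    # os.cpu_count() may return None on some platforms; guard with a fallback.
    return max(1, (os.cpu_count() or 2) - 1)
```

With no flags, `resolve_workers` returns 1, preserving today's serial behavior; the YAML keys `parallel_execution` and `num_workers` would feed the same two values.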

Which issue(s) this PR fixes:

Related to #8
Related to #308

Special notes for reviewers:

This proposal is intended for community review and architectural discussion before proceeding with implementation, as requested by @MooreZheng in PR #308.

The proposal will be presented at the next KubeEdge SIG AI meeting for formal architectural review. Feedback on the following aspects is particularly welcome:

  • Overall architecture and design approach
  • Impact analysis completeness
  • Testing and validation strategy
  • Implementation timeline and phases
  • Any concerns about backward compatibility or existing examples

Next Steps:

  1. Community review of this proposal
  2. Presentation at KubeEdge SIG AI meeting (pending schedule)
  3. Revisions based on community feedback
  4. Approval for implementation
  5. Update PR docs: Add parallel processing feature proposal for issue #8 #308 with implementation based on approved design

Note: This is a documentation-only PR. The actual implementation will be submitted separately after proposal approval.

/cc @MooreZheng @hsj576 @Poorunga
/assign @MooreZheng

@kubeedge-bot kubeedge-bot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/feature Categorizes issue or PR as related to a new feature. labels Jan 29, 2026
@kubeedge-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: krrish175-byte
To complete the pull request process, please assign moorezheng after the PR has been reviewed.
You can assign the PR to them by writing /assign @moorezheng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist

Summary of Changes

Hello @krrish175-byte, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Ianvs benchmarking system by introducing parallel processing capabilities for test case execution. Previously, test cases ran serially, which could lead to excessive execution times for complex or numerous tests. The changes enable concurrent execution across multiple CPU cores, drastically reducing overall benchmarking duration. This is achieved through new command-line interface options, corresponding configuration file parameters, and a refactored internal mechanism that leverages Python's multiprocessing to distribute workloads efficiently and robustly.

Highlights

  • Parallel Execution: Implemented parallel processing for test case execution using concurrent.futures.ProcessPoolExecutor, allowing multiple test cases to run concurrently.
  • CLI Arguments: Introduced new command-line flags: -p / --parallel to enable parallel execution and -w / --workers to specify the number of worker processes.
  • Configuration Support: Added support for parallel_execution (boolean) and num_workers (integer) parameters within the benchmarking job configuration files.
  • Code Refactoring: Extracted the core test case execution logic into a new top-level, picklable function run_testcase_func to ensure compatibility with Python's multiprocessing module.
  • Robust Error Handling: Enhanced error handling during parallel execution, ensuring that a failure in one test case does not halt the entire benchmarking job, with errors being logged and other cases continuing to process.
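The picklable worker and per-case error handling summarized above might look roughly like this. It is an illustrative sketch, not the actual Ianvs code: `run_all`, the result-dict keys, and the `testcase.run(workspace)` / `testcase.algorithm.name` accessors follow the names used in this summary but are otherwise assumptions.

```python
import concurrent.futures

def run_testcase_func(testcase, workspace):
    # Top-level function so it is picklable by multiprocessing.
    # A failure is converted into a status dict instead of propagating.
    try:
        res = testcase.run(workspace)
        return {"status": "success", "config": testcase.algorithm.name, "results": res}
    except Exception as e:
        return {"status": "failed", "config": testcase.algorithm.name, "error": str(e)}

def run_all(test_cases, workspace, workers):
    # Each future resolves to a status dict, so one failed test case is
    # reported in the results rather than crashing the whole job.
    results = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        future_to_case = {pool.submit(run_testcase_func, tc, workspace): tc
                          for tc in test_cases}
        for fut in concurrent.futures.as_completed(future_to_case):
            results.append(fut.result())
    return results
```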


@kubeedge-bot kubeedge-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 29, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces parallel processing for test case execution, which is a great feature for improving performance on multi-core systems. The implementation using concurrent.futures.ProcessPoolExecutor is solid. I've made a few suggestions to improve robustness, efficiency, and maintainability. Specifically, I've pointed out a potential crash when determining the number of worker processes, suggested an optimization to reduce data transfer from worker processes, and recommended using the project's standard logger for consistency. Overall, this is a valuable addition.

if parallel:
    # Determine number of workers
    if workers is None:
        workers = max(1, os.cpu_count() - 1)


high

os.cpu_count() can return None on some platforms, which would cause a TypeError when you try to subtract 1 from it. You should handle this case to prevent the program from crashing.

Suggested change:
- workers = max(1, os.cpu_count() - 1)
+ workers = max(1, (os.cpu_count() or 2) - 1)

Comment on lines +124 to +127
if k == "parallel_execution":
    self.parallel = v
if k == "num_workers":
    self.workers = v


medium

This logic for parsing parallel execution parameters is a bit disconnected from the main configuration parsing structure. For better readability and maintainability, consider integrating it into the if/elif/else block above. This would make the flow of control clearer.

# Run training with config parameters
# Evaluate and collect metrics
res = testcase.run(workspace)
return {"status": "success", "config": testcase.algorithm.name, "results": res, "testcase": testcase}


medium

Returning the full testcase object from the worker process is unnecessary and can be inefficient. The testcase object is already available in the main process through the future_to_testcase map. Removing it from the returned dictionary will reduce serialization/deserialization overhead.

Suggested change:
- return {"status": "success", "config": testcase.algorithm.name, "results": res, "testcase": testcase}
+ return {"status": "success", "config": testcase.algorithm.name, "results": res}

    res = testcase.run(workspace)
    return {"status": "success", "config": testcase.algorithm.name, "results": res, "testcase": testcase}
except Exception as e:
    return {"status": "failed", "config": testcase.algorithm.name, "error": str(e), "testcase": testcase}


medium

Returning the full testcase object from the worker process is unnecessary and can be inefficient. The testcase object is already available in the main process through the future_to_testcase map. Removing it from the returned dictionary will reduce serialization/deserialization overhead.

Suggested change:
- return {"status": "failed", "config": testcase.algorithm.name, "error": str(e), "testcase": testcase}
+ return {"status": "failed", "config": testcase.algorithm.name, "error": str(e)}

if workers is None:
    workers = max(1, os.cpu_count() - 1)

print(f"Running {len(self.test_cases)} test cases on {workers} workers")


medium

This uses print for logging, but the project seems to have a configured LOGGER (e.g., in core/cmd/benchmarking.py). For consistent and manageable logging, it's better to use the logger instance. This also applies to the print statements on lines 82 and 84. Consider using LOGGER.info, LOGGER.warning, or LOGGER.error as appropriate. You will need to import LOGGER from core.common.log.

Suggested change:
- print(f"Running {len(self.test_cases)} test cases on {workers} workers")
+ LOGGER.info(f"Running {len(self.test_cases)} test cases on {workers} workers")

@kubeedge-bot kubeedge-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 29, 2026
@krrish175-byte
Author

Hi @MooreZheng, can you please provide updates on this pr?

@MooreZheng MooreZheng requested review from hsj576 and removed request for Poorunga February 2, 2026 01:21
@MooreZheng
Collaborator

MooreZheng commented Feb 2, 2026

Hi @MooreZheng, can you please provide updates on this pr?

Welcome, Krrish. The work will be appreciated.

This is related to issue #8. Please note that parallel processing is an important feature in ianvs core code that will impact all ianvs developers' work, past, present, and future. For such important features, you need to provide a proposal for community reviewers to show how it would affect all current running examples. Then a formal presentation is needed to launch a review of architecture design in the KubeEdge SIG AI before getting to any implementation.

See a proposal example in https://github.com/kubeedge/ianvs/blob/main/docs/proposals/scenarios/GovDoc2Poster/GovDoc2Poster.md

@kubeedge-bot kubeedge-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 2, 2026
@krrish175-byte krrish175-byte changed the title feat: Add parallel processing for multiple test case execution feat: Proposal for parallel processing in Ianvs Feb 2, 2026
@krrish175-byte krrish175-byte changed the title feat: Proposal for parallel processing in Ianvs docs: Add parallel processing feature proposal for issue #8 Feb 2, 2026
Collaborator

@MooreZheng MooreZheng left a comment


As discussed in the routine meeting, the parallel processing feature is critical for every developer of ianvs and could be the most important and challenging feature in 2026. We want to take this opportunity to thank @krrish175-byte for the hard work, comprehensive proposal, great presentation, and valuable discussion.

I have several comments in hand

  1. Code changed.
    The changed-code analysis could be moved from the appendix into the main content. Note that all code is appended rather than modified.

  2. Analyse possible worker OOM problem.
    We can provide default settings to users, but those defaults also need to be researched: by default, how many workers, and how much memory per worker, would be best? Too few workers yields only a slight improvement over the serial setting, while too many workers produce extra overhead.
    In the future, we might also need a dynamic worker setting.

  3. Analyse parallel processing for the AI scheme.
    Joint inference would be the simplest one: using data partition and the map-reduce technique, we can just combine the test results.
    But training is a different story: how do we combine the training results (model parameters?). Parallel processing of lifelong learning can be considered given its multi-model nature. Parallel processing of federated learning can be considered given its distributed-learning nature, with local training and global aggregation (e.g., FedAvg). But incremental learning would be difficult: it needs parallelism within a single model's training - how would the parameters be divided and combined during training in this proposal?
    We might need to discuss this on a case-by-case basis.

@krrish175-byte
Author

Thank you so much @MooreZheng for the positive feedback and constructive comments! I am thrilled that the proposal was well received in the routine meeting.

I will address each of your points:

1. Code Changes - Move from Appendix to Content

Action: I'll restructure the proposal to move the code analysis from Appendix A into the main Implementation Details section with more context and explanation.

Updated Structure:

  • Section 7 (Implementation Details) will include:
    • Detailed code changes with before/after comparisons
    • Explanation of why each file is modified
    • Emphasis that code is being appended (not modified)
    • Clear separation between "unchanged code" and "new code"

I will update this in the next revision.

2. Worker OOM Problem & Default Settings Research

I completely agree that default settings need empirical research. Here's my proposed approach:

Phase 1: Conservative Defaults

  • Use cpu_count - 1 as initial safe default
  • Add runtime memory monitoring with warnings

Phase 1.5: Systematic Research (Weeks 5-7)

I propose conducting systematic experiments to determine optimal defaults:

Experiment Design:

  1. Worker Count vs Performance

    • Test configurations: 1, 2, 4, 6, 8, 12, 16 workers
    • Workloads: All example benchmarks (PCB-AOI, Robot, LLM, etc.)
    • Measure: Speedup, CPU utilization, memory usage, swap activity
  2. Memory Profiling

    • Profile memory per test case for each example
    • Determine: optimal_workers = floor(available_RAM * 0.8 / memory_per_test)
    • Document memory requirements by workload type
  3. OOM Detection & Prevention

    • Implement memory monitoring in the code
    • Add warnings when workers * estimated_memory > available_RAM
    • Provide clear error messages with recommended worker reduction
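The sizing rule from the memory-profiling step above can be captured by a small helper. This is a hypothetical sketch (`suggest_workers` is not an existing Ianvs function): the 0.8 headroom factor comes from the formula above, while capping at the CPU-derived default is an assumption.

```python
import math
import os

def suggest_workers(available_ram_bytes, mem_per_testcase_bytes, headroom=0.8):
    # Memory bound: keep total estimated usage under 80% of available RAM,
    # per the formula optimal_workers = floor(available_RAM * 0.8 / memory_per_test).
    by_memory = math.floor(available_ram_bytes * headroom / mem_per_testcase_bytes)
    # CPU bound: never exceed the conservative cpu_count - 1 default.
    by_cpu = max(1, (os.cpu_count() or 2) - 1)
    # Always allow at least one worker, even if a single case exceeds the budget.
    return max(1, min(by_memory, by_cpu))
```

A warning when `workers * estimated_memory > available_RAM` would follow directly from the same two inputs.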

Deliverables:

  • Research report: "Optimal Worker Configuration for Ianvs Benchmarks"
  • Decision matrix: Worker recommendations by workload type
  • Updated defaults based on empirical data
  • Documentation: "How to Choose Worker Count" guide

Timeline:

  • Week 5-6: Run systematic experiments
  • Week 7: Analyze results and document findings
  • Week 8: Update proposal and defaults based on research

Dynamic Worker Setting (Future Enhancement)

As you suggested, we could add dynamic worker adjustment:

# Monitor memory during execution (illustrative pseudocode)
if memory_usage > threshold:
    reduce_workers()
    LOGGER.warning(f"Memory pressure detected, reducing workers to {new_count}")

One question: Should this empirical research be completed before the Phase 1 merge, or can we release with conservative defaults and conduct the research immediately after?

3. Parallel Processing for AI Schemes

This is an excellent point about different learning paradigms. Let me address each:

Joint Inference (Simple - Covered)

It's already supported:

  • Data partition + map-reduce works perfectly
  • Each worker processes different test samples
  • Results are combined (e.g., accuracy aggregation)
  • No special handling needed
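The data-partition plus map-reduce combination described above can be sketched in a few lines. This is illustrative only; `partition` and `reduce_accuracy` are hypothetical helpers, with accuracy aggregated from per-worker (correct, total) pairs.

```python
def partition(samples, num_workers):
    # Round-robin data partition; each worker would score one chunk.
    chunks = [[] for _ in range(num_workers)]
    for i, sample in enumerate(samples):
        chunks[i % num_workers].append(sample)
    return chunks

def reduce_accuracy(partials):
    # Reduce step: sum (correct, total) pairs reported by the workers.
    correct = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return correct / total if total else 0.0
```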

Lifelong Learning (Multi-Model - Covered)

It is supported by its multi-model nature:

  • Each test case trains a different model variant
  • Models are independent (no parameter sharing during parallel execution)
  • Results combined after training
  • Works naturally with current design

Federated Learning (Distributed Training - Covered)

It is supported by its distributed-learning nature:

  • Each test case does local training
  • Global aggregation (e.g., FedAvg) happens after local training completes
  • The parallel processing is at the test case level, not within training
  • Current design supports this
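For reference, the FedAvg-style aggregation mentioned above reduces to a sample-count-weighted mean of the local parameters. A minimal sketch, with the hypothetical `fed_avg` operating on plain dicts rather than real model weights:

```python
def fed_avg(local_params, sample_counts):
    # FedAvg aggregation: weighted mean of each parameter, with weights
    # proportional to each client's local sample count.
    total = sum(sample_counts)
    keys = local_params[0].keys()
    return {k: sum(p[k] * n for p, n in zip(local_params, sample_counts)) / total
            for k in keys}
```

In the current proposal this step stays inside one test case; only whole test cases run in parallel.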

Incremental Learning (Complex - Needs Discussion)

This is the challenging case

The Problem:
Incremental learning trains one model progressively across multiple rounds. Parallelizing this requires:

  • Dividing model parameters during training
  • Synchronizing gradients across workers
  • Combining parameter updates

Current Proposal Scope:
This proposal parallelizes test cases (different algorithm configurations), not intra-model training (parameter parallelism).

Example:

  • Covered: Testing 5 different incremental learning algorithms in parallel (each with its own model)
  • Not Covered: Parallelizing the training of a single incremental learning model across multiple workers

Discussion Needed:

We need to discuss this case-by-case. Here are the options:

Option 1: Keep Out of Scope for Phase 1

  • Current proposal: Parallel test cases, not parallel training
  • Incremental learning with single model -> runs serially within that test case
  • Users can still parallelize by testing multiple incremental learning configurations

Option 2: Add Intra-Training Parallelism (Major Feature)

  • This would require:
    • Model parameter partitioning
    • Gradient synchronization
    • Distributed training framework integration (e.g., PyTorch DDP, Horovod)
  • This is a much larger scope - probably a separate proposal

Option 3: Hybrid Approach

  • Phase 1: Test case parallelism (current proposal)
  • Phase 2: Research and propose intra-training parallelism for incremental learning
  • Document limitations for incremental learning in Phase 1

My Recommendation:

For Phase 1, I suggest:

  1. Focus on test case parallelism (different algorithm configs)
  2. Document in the proposal: "Intra-model parameter parallelism for incremental learning is out of scope for Phase 1"
  3. Add to Future Work: "Phase 3: Intra-Training Parallelism for Single-Model Scenarios"
  4. For incremental learning benchmarks: Users can parallelize by testing multiple incremental configurations (different hyperparameters, different algorithms)

I have some questions:

  • Should intra-training parallelism be a requirement for Phase 1?
  • Or can we scope it as Phase 2/3 and focus Phase 1 on test case parallelism?
  • Should we schedule a specific discussion session about incremental learning parallelization?

Next Steps

  1. Immediate (This Week):

    • Update proposal with code changes moved to main content
    • Add detailed section on Worker OOM prevention
    • Add section clarifying AI scheme parallelization support and limitations
    • Add empirical research plan for optimal defaults
  2. Follow-up Discussion:

    • Schedule discussion on incremental learning parallelization approach
    • Determine if intra-training parallelism should be in Phase 1 or later
  3. Timeline:

    • Post updated proposal: By end of this week
    • Address any additional feedback: Week 2
    • Begin implementation: Week 3 (pending approval)

Thank you again for the thorough review and excellent feedback! This discussion is making the proposal much stronger.

Looking forward to your thoughts on the questions above, particularly around:

  1. Timing of empirical worker research (before or after Phase 1 merge?)
  2. Scope of incremental learning support (Phase 1 or separate proposal?)

I hope to see you all in the next meeting. Thanks a lot for the time and efforts. I am excited to work on this issue :)

Best regards,
Krrish

@hsj576
Member

hsj576 commented Mar 5, 2026

Thanks for putting together this proposal.
One key point to emphasize: any changes to the core code must not break any existing examples. Since this feature touches the core execution pipeline (e.g., testcasecontroller.py, benchmarkingjob.py), please make sure all current examples can still run correctly without any modification — both in the default serial mode and when the parallel feature is opted in.

Collaborator

@MooreZheng MooreZheng left a comment


  1. As for the out-of-memory issue, the research proposal seems fine to me. The worker-count research findings will be helpful, and we still need that research to find a default worker setting for at least one example. An initial study needs to be done before merging the first example. We might also need a dynamic worker setting in the future.
  2. However, the current research on parallel processing for the different paradigms is not sufficient, even for a proposal. The content shows a lack of knowledge of the current ianvs implementation. This research may be left to future work.
  • For joint inference, considering its one-model nature, tensor partition and data partition could work, even with the map-reduce technique.
  • For lifelong learning, a deeper dive into ianvs is needed. Its multi-module and multi-model nature might suit pipeline partition and model partition well.
  3. Our reviewer is not yet convinced of the impact on current examples. The proposals need to be combined into one file, and the combined proposal shall show the code-revision considerations and the justification for the compatibility design. Besides, during implementation, demos for the examples are needed to make the impact validation convincing for all members.
  4. Renew the Lay-0 architecture by referring to the examples below:
  5. We also see a DCO error. It usually means that you forgot to sign off your commits. See the report in https://github.com/kubeedge/ianvs/pull/308/checks?check_run_id=64329516901

