[feature] [sub feature 2] Dependency for qwen image edit run #151

Open

SJTUyh wants to merge 4 commits into AISBench:master from SJTUyh:dep_edit

Conversation

@SJTUyh (Collaborator) commented Feb 13, 2026

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry; just make the pull request and ask the maintainers for help.

PR Type

  • Feature
  • Bugfix
  • Docs (documentation update)
  • CI/CD (continuous integration / continuous deployment)
  • Refactor (code refactoring)
  • Perf (performance optimization)
  • Dependency (dependency update)
  • Test-Cases (test case updates)
  • Other

Related Issue
Fixes #(issue ID) / Relates to #(issue ID)

🔍 Motivation

Please describe the motivation of this PR and the goal you want to achieve through it.

📝 Modification

Please briefly describe the modifications made in this PR.

📐 Associated Test Results

Please provide links to the related test results, such as CI pipelines, test reports, etc.

⚠️ BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of downstream repositories? If so, please describe how it breaks compatibility and how downstream projects should modify their code to remain compatible with this PR.

⚠️ Performance degradation (Optional)

If the modification introduces a performance degradation, please describe its impact and the expected performance improvement.

🌟 Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

✅ Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that caused the bug should be added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

  • If the modification has a potential influence on downstream or other related projects, this PR should be tested with those projects.
  • The CLA has been signed by all committers in this PR.

👥 Collaboration Info

  • Suggested Reviewers: @xxx
  • Relevant Module Owners: @xxx
  • Other Collaboration Notes:

🌟 Useful CI Commands

Command             Description
/gemini review      Performs a code review of the current pull request in its current state, by Gemini.
/gemini summary     Provides a summary of the current pull request in its current state, by Gemini.
/gemini help        Displays a list of available Gemini commands.
/readthedocs build  Triggers a Read the Docs build of the documentation for the current pull request in its current state.

@gemini-code-assist

Summary of Changes

Hello @SJTUyh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new feature that enables the execution of Qwen image editing models within the benchmark system, complete with advanced LLM-based judging capabilities. It establishes a sophisticated distributed parallelism framework, optimized for NPU devices, to efficiently run and evaluate these complex models. The changes also include foundational components for handling judge-specific datasets and improving overall data I/O performance.

Highlights

  • LLM Judging Framework: Introduced a comprehensive framework for LLM-based judging, including a new JudgeInfer worker, dedicated dataset types (BaseJDGDataset, LLMJudgeDataset), and an evaluator (LLMJudgeCorrectEvaluator). This enables automated evaluation of model predictions using another LLM.
  • Qwen Image Edit Model Integration: Integrated the Qwen Image Edit model pipeline, including its custom transformer (QwenImageTransformer2DModel) and scheduler (FlowMatchEulerDiscreteScheduler), designed for image editing tasks. This pipeline supports dual-stream attention for joint image and text processing.
  • Distributed Parallelism for Qwen Image Edit: Implemented a robust distributed parallelism infrastructure specifically for the Qwen Image Edit model, encompassing tensor parallelism, sequence parallelism (Ulysses and ring groups), and classifier-free guidance parallelism. This allows for efficient scaling of the Qwen Image Edit model across multiple devices.
  • NPU Optimization: Incorporated NPU-specific optimizations for key operations within the Qwen Image Edit transformer, such as rotary positional embeddings (npu_rotary_mul) and adaptive layer normalization (npu_layer_norm_eval), to enhance performance on Ascend hardware.
  • Efficient Data Handling: Added utility functions for fast reading and writing of JSONL files using mmap and orjson, improving data processing efficiency for large datasets.
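
The JSONL helpers themselves aren't reproduced in this summary, so here is a minimal sketch of what an mmap + orjson reader/writer pair can look like. The names load_jsonl/dump_jsonl come from the changelog below; the bodies are assumptions, not the PR's exact code:

    import mmap
    import orjson

    def load_jsonl(path):
        """Read JSONL via a read-only memory map; orjson parses each line."""
        records = []
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                for line in iter(mm.readline, b""):
                    line = line.strip()
                    if line:
                        records.append(orjson.loads(line))
        return records

    def dump_jsonl(records, path):
        """Serialize each record with orjson and write newline-delimited JSON."""
        with open(path, "wb") as f:
            for rec in records:
                f.write(orjson.dumps(rec))
                f.write(b"\n")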


Changelog
  • ais_bench/benchmark/cli/workers.py
    • Added os, shutil, PredictionInvalidException, TMAN_CODES, load_jsonl, and dump_jsonl imports.
    • Introduced a new JudgeInfer worker class to manage inference tasks for judge models, including configuration, task partitioning, dataset merging, and result post-processing.
    • Updated the Eval class to incorporate judge_infer_cfg for proper result handling and dataset abbreviation.
    • Modified the Summarize class to preprocess dataset configurations that include judge_infer_cfg.
    • Integrated the JudgeInfer worker into the WORK_FLOW for both all and eval operational modes.
  • ais_bench/benchmark/configs/datasets/aime2025/aime2025_gen_0_shot_llmjudge.py
    • Added a new dataset configuration file for the AIME2025 benchmark, specifically tailored for LLM judging with zero-shot generation.
    • Defined aime2025_reader_cfg, aime2025_infer_cfg, GRADER_TEMPLATE, aime2025_judge_infer_cfg, and aime2025_eval_cfg to support the judging process.
    • Configured judge_infer_cfg to utilize VLLMCustomAPIChat as the judge model and Aime2025JDGDataset for judge-specific dataset handling.
  • ais_bench/benchmark/datasets/aime2025.py
    • Imported LLMJudgeDataset to support LLM judging functionalities.
    • Added Aime2025JDGDataset class, which inherits from LLMJudgeDataset, to provide specific dataset loading and processing for AIME2025 judging tasks.
  • ais_bench/benchmark/datasets/base.py
    • Imported Type for type hinting.
    • Introduced BaseJDGDataset as an abstract base class for judge datasets, defining common methods for loading predictions and initializing original dataset instances.
    • Updated the load method signature in BaseDataset to correctly accept self.
  • ais_bench/benchmark/datasets/utils/datasets.py
    • Removed trailing whitespace in docstrings for get_sample_data and anls_compute functions.
  • ais_bench/benchmark/datasets/utils/llm_judge.py
    • Added a new utility file for LLM judging functionalities.
    • Implemented get_a_or_b as a text postprocessor to extract 'A' or 'B' from model predictions.
    • Defined LLMJudgeDataset class, extending BaseJDGDataset, for handling datasets specifically formatted for LLM judging.
    • Introduced LLMJudgeCorrectEvaluator to calculate scores based on LLM judge predictions.
  • ais_bench/benchmark/models/local_models/qwenimage_edit/attn_layer.py
    • Added a new file defining xFuserLongContextAttention_new4, a custom attention layer for distributed Qwen image edit models.
    • Extended LongContextAttention to integrate SeqAllToAll4D for sequence parallelism, enabling efficient data distribution.
    • Implemented logic for joint attention across text and image streams, including splitting and gathering outputs across distributed ranks.
  • ais_bench/benchmark/models/local_models/qwenimage_edit/distributed/all_to_all.py
    • Added a new file containing all_to_all_4D and SeqAllToAll4D functions.
    • Provided mechanisms for efficient 4D tensor data exchange (scatter/gather) across distributed processes, essential for sequence parallelism.
  • ais_bench/benchmark/models/local_models/qwenimage_edit/distributed/group_coordinator.py
    • Added a new file defining GroupCoordinator and SequenceParallelGroupCoordinator classes.
    • GroupCoordinator manages PyTorch ProcessGroups for various communication primitives (all-reduce, all-gather, broadcast, send/recv).
    • SequenceParallelGroupCoordinator extends GroupCoordinator to specifically manage Ulysses and ring groups for sequence parallelism.
    • Modified the cpu_group backend from 'gloo' to 'hccl' for compatibility with Ascend environments.
  • ais_bench/benchmark/models/local_models/qwenimage_edit/distributed/parallel_mgr.py
    • Added a new file for managing different types of parallel groups (tensor, sequence, classifier-free guidance).
    • Defined ParallelConfig and helper functions (get_tp_group, get_sp_group, get_cfg_group) to query and access parallel group information.
    • Included init_distributed_environment and initialize_model_parallel for setting up the distributed execution environment.
    • Integrated yunchang for sequence parallel process group initialization and adapted init_distributed_environment to use torch_npu.npu.set_device for Ascend devices.
  • ais_bench/benchmark/models/local_models/qwenimage_edit/distributed/utils.py
    • Added a new utility file for generating distributed rank groups.
    • Implemented generate_masked_orthogonal_rank_groups to create orthogonal parallel groups based on specified parallel sizes and masks.
    • Introduced RankGenerator to facilitate the generation of rank groups for tensor, sequence, and CFG parallelism degrees (a rough sketch of this grouping follows the changelog).
  • ais_bench/benchmark/models/local_models/qwenimage_edit/pipeline_qwenimage_edit_plus.py
    • Added a new file defining QwenImageEditPlusPipeline, the core pipeline for Qwen-Image-Edit.
    • Integrated QwenImageTransformer2DModel and FlowMatchEulerDiscreteScheduler into the pipeline.
    • Implemented logic for prompt encoding, latent variable preparation, and the denoising loop for image generation.
    • Incorporated distributed Classifier-Free Guidance (CFG) logic using get_classifier_free_guidance_world_size() and get_cfg_group().all_gather() for parallel CFG computation.
    • Introduced COND_CACHE and UNCOND_CACHE environment variables for conditional and unconditional caching, and a USE_NPU flag for NPU device detection.
  • ais_bench/benchmark/models/local_models/qwenimage_edit/scheduling_flow_match_euler_discrete.py
    • Added a new file defining FlowMatchEulerDiscreteScheduler, a custom Euler scheduler for flow-matching diffusion models.
    • Implemented support for dynamic timestep shifting and various sigma schedules, including Karras, exponential, and beta.
    • Included _time_shift_exponential and _time_shift_linear methods for flexible timestep adjustments.
  • ais_bench/benchmark/models/local_models/qwenimage_edit/transformer_qwenimage.py
    • Added a new file defining QwenImageTransformer2DModel and QwenImageTransformerBlock.
    • QwenImageTransformer2DModel serves as the central transformer for Qwen-Image-Edit, processing image and text embeddings through dual-stream attention blocks.
    • QwenImageTransformerBlock implements a dual-stream attention mechanism (QwenDoubleStreamAttnProcessor2_0) for joint processing of image and text inputs.
    • Integrated NPU-specific optimizations: torch_npu.npu_layer_norm_eval for AdaLayerNorm and torch_npu.npu_rotary_mul for apply_rotary_emb_qwen, controlled by ADALN_FUSE and ROPE_FUSE environment variables.
    • Added cache_cond and cache_uncond attributes for caching intermediate states.
  • ais_bench/benchmark/utils/file/file.py
    • Added mmap and orjson imports for enhanced file I/O operations.
    • Implemented load_jsonl and dump_jsonl functions for efficient reading and writing of JSONL files using mmap for memory-mapped access and orjson for faster JSON parsing.
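
As a rough illustration of the orthogonal rank grouping mentioned above, here is a sketch under the assumption that world_size = tp * sp * cfg with tensor-parallel ranks varying fastest; the actual RankGenerator may order the dimensions differently:

    def orthogonal_rank_groups(world_size: int, tp: int, sp: int, cfg: int):
        """Enumerate tensor-/sequence-/CFG-parallel groups that partition the ranks.

        Assumes rank = (c * sp + s) * tp + t: tp varies fastest, then sp, then cfg.
        """
        assert world_size == tp * sp * cfg
        rank = lambda c, s, t: (c * sp + s) * tp + t
        tp_groups = [[rank(c, s, t) for t in range(tp)]
                     for c in range(cfg) for s in range(sp)]
        sp_groups = [[rank(c, s, t) for s in range(sp)]
                     for c in range(cfg) for t in range(tp)]
        cfg_groups = [[rank(c, s, t) for c in range(cfg)]
                      for s in range(sp) for t in range(tp)]
        return tp_groups, sp_groups, cfg_groups

    # Example with world_size=8, tp=2, sp=2, cfg=2:
    # tp groups  [[0, 1], [2, 3], [4, 5], [6, 7]]
    # sp groups  [[0, 2], [1, 3], [4, 6], [5, 7]]
    # cfg groups [[0, 4], [1, 5], [2, 6], [3, 7]]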

@gemini-code-assist bot left a comment

Code Review

This pull request introduces new features and components for benchmarking, including a JudgeInfer worker that uses a language model for evaluation and new code for the qwenimage_edit model with distributed execution utilities. However, two significant security vulnerabilities were identified: an insecure deserialization flaw using the pickle module in the distributed communication logic, and a potential prompt injection vulnerability in the LLM-as-a-judge prompt template. Addressing these is crucial for the framework's security. Additionally, the review recommends implementing safer task grouping, improving file handling operations, clarifying prompts for better model performance, and removing debugging artifacts and non-English comments to enhance overall robustness and maintainability.

object_tensor, src=self.ranks[src], group=self.cpu_group
)

obj = pickle.loads(object_tensor.numpy().tobytes())


Severity: high (security)

The recv_object method uses pickle.loads to deserialize data received from other ranks in the distributed group. pickle is known to be insecure and can lead to arbitrary code execution if the input data is untrusted. In a distributed environment, if one node is compromised, an attacker could use this to gain control over other nodes in the cluster. It is recommended to use a safer serialization format such as JSON or to implement cryptographic signing and verification of the pickled data.
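
As a hedged illustration of this recommendation (not the repository's API), a length-prefixed JSON receive could look roughly like the following; it assumes the sender transmits a size tensor followed by the UTF-8 JSON bytes over the same CPU process group:

    import json
    import torch
    import torch.distributed as dist

    def recv_object_json(src_rank: int, group) -> object:
        """Receive a length-prefixed UTF-8 JSON payload instead of a pickle blob."""
        size = torch.empty(1, dtype=torch.long)
        dist.recv(size, src=src_rank, group=group)
        payload = torch.empty(int(size.item()), dtype=torch.uint8)
        dist.recv(payload, src=src_rank, group=group)
        # json.loads only parses data; unlike pickle.loads it never executes code
        return json.loads(payload.numpy().tobytes().decode("utf-8"))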

Comment on lines +223 to +224
os.remove(judge_org_prediction_path)
dump_jsonl(judge_preds, judge_org_prediction_path)


Severity: high

The current implementation of updating the prediction file by removing it and then writing to the same path is unsafe. If the dump_jsonl operation fails for any reason (e.g., disk full, permission error), the original prediction file will be lost. A safer pattern is to write the new content to a temporary file and then atomically rename it to the final destination.

            temp_judge_org_prediction_path = judge_org_prediction_path + ".tmp"
            dump_jsonl(judge_preds, temp_judge_org_prediction_path)
            os.replace(temp_judge_org_prediction_path, judge_org_prediction_path)


<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{model_answer}\n<Predicted End>\n\n


Severity: medium (security)

This GRADER_TEMPLATE is susceptible to prompt injection due to the direct embedding of the untrusted {model_answer}. A malicious model could exploit this to manipulate evaluation results. It is critical to use clear delimiters and sanitize the model output to prevent such attacks. Furthermore, the template contains conflicting instructions, asking for "A" or "B" in multiple places but also "CORRECT, INCORRECT" on line 48. This inconsistency can confuse the language model and lead to unreliable evaluations. The instructions should be aligned to consistently expect "A" or "B" to match the post-processing logic.
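
One hedged way to act on this (a sketch using a hypothetical helper, not code from this PR) is to neutralize the template's own delimiters in the untrusted answer before substitution:

    def sanitize_model_answer(text: str) -> str:
        """Hypothetical: escape the grader template's delimiters so an untrusted
        model answer cannot close its own section early or forge a new one."""
        for marker in ("<Original Question", "<Gold Target", "<Predicted"):
            text = text.replace(marker, marker.replace("<", "\\<"))
        return text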

Comment on lines +176 to +182
key = (
task["models"][0]["abbr"] # same model
+ "_"
+ str(task['datasets'][0][0]['type']) # same dataset type
+ "_"
+ str(task["datasets"][0][0]["infer_cfg"]["inferencer"]) # same inferencer with the same args
)


Severity: medium

The current method of generating a key for grouping tasks by concatenating strings with _ is not robust. If any of the components (like task["models"][0]["abbr"]) contain an underscore, it could lead to incorrect grouping. Using a tuple as the key would be a safer and more reliable approach.

            key = (
                task["models"][0]["abbr"],  # same model
                str(task['datasets'][0][0]['type']),  # same dataset type
                str(task["datasets"][0][0]["infer_cfg"]["inferencer"]),  # same inferencer with the same args
            )

os.remove(final_org_results_path)

if os.path.exists(cur_results_path):
# 基于cur_results_path的文件复制一份final_org_results_path


Severity: medium

This comment is in Chinese. To maintain consistency and readability for all contributors, please write comments in English.

Suggested change:

    # Copy the current results file to the final original results path

Comment on lines +126 to +129
# 加载被测模型的推理结果(排序后)
predictions: list = self._load_from_predictions(predictions_path)

# 为数据集添加 model_answer 列


Severity: medium

The comments on lines 126 and 129 are in Chinese. For consistency and to make the code accessible to a wider audience, please write comments in English.

Suggested change:

    # Load the inference results of the model under test (sorted)
    predictions: list = self._load_from_predictions(predictions_path)

    # Add a model_answer column to the dataset

Comment on lines +14 to +17
def get_a_or_b(pred: str) -> str:
"""从模型回复中提取A或B"""
match = re.search(r'[AB]', pred[-1:])
return match.group(0) if match else 'B'


Severity: medium

The implementation of get_a_or_b is not very robust as it only checks the last character of the prediction string. If the model's output has trailing whitespace or newlines (e.g., "A\n"), this function will fail to extract the intended character. A more robust approach would be to search for 'A' or 'B' at the beginning of the stripped string. Also, the docstring is in Chinese and should be translated to English.

Suggested change:

    def get_a_or_b(pred: str) -> str:
        """Extracts 'A' or 'B' from the model's response."""
        if not isinstance(pred, str):
            return 'B'
        match = re.search(r"^[AB]", pred.strip())
        return match.group(0) if match else 'B'

Comment on lines +35 to +36
# 分支 1:scatter_idx=2 且 gather_idx=1(Ulysses 并行的 “拆分多头” 场景),按「多头维度(dim2)」拆分张量,同时将「序列维度(dim1)」重组为完整长度。
if scatter_idx == 2 and gather_idx == 1:


Severity: medium

This comment is in Chinese. To maintain consistency and readability for all contributors, please write comments in English.

Suggested change:

    # Branch 1: scatter_idx=2 and gather_idx=1 (the Ulysses-parallel "split heads" case):
    # split the tensor along the head dimension (dim 2) while reassembling the
    # sequence dimension (dim 1) to its full length.
    if scatter_idx == 2 and gather_idx == 1:
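
For readers unfamiliar with this exchange, here is a self-contained sketch of the scatter_idx=2 / gather_idx=1 branch. The shapes follow the comment above; the implementation details are assumptions rather than the PR's exact code:

    import torch
    import torch.distributed as dist

    def all_to_all_split_heads(x: torch.Tensor, group) -> torch.Tensor:
        """(bs, seq/P, heads, dim) -> (bs, seq, heads/P, dim) across P ranks."""
        p = dist.get_world_size(group=group)
        bs, seq_shard, heads, dim = x.shape
        # Expose the rank axis: (bs, s/P, P, h/P, d) -> (P, s/P, bs, h/P, d)
        x = x.reshape(bs, seq_shard, p, heads // p, dim).permute(2, 1, 0, 3, 4).contiguous()
        out = torch.empty_like(x)
        dist.all_to_all_single(out, x, group=group)  # chunk i of dim 0 goes to rank i
        # dim 0 now indexes the source rank's sequence shard; stitch the sequence back together
        return out.permute(2, 0, 1, 3, 4).reshape(bs, p * seq_shard, heads // p, dim)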

# cpu_group = torch.distributed.new_group(ranks, backend="gloo")

# 修改后(使用HCCL后端)
cpu_group = torch.distributed.new_group(ranks, backend="hccl") # 适配昇腾环境


Severity: medium

The variable cpu_group is initialized with the hccl backend, which is intended for Huawei NPUs (Ascend), not CPUs. This is misleading. If this process group is indeed for CPU-based communication, the gloo backend is more appropriate. If it's for device communication, the variable should be renamed to reflect that. The comment # 适配昇腾环境 (adapt to Ascend environment) should also be in English.

Suggested change:

    cpu_group = torch.distributed.new_group(ranks, backend="hccl")  # Adapt to the Ascend environment

Comment on lines +457 to +463
print("ljf 进入采样器,涉及随机")
x0 = sample - current_sigma * model_output
noise = torch.randn_like(sample)
prev_sample = (1.0 - next_sigma) * x0 + next_sigma * noise
else:
print("ljf 进入采样器,无随机")
prev_sample = sample + dt * model_output


Severity: medium

These print statements with Chinese text appear to be for debugging. They should be removed from the final code to keep the output clean and professional.

Suggested change:

    if self.config.stochastic_sampling:
        x0 = sample - current_sigma * model_output
        noise = torch.randn_like(sample)
        prev_sample = (1.0 - next_sigma) * x0 + next_sigma * noise
    else:
        prev_sample = sample + dt * model_output
