[TEST] Add smoke test cases for the math and agieval datasets by GaoHuaZhang · Pull Request #145 · AISBench/benchmark

GaoHuaZhang · 2026-02-11T03:19:37Z

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

PR Type

Related Issue
Fixes #(issue ID) / Relates to #(issue ID)

🔍 Motivation

Add smoke tests for the agieval and math accuracy evaluation scenarios so that CI can quickly regress the run_server_accuracy / llm_datasets_main flow and verify that the configuration and environment are correct.

📝 Modification

Adds 2 smoke test cases:
- accuracy_agieval: the agieval dataset (the smoke test runs only the agieval-gaokao-chinese subset), vllm-api-general-chat.
- accuracy_math: the math dataset (subset math_prm800k_500), vllm-api-general-chat.
Each case contains case.yml, run.sh, clean.sh, and the datasets and models (vllm_api) configs under ais_bench_configs/. The case directories are smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_agieval/ and accuracy_math/.
Config origin: the configs copied into each case come from the existing configs in ais_bench/benchmark/configs/datasets/ and ais_bench/benchmark/configs/models/. This PR copies them into the smoke case directories and wires up run/clean; it does not modify existing business logic.
10 files changed, +214 lines.
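
The per-case layout described above can be checked mechanically. The following is a minimal sketch, not part of the PR: `check_case_layout` is a hypothetical helper name, and the demo builds a throwaway directory instead of touching the real case directories.

```shell
#!/bin/bash
# check_case_layout is a hypothetical helper, not part of the PR.
# It returns 0 when a smoke case directory holds the files the PR describes.
check_case_layout() {
    local case_dir=$1
    local name
    for name in case.yml run.sh clean.sh; do
        if [ ! -f "${case_dir}/${name}" ]; then
            echo "missing file: ${name}"
            return 1
        fi
    done
    for name in datasets models/vllm_api; do
        if [ ! -d "${case_dir}/ais_bench_configs/${name}" ]; then
            echo "missing directory: ais_bench_configs/${name}"
            return 1
        fi
    done
    return 0
}

# Build a throwaway directory with the described layout and check it.
demo_dir=$(mktemp -d)
mkdir -p "${demo_dir}/ais_bench_configs/datasets" \
         "${demo_dir}/ais_bench_configs/models/vllm_api"
touch "${demo_dir}/case.yml" "${demo_dir}/run.sh" "${demo_dir}/clean.sh"
check_case_layout "${demo_dir}" && echo "layout ok"
```

Pointing the function at accuracy_agieval/ or accuracy_math/ instead of the demo directory would apply the same check to the real cases.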

📐 Associated Test Results

To be added after the CI run.

⚠️ BC-breaking (Optional)

None. This PR only adds smoke test cases and has no downstream compatibility impact.

⚠️ Performance degradation (Optional)

None.

🌟 Use cases (Optional)

✅ Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests, and the case that causes the bug should be added to the unit tests.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

👥 Collaboration Info

Suggested Reviewers: @xxx
Relevant Module Owners: @xxx
Other Collaboration Notes:

🌟 Useful CI Command

Command	Introduction
`/gemini review`	Gemini performs a code review of the pull request in its current state.
`/gemini summary`	Gemini provides a summary of the pull request in its current state.
`/gemini help`	Displays a list of available Gemini commands.
`/readthedocs build`	Read the Docs triggers a documentation build for the pull request in its current state.

gemini-code-assist · 2026-02-11T03:19:59Z

Summary of Changes

Hello @GaoHuaZhang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

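This pull request aims to strengthen the project's test coverage and stability by introducing new smoke test cases for the math and agieval datasets. The cases are configured to use a limited data sample and specific model settings so that validation is fast, confirming that the integration and basic functionality of key datasets and models within the ais_bench framework work correctly.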

Highlights

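New AgiEval dataset smoke test case: Adds a smoke test case for the AgiEval dataset to quickly verify its basic functionality and integration.
New Math dataset smoke test case: Adds a smoke test case for the Math dataset to confirm that its core functionality works in a quick check.
Dataset configuration trimmed: For the AgiEval and Math smoke tests, the dataset configs are restricted to a single subset or a small sample range ([0:10]) to speed up the tests.
Model configuration updated: Configures the vllm_api model for the AgiEval and Math smoke tests, using the 'qwen' model for both and setting specific max_out_len and batch_size values.
Test scripts and config files introduced: Each new smoke test adds case.yml, run.sh, and clean.sh to define the test case, run the test flow, and clean up the test environment.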

Changelog

smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_agieval/ais_bench_configs/datasets/agieval/accuracy_agieval.py
- Configures the dataset for the AgiEval smoke test, restricted to a single subset and a small test range.
smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_agieval/ais_bench_configs/models/vllm_api/accuracy_agieval.py
- Defines the model config for the AgiEval smoke test, specifying the 'qwen' model and setting max_out_len and batch_size.
smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_agieval/case.yml
- Adds the benchmark case definition file for the accuracy_agieval smoke test.
smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_agieval/clean.sh
- Creates the cleanup script for the AgiEval smoke test, which deletes the temporary log file.
smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_agieval/run.sh
- Implements the execution script for the AgiEval accuracy smoke test, including environment setup and result verification.
smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_math/ais_bench_configs/datasets/math/accuracy_math.py
- Configures the dataset for the Math smoke test, restricted to a small test range.
smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_math/ais_bench_configs/models/vllm_api/accuracy_math.py
- Defines the model config for the Math smoke test, specifying the 'qwen' model and setting max_out_len and batch_size.
smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_math/case.yml
- Adds the benchmark case definition file for the accuracy_math smoke test.
smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_math/clean.sh
- Creates the cleanup script for the Math smoke test, which deletes the temporary log file.
smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_math/run.sh
- Implements the execution script for the Math accuracy smoke test, including environment setup and result verification.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

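This PR adds smoke test cases for the math and agieval datasets, which helps safeguard code quality. While reviewing the code, I noticed that the run.sh and clean.sh scripts of the accuracy_agieval and accuracy_math test cases duplicate a large amount of code. To improve maintainability, I strongly recommend extracting the common logic of these scripts into a shared script. For example, a generic run.sh could be created, and each test case's run.sh would set its specific variables and call it. This reduces redundant code and makes future changes easier. In addition, I have made some specific suggestions about the duplicated code blocks in run.sh and about simplifying clean.sh.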
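
One way the shared-script idea could look is sketched below. This is an illustration under assumptions, not the PR's code: `smoke_common_check` and `smoke_common_clean` are hypothetical names, the functions would normally live in one shared file sourced by each case, and a temporary directory stands in for a real case directory.

```shell
#!/bin/bash
# Sketch only: smoke_common_check and smoke_common_clean are hypothetical names.
# In a real layout they would sit in one shared file sourced by every case.

# Shared logic: verify that every expected output file exists.
smoke_common_check() {
    local output_file
    for output_file in "$@"; do
        if [ ! -f "$output_file" ]; then
            echo "Can't find $output_file"
            return 1
        fi
    done
}

# Shared logic: remove the temporary log left in a case directory.
smoke_common_clean() {
    rm -f "$1/tmplog.txt"
}

# Per-case part: a case script would only set its own variables ...
CASE_DIR=$(mktemp -d)   # stand-in for the case directory
touch "$CASE_DIR/summary.csv" "$CASE_DIR/tmplog.txt"

# ... and then call the shared functions.
smoke_common_check "$CASE_DIR/summary.csv" && echo "outputs present"
smoke_common_clean "$CASE_DIR"
```

The functions use `return` rather than `exit` so that the calling script decides which exit code to report.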

gemini-code-assist · 2026-02-11T03:21:29Z

smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_agieval/clean.sh

@@ -0,0 +1,4 @@
+#!/bin/bash
+CUR_DIR=$(dirname $(readlink -f $0))
+[ -f "${CUR_DIR}/tmplog.txt" ] && rm -f "${CUR_DIR}/tmplog.txt"


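This command can be simplified. rm -f does not report an error when the file does not exist, so the preceding [ -f ... ] check is redundant.

Also, this clean.sh script is identical to the one in the accuracy_math test case. To avoid duplicated code, consider managing the script in one place, for example by creating a common cleanup script shared by all test cases.

Suggested change

-[ -f "${CUR_DIR}/tmplog.txt" ] && rm -f "${CUR_DIR}/tmplog.txt"
+rm -f "${CUR_DIR}/tmplog.txt"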
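
The difference between the two forms can be observed in a throwaway directory. The sketch below uses `tmp_dir` as a stand-in for `CUR_DIR`. The hunk header announces four lines while the excerpt shows three, so whether the guarded command is the last one in clean.sh is not visible here; if it were, the script's own exit status would follow the guarded form.

```shell
#!/bin/bash
# Demonstration in a throwaway directory; tmp_dir is a stand-in for CUR_DIR.
tmp_dir=$(mktemp -d)

# Guarded form with the file absent: the [ -f ... ] test fails, so the
# whole && list reports status 1 even though nothing went wrong.
[ -f "${tmp_dir}/tmplog.txt" ] && rm -f "${tmp_dir}/tmplog.txt"
status_guarded=$?

# Plain form with the file absent: rm -f succeeds and prints nothing.
rm -f "${tmp_dir}/tmplog.txt"
status_plain=$?

echo "guarded: ${status_guarded}, plain: ${status_plain}"
# prints "guarded: 1, plain: 0"
rmdir "${tmp_dir}"
```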

gemini-code-assist · 2026-02-11T03:21:29Z

smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_agieval/run.sh

+if [ ! -f "$LOG_EVAL_OUTPUT_PATH" ];then
+    echo "Can't find $LOG_EVAL_OUTPUT_PATH"
+    exit $ret_failed
+fi
+if [ ! -f "$LOG_INFER_OUTPUT_PATH" ];then
+    echo "Can't find $LOG_INFER_OUTPUT_PATH"
+    exit $ret_failed
+fi
+if [ ! -f "$PREDICTIONS_OUTPUT_PATH" ];then
+    echo "Can't find $PREDICTIONS_OUTPUT_PATH"
+    exit $ret_failed
+fi
+if [ ! -f "$RESULTS_OUTPUT_PATH" ];then
+    echo "Can't find $RESULTS_OUTPUT_PATH"
+    exit $ret_failed
+fi
+if [ ! -f "$SUMMARY_OUTPUT_PATH" ];then
+    echo "Can't find $SUMMARY_OUTPUT_PATH"
+    exit $ret_failed
+fi


This code contains many repeated if statements that check whether a file exists. To improve readability and maintainability, consider replacing the repeated checks with a loop.

Suggested change

-if [ ! -f "$LOG_EVAL_OUTPUT_PATH" ];then
-    echo "Can't find $LOG_EVAL_OUTPUT_PATH"
-    exit $ret_failed
-fi
-if [ ! -f "$LOG_INFER_OUTPUT_PATH" ];then
-    echo "Can't find $LOG_INFER_OUTPUT_PATH"
-    exit $ret_failed
-fi
-if [ ! -f "$PREDICTIONS_OUTPUT_PATH" ];then
-    echo "Can't find $PREDICTIONS_OUTPUT_PATH"
-    exit $ret_failed
-fi
-if [ ! -f "$RESULTS_OUTPUT_PATH" ];then
-    echo "Can't find $RESULTS_OUTPUT_PATH"
-    exit $ret_failed
-fi
-if [ ! -f "$SUMMARY_OUTPUT_PATH" ];then
-    echo "Can't find $SUMMARY_OUTPUT_PATH"
-    exit $ret_failed
-fi
+for output_file in \
+    "$LOG_EVAL_OUTPUT_PATH" \
+    "$LOG_INFER_OUTPUT_PATH" \
+    "$PREDICTIONS_OUTPUT_PATH" \
+    "$RESULTS_OUTPUT_PATH" \
+    "$SUMMARY_OUTPUT_PATH"
+do
+    if [ ! -f "$output_file" ]; then
+        echo "Can't find $output_file"
+        exit $ret_failed
+    fi
+done

gemini-code-assist · 2026-02-11T03:21:29Z

smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_math/clean.sh

@@ -0,0 +1,4 @@
+#!/bin/bash
+CUR_DIR=$(dirname $(readlink -f $0))
+[ -f "${CUR_DIR}/tmplog.txt" ] && rm -f "${CUR_DIR}/tmplog.txt"


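This command can be simplified. rm -f does not report an error when the file does not exist, so the preceding [ -f ... ] check is redundant.

Also, this clean.sh script is identical to the one in the accuracy_agieval test case. To avoid duplicated code, consider managing the script in one place, for example by creating a common cleanup script shared by all test cases.

Suggested change

-[ -f "${CUR_DIR}/tmplog.txt" ] && rm -f "${CUR_DIR}/tmplog.txt"
+rm -f "${CUR_DIR}/tmplog.txt"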

gemini-code-assist · 2026-02-11T03:21:29Z

smoke_tests/test-case/run_server_accuracy/llm_datasets_main/accuracy_math/run.sh

+if [ ! -f "$LOG_EVAL_OUTPUT_PATH" ];then
+    echo "Can't find $LOG_EVAL_OUTPUT_PATH"
+    exit $ret_failed
+fi
+if [ ! -f "$LOG_INFER_OUTPUT_PATH" ];then
+    echo "Can't find $LOG_INFER_OUTPUT_PATH"
+    exit $ret_failed
+fi
+if [ ! -f "$PREDICTIONS_OUTPUT_PATH" ];then
+    echo "Can't find $PREDICTIONS_OUTPUT_PATH"
+    exit $ret_failed
+fi
+if [ ! -f "$RESULTS_OUTPUT_PATH" ];then
+    echo "Can't find $RESULTS_OUTPUT_PATH"
+    exit $ret_failed
+fi
+if [ ! -f "$SUMMARY_OUTPUT_PATH" ];then
+    echo "Can't find $SUMMARY_OUTPUT_PATH"
+    exit $ret_failed
+fi


This code contains many repeated if statements that check whether a file exists. To improve readability and maintainability, consider replacing the repeated checks with a loop. The same issue also exists in accuracy_agieval/run.sh.

Suggested change

-if [ ! -f "$LOG_EVAL_OUTPUT_PATH" ];then
-    echo "Can't find $LOG_EVAL_OUTPUT_PATH"
-    exit $ret_failed
-fi
-if [ ! -f "$LOG_INFER_OUTPUT_PATH" ];then
-    echo "Can't find $LOG_INFER_OUTPUT_PATH"
-    exit $ret_failed
-fi
-if [ ! -f "$PREDICTIONS_OUTPUT_PATH" ];then
-    echo "Can't find $PREDICTIONS_OUTPUT_PATH"
-    exit $ret_failed
-fi
-if [ ! -f "$RESULTS_OUTPUT_PATH" ];then
-    echo "Can't find $RESULTS_OUTPUT_PATH"
-    exit $ret_failed
-fi
-if [ ! -f "$SUMMARY_OUTPUT_PATH" ];then
-    echo "Can't find $SUMMARY_OUTPUT_PATH"
-    exit $ret_failed
-fi
+for output_file in \
+    "$LOG_EVAL_OUTPUT_PATH" \
+    "$LOG_INFER_OUTPUT_PATH" \
+    "$PREDICTIONS_OUTPUT_PATH" \
+    "$RESULTS_OUTPUT_PATH" \
+    "$SUMMARY_OUTPUT_PATH"
+do
+    if [ ! -f "$output_file" ]; then
+        echo "Can't find $output_file"
+        exit $ret_failed
+    fi
+done

add smoke test

eec4f05

gemini-code-assist bot reviewed Feb 11, 2026

github-actions bot added the test-cases label Feb 11, 2026

GaoHuaZhang wants to merge 1 commit into AISBench:master from GaoHuaZhang:smoke_add
