feat: implement quota-aware prequeue scheduling #393
Merged
Jinghao-coding merged 1 commit into raids-lab:main (Apr 22, 2026)
Pull request overview
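This PR adds a prequeue admission layer in front of Crater's Volcano scheduling pipeline. It introduces quota-aware scheduling and preemption built on per-queue/per-user resource limits and backfill jobs, together with persisted configuration, admin APIs, and frontend configuration and display support.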
Changes:
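- Backend: adds PrequeueWatcher (always on, configurable through the database) with full-scan activation, quota-aware candidate selection, blocking by timed-out normal jobs, and single-node backfill preemption; adds job scheduling metadata (scheduleType / waitingTolerance) plus persistence, migrations, and admin APIs for the prequeue configuration.
- Frontend: adds schedule type display and filtering, scheduleType selection in the job submission forms (gated by the backfill switch), a resource usage summary, and the system configuration UI and i18n strings for prequeue.
- Docs: adds multilingual admin documentation pages for prequeue and user resource limits.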
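The quota-aware candidate selection mentioned in the backend item can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: the `Resources` map, the `Candidate` struct, and the `fits` and `selectCandidates` helpers are hypothetical names, quantities are plain integers instead of Kubernetes quantities, and the choice to skip (rather than stop at) a job that does not fit is an assumption.

```go
package main

import "fmt"

// Resources maps a resource name (e.g. "cpu", "gpu") to a quantity.
// Plain integers keep the sketch self-contained; this is an assumption.
type Resources map[string]int64

// Candidate is a hypothetical prequeued job waiting for activation.
type Candidate struct {
	Name    string
	Request Resources
}

// fits reports whether used+request stays within limit for every limited
// resource. Resources absent from limit are treated as unlimited (assumption).
func fits(used, request, limit Resources) bool {
	for name, max := range limit {
		if used[name]+request[name] > max {
			return false
		}
	}
	return true
}

// selectCandidates walks candidates in queue order and picks those that fit
// into the remaining quota, counting jobs already picked in this round.
func selectCandidates(candidates []Candidate, used, limit Resources) []string {
	running := Resources{}
	for k, v := range used {
		running[k] = v
	}
	var picked []string
	for _, c := range candidates {
		if !fits(running, c.Request, limit) {
			continue // stays in prequeue; skipping instead of stopping is an assumption
		}
		for k, v := range c.Request {
			running[k] += v
		}
		picked = append(picked, c.Name)
	}
	return picked
}

func main() {
	limit := Resources{"gpu": 4}
	used := Resources{"gpu": 2}
	queue := []Candidate{
		{"job-a", Resources{"gpu": 1}},
		{"job-b", Resources{"gpu": 2}},
		{"job-c", Resources{"gpu": 1}},
	}
	fmt.Println(selectCandidates(queue, used, limit)) // prints [job-a job-c]
}
```

Copying `used` into `running` keeps the caller's usage snapshot untouched, so a failed activation later in the round would not leave stale accounting behind.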
Reviewed changes
Copilot reviewed 97 out of 98 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| website/content/docs/admin/more/prequeue.mdx | Adds the Chinese admin documentation for prequeue / user resource limits |
| website/content/docs/admin/more/prequeue.ko.mdx | Adds the Korean admin documentation |
| website/content/docs/admin/more/prequeue.jp.mdx | Adds the Japanese admin documentation |
| website/content/docs/admin/more/prequeue.en.mdx | Adds the English admin documentation |
| frontend/src/utils/form.ts | Adds a utility function for aggregating resources across multiple tasks |
| frontend/src/utils/accelerator.ts | Extracts the accelerator vendor/style parsing utilities |
| frontend/src/services/api/vcjob.ts | Adds ScheduleType, the Prequeue phase, and a unified phase mapping |
| frontend/src/services/api/system-config.ts | Adds API definitions for managing the prequeue system configuration |
| frontend/src/services/api/queue-quota.ts | Adds the queue quota management API |
| frontend/src/services/api/context.ts | Adds context APIs for prequeue status and the resource summary |
| frontend/src/routes/portal/overview/index.tsx | Overview page adds scheduleType display/filtering and unifies how phases are counted |
| frontend/src/routes/portal/jobs/new/webide-job.tsx | WebIDE submission flow adopts scheduleType and the unified submit button |
| frontend/src/routes/portal/jobs/new/tensorflow-ps-job.tsx | TensorFlow submission flow adopts the unified submit button |
| frontend/src/routes/portal/jobs/new/single-job.tsx | Single-task training submission adopts scheduleType and the unified submit button |
| frontend/src/routes/portal/jobs/new/seacs-job.tsx | SEACS submission flow adopts the unified submit button |
| frontend/src/routes/portal/jobs/new/pytorch-ddp-job.tsx | PyTorch DDP submission flow adopts the unified submit button |
| frontend/src/routes/portal/jobs/new/jupyter-job.tsx | Jupyter submission flow adopts scheduleType and the unified submit button |
| frontend/src/routes/portal/jobs/new/emias-jupyter-job.tsx | EMIAS Jupyter submission flow adopts the unified submit button |
| frontend/src/routes/portal/jobs/new/emias-job.tsx | EMIAS custom job submission flow adopts the unified submit button |
| frontend/src/routes/portal/jobs/inter/index.tsx | Interactive job list adds scheduleType and the resource summary component |
| frontend/src/routes/admin/more/-components/prequeue-settings.tsx | Adds the admin component for prequeue runtime parameter settings |
| frontend/src/i18n/locales/zhCN/translation.json | Adds Chinese strings for prequeue / resource summary / scheduleType |
| frontend/src/i18n/locales/enUS/translation.json | Adds English strings for prequeue / resource summary / scheduleType |
| frontend/src/components/user/user-jobs.tsx | User job list adds a scheduleType column and the unified phase |
| frontend/src/components/layout/detail-page.tsx | Adjusts the detail page layout to support scrolling and filling the available height |
| frontend/src/components/job/statuses.ts | Adds scheduleType filter options and the table header mapping |
| frontend/src/components/job/overview/job-lock-sheet.tsx | Disables the lock request entry for backfill jobs and shows a hint |
| frontend/src/components/job/overview/job-actions-menu.tsx | Passes scheduleType to the lock dialog |
| frontend/src/components/job/overview/emias-jobs.tsx | Replaces the job list brief area with the resource summary component; uses the unified phase |
| frontend/src/components/job/overview/custom-jobs.tsx | Volcano job list adds scheduleType and the resource summary component |
| frontend/src/components/job/overview/admin-jobs.tsx | Admin job list adds scheduleType and the unified phase |
| frontend/src/components/job/job-submit-button.tsx | Adds a shared "submit job" button component (with loading state) |
| frontend/src/components/job/detail/index.tsx | Job deletion distinguishes the admin and user APIs and refreshes the resource summary cache |
| frontend/src/components/form/schedule-type-form-field.tsx | Adds a scheduleType radio form field component |
| frontend/src/components/button/loadable-button.tsx | Supports an additional disabled flag (combined with loading) |
| frontend/src/components/badge/schedule-type-badge.tsx | Adds a schedule type badge |
| frontend/src/components/badge/job-phase-badge.tsx | Localizes the Pending label via i18n and displays Prequeue as Pending |
| frontend/src/components/badge/accelerator-badge.tsx | Reuses the accelerator utilities and improves the display (truncation/tooltip) |
| backend/pkg/vcqueue/queue.go | Volcano Queue reads/writes now use cluster-scoped object keys; adds ResolveJobQueueName |
| backend/pkg/vcqueue/const.go | Adds a constant for the public queue name |
| backend/pkg/utils/resource.go | Adds utilities for resource subtraction and ResourceList stringification |
| backend/pkg/utils/job.go | Adds single-node job detection, resource domain computation, and blocking checks |
| backend/pkg/prequeuewatcher/watcher_test.go | Adds tests for PrequeueWatcher activation transactions and rollback |
| backend/pkg/prequeuewatcher/watcher.go | Adds the watcher main loop (signal + ticker + dynamic config refresh) |
| backend/pkg/prequeuewatcher/scheduling.go | Adds node/Pod scheduling constraint matching and available resource computation |
| backend/pkg/prequeuewatcher/scan.go | Adds full-scan round orchestration and timeout check helpers |
| backend/pkg/prequeuewatcher/runtime_config.go | Adds watcher runtime config loading/validation and hot updates of the ticker |
| backend/pkg/prequeuewatcher/resources.go | Adds the resource subset search for backfill preemption and deficit computation |
| backend/pkg/prequeuewatcher/preemption.go | Adds planning and execution of backfill preemption triggered by timed-out single-node normal jobs |
| backend/pkg/prequeuewatcher/activation.go | Adds prequeue candidate selection, quota checks, and atomic claim+activate |
| backend/pkg/crclient/nodeclient.go | UpdateNodeunschedule now returns whether the node was unschedulable before the update, so callers can act on it |
| backend/pkg/aitaskctl/controller.go | Updates the gocyclo comment |
| backend/internal/util/type_test.go | Adds MapToStruct unit tests covering multiple tags, durations, pointers, and similar cases |
| backend/internal/util/type.go | Adds the MapToStruct utility (injects a string map into a struct) |
| backend/internal/service/vcjob/schedule.go | Adds scheduling metadata annotation keys and parse/write functions |
| backend/internal/service/vcjob/runtime_test.go | Adds scheduling metadata parsing and round-trip tests |
| backend/internal/service/vcjob/runtime.go | Adds resource computation, JobRecord generation, and reusable activation/Ingress/Forwards logic |
| backend/internal/service/prequeue_service_test.go | Adds tests for queue quota parsing and resource summary construction |
| backend/internal/service/config_service_test.go | Adds tests for prequeue runtime config parsing/validation |
| backend/internal/service/config_service.go | Adds prequeue_configs initialization, read, update, and validation logic |
| backend/internal/handler/vcjob/webide.go | WebIDE creation flow adopts scheduleType and the unified submitJob |
| backend/internal/handler/vcjob/util.go | Label/annotation construction supports schedule metadata and writes the forwards annotation |
| backend/internal/handler/vcjob/tensorflow.go | TensorFlow creation flow adopts scheduleType and the unified submitJob |
| backend/internal/handler/vcjob/pytorch.go | PyTorch creation flow adopts scheduleType and the unified submitJob |
| backend/internal/handler/vcjob/lifecycle.go | Adds submitJob: after the quota/blocking checks, decides whether the job enters Prequeue or is activated directly |
| backend/internal/handler/vcjob/jupyter.go | Jupyter creation flow adopts scheduleType and the unified submitJob |
| backend/internal/handler/vcjob/custom.go | Custom/Training creation flow adopts scheduleType and the unified submitJob |
| backend/internal/handler/system_config.go | Adds admin/system-config/prequeue GET/PUT and triggers a watcher full scan |
| backend/internal/handler/queue_quota.go | Adds the queue quota admin API and triggers a watcher full scan |
| backend/internal/handler/node.go | Triggers a watcher full scan after a node resumes scheduling |
| backend/internal/handler/interface.go | RegisterConfig injects PrequeueWatcher/PrequeueService |
| backend/internal/handler/approvalorder.go | Forbids backfill jobs from creating lock-extension approval orders (backend validation) |
| backend/internal/handler/aijob/emias.go | Updates the gocyclo comment |
| backend/go.sum | Dependency updates (including component-helpers, sqlite, and others) |
| backend/go.mod | Adds component-helpers, the sqlite driver, and other dependencies |
| backend/dao/query/jobs.gen.go | Job query layer adds the schedule_type / waiting_tolerance_seconds / queue fields |
| backend/dao/query/gen.go | Query set adds prequeue_config / queue_quota_limit |
| backend/dao/model/queue_quota.go | Adds the queue_quotas model |
| backend/dao/model/prequeue_config.go | Adds the prequeue_configs model and default value constants |
| backend/dao/model/job.go | Job model adds the Prequeue phase, ScheduleType, Queue, and WaitingToleranceSeconds |
| backend/cmd/gorm-gen/models/migrate.go | Adds migrations for queues, quotas, and the prequeue configuration |
| backend/cmd/gorm-gen/curd/generate.go | gorm-gen adds generation entries for the new models |
| backend/cmd/crater/helper/manager.go | Adds PrequeueWatcher to the controller-runtime manager |
| backend/cmd/crater/helper/config.go | Initializes PrequeueService and PrequeueWatcher and injects their dependencies |
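
As an illustration of the "resource subtraction and ResourceList stringification" helpers listed for backend/pkg/utils/resource.go, here is a minimal sketch. It does not reproduce the PR's code: `Resources`, `subtract`, and `format` are hypothetical names, plain integers stand in for Kubernetes quantities, and clamping negative results to zero is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Resources maps a resource name to an integer quantity (assumption:
// the real helpers most likely operate on Kubernetes resource lists).
type Resources map[string]int64

// subtract returns a-b for every resource present in a.
// Negative results are clamped to zero, which is an assumption.
func subtract(a, b Resources) Resources {
	out := make(Resources, len(a))
	for name, v := range a {
		d := v - b[name]
		if d < 0 {
			d = 0
		}
		out[name] = d
	}
	return out
}

// format renders the resources as sorted "name=value" pairs so that
// log lines and test expectations are deterministic.
func format(r Resources) string {
	names := make([]string, 0, len(r))
	for n := range r {
		names = append(names, n)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, fmt.Sprintf("%s=%d", n, r[n]))
	}
	return strings.Join(parts, ", ")
}

func main() {
	allocatable := Resources{"cpu": 16, "gpu": 4}
	used := Resources{"cpu": 10, "gpu": 4, "memory": 8}
	fmt.Println(format(subtract(allocatable, used))) // prints cpu=6, gpu=0
}
```

Sorting the keys matters because Go map iteration order is randomized; without it the string form would differ between runs.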
Comment on lines +172 to +174

    if candidate.ScheduleType != nil && *candidate.ScheduleType == model.ScheduleTypeBackfill && !backfillEnabled {
        return false
    }

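[Core convention] When backfill is disabled, this returns false (meaning "not blocked") for backfill jobs, so selectActivationCandidates may still select and activate backfill jobs. That contradicts the semantics of the "allow submitting backfill jobs" switch. Consider treating the candidate as blocked when backfillEnabled=false and the candidate is a backfill job (or filter backfill jobs out when building the candidate list).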
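
One way to read the suggested fix is sketched below. It is illustrative only: the `Candidate` struct, the `isBackfill` and `isBlocked` helpers, and the string values of the schedule types are assumptions, and the real function in activation.go contains other blocking rules that are not shown here.

```go
package main

import "fmt"

// ScheduleType mirrors the scheduleType metadata; the concrete string
// values below are assumptions for this sketch.
type ScheduleType string

const (
	ScheduleTypeNormal   ScheduleType = "normal"
	ScheduleTypeBackfill ScheduleType = "backfill"
)

// Candidate is a hypothetical stand-in for a prequeued job record.
type Candidate struct {
	Name         string
	ScheduleType *ScheduleType
}

// isBackfill treats a nil ScheduleType as "not backfill".
func isBackfill(c Candidate) bool {
	return c.ScheduleType != nil && *c.ScheduleType == ScheduleTypeBackfill
}

// isBlocked returns true when the candidate must NOT be activated.
// With backfill disabled, backfill candidates are reported as blocked,
// which is the behavior the review comment asks for.
func isBlocked(c Candidate, backfillEnabled bool) bool {
	if isBackfill(c) && !backfillEnabled {
		return true
	}
	// Other blocking rules (quota, timed-out normal jobs) would go here.
	return false
}

func main() {
	backfill := ScheduleTypeBackfill
	job := Candidate{Name: "bf-1", ScheduleType: &backfill}
	fmt.Println(isBlocked(job, false), isBlocked(job, true))
}
```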
Comment on lines 89 to 93

      key: 'status',
      title: '状态',
      option: jobPhases,
    - defaultValues: ['Running', 'Pending'],
    + defaultValues: ['Running', 'Pending', 'Prequeue'],
      },

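[Core convention] The status column uses getUnifiedJobPhase to map Prequeue to Pending, and the status filter options in jobPhases do not include Prequeue either. The default filter values here still contain 'Prequeue', which makes the filter UI and behavior inconsistent (the defaults include a value that never appears among the options). Consider removing 'Prequeue' from the defaults, or explicitly supporting Prequeue in both the filter options and the accessor logic.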
Add the new prequeue scheduling flow for normal and backfill jobs. Normal jobs now carry schedule type and waiting tolerance metadata, enter prequeue when user queue quota is exceeded or timed-out normal jobs block the same resource domain, and can trigger backfill preemption after exceeding the configured tolerance.

Add queue quota persistence, migrations, generated query layers, admin APIs, Swagger updates, and frontend management UI for per-queue resource limits and prequeue runtime settings. Surface job resource usage summaries in the frontend and wire job submit forms to schedule type selection, quota checks, and prequeue status handling.

Make PrequeueWatcher always-on and database-configurable with full-scan activation, quota-aware candidate selection, timed-out job blocking, and single-node backfill preemption.
MOONSakura0614 added a commit to MOONSakura0614/crater that referenced this pull request Apr 28, 2026
Main brought in:
- Billing flow: new Account/UserAccount/Job/Resource fields, BillingService with a patrol-loop cron (biling-base-loop), visibility controls UI
- Quota-aware prequeue scheduling: PrequeueConfig/QueueQuotaLimit tables, PrequeueService, admin config + frontend prequeue page (raids-lab#393)
- CPU pinning hint (raids-lab#384) - job-new forms gain a pin toggle
- Heterogeneous GPU / accelerator helpers in frontend/src/utils/accelerator.ts
- Billing visibility utils, queue-quota/billing API clients
- Adjusted signatures:
  aitaskctl.CheckResourcesBeforeCreateJob → (resources, error)
  nodeClient.UpdateNodeunschedule → (bool, error)
  vcjob/util.go getLabelAndAnnotations → (*CreateJobCommon, *jobScheduleMetadata)

Conflict resolutions:
- backend/cmd/crater/helper/config.go - wire both adminOpsReportService AND registerConfig.BillingService into NewCronJobManager; keep both SetCronJobManager + EnsureBuiltinCronJobs calls.
- backend/pkg/cronjob/manger.go - NewCronJobManager takes both patrol services; Clients struct keeps both fields.
- backend/pkg/patrol/patrol.go - keep all four cron constants (GPU analysis, admin ops report, storage daily audit, billing base loop) plus all three interfaces; dispatch switch handles every case.
- backend/cmd/gorm-gen/models/migrate.go - concat agent migrations (202604220001..202604230002) then main's billing/prequeue migrations (202603311930..202604171200). Removed a stray duplicated billing block that sneaked in during conflict editing.
- backend/internal/handler/approvalorder.go - keep both the job-validation pre-flight check from main AND the auto-approval branch from HEAD; imports merged (context + encoding/json + errors).
- backend/internal/service/config_service.go - keep HEAD's multi-line UpdateLLMConfig signature.
- backend/internal/handler/agent/tools_cluster_write.go - adapt cordon / uncordon to the new (bool, error) return from UpdateNodeunschedule.
- backend/internal/handler/vcjob/agent_submit.go - adapt CheckResourcesBeforeCreateJob destructuring and the new getLabelAndAnnotations(jobType, token, baseURL, *CreateJobCommon, nil) signature (scheduleMetadata is nil; the agent submission path uses default scheduling for now).
- frontend/src/routes/admin/cronjobs/index.tsx - registry includes trigger-admin-ops-report-job AND biling-base-loop patrol rows.
- backend/docs/docs.go, swagger.json, swagger.yaml - take main's regenerated swagger; will be re-generated next time we run swag.
- .gitignore - union of HEAD block and main's .codex entry.

Both builds pass:
  backend: go build ./... (clean)
  frontend: pnpm build, tsc --noEmit (only pre-existing orders/$id.tsx error)
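
Adapting a caller to the new (bool, error) return of UpdateNodeunschedule might look like the sketch below. Only the return shape comes from the text above; the `NodeClient` interface, its single `name` parameter, the toggle semantics of the fake client, and the `toggleAndMaybeScan` helper are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeClient is a hypothetical interface that mirrors only the new return
// shape: the bool reports whether the node was unschedulable before the call.
type NodeClient interface {
	UpdateNodeunschedule(name string) (wasUnschedulable bool, err error)
}

// fakeNodes is an in-memory stand-in that toggles the flag (assumed semantics).
type fakeNodes struct {
	unschedulable map[string]bool
}

func (f *fakeNodes) UpdateNodeunschedule(name string) (bool, error) {
	was, ok := f.unschedulable[name]
	if !ok {
		return false, errors.New("node not found: " + name)
	}
	f.unschedulable[name] = !was
	return was, nil
}

// toggleAndMaybeScan flips the node's state and requests a prequeue full scan
// only when the node has just become schedulable again.
func toggleAndMaybeScan(c NodeClient, name string, triggerScan func()) error {
	was, err := c.UpdateNodeunschedule(name)
	if err != nil {
		return fmt.Errorf("toggle node %s: %w", name, err)
	}
	if was {
		triggerScan() // capacity may have been freed for prequeued jobs
	}
	return nil
}

func main() {
	nodes := &fakeNodes{unschedulable: map[string]bool{"node-1": true}}
	scans := 0
	_ = toggleAndMaybeScan(nodes, "node-1", func() { scans++ }) // uncordon: scan
	_ = toggleAndMaybeScan(nodes, "node-1", func() { scans++ }) // cordon: no scan
	fmt.Println("full scans triggered:", scans)
}
```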