feat: implement quota-aware prequeue scheduling #393

Merged
Jinghao-coding merged 1 commit into raids-lab:main from YiD11:feat/quota
Apr 22, 2026

@YiD11 (Contributor) commented on Apr 20, 2026

Add the new prequeue scheduling flow for normal and backfill jobs. Normal jobs now carry schedule type and waiting tolerance metadata, enter prequeue when user queue quota is exceeded or timed-out normal jobs block the same resource domain, and can trigger backfill preemption after exceeding the configured tolerance.

Add queue quota persistence, migrations, generated query layers, admin APIs, Swagger updates, and frontend management UI for per-queue resource limits and prequeue runtime settings. Surface job resource usage summaries in the frontend and wire job submit forms to schedule type selection, quota checks, and prequeue status handling.

Make PrequeueWatcher always-on and database-configurable with full-scan activation, quota-aware candidate selection, timed-out job blocking, and single-node backfill preemption.
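
The admission rule described above (enter prequeue when the user's queue quota would be exceeded, or when a timed-out normal job blocks the same resource domain) can be pictured with a minimal Go sketch. Everything here is an illustrative assumption rather than the PR's code: the `Resources` map stands in for Kubernetes resource lists, and `exceeds` and `admit` are hypothetical names.

```go
package main

// Resources is a simplified resource vector (assumption: the real code
// works on Kubernetes resource lists, not plain integers).
type Resources map[string]int64

// exceeds reports whether used+requested goes over the limit for any
// resource that has a limit. Resources without a limit entry are treated
// as unlimited, which is an assumption of this sketch.
func exceeds(used, requested, limit Resources) bool {
	for name, lim := range limit {
		if used[name]+requested[name] > lim {
			return true
		}
	}
	return false
}

// Decision is the outcome of the hypothetical admission check.
type Decision string

const (
	Activate Decision = "activate" // hand the job to the scheduler right away
	Prequeue Decision = "prequeue" // park the job until a later scan admits it
)

// admit sends a job to the prequeue when the quota would be exceeded or
// when the job's resource domain is blocked by a timed-out normal job.
func admit(used, requested, limit Resources, domainBlocked bool) Decision {
	if domainBlocked || exceeds(used, requested, limit) {
		return Prequeue
	}
	return Activate
}
```

For example, with a limit of 4 GPUs, a user already holding 3 who requests 2 more would be prequeued, while a user holding 1 would be activated unless the resource domain is blocked.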

Copilot AI review requested due to automatic review settings April 20, 2026 15:56
@YiD11 added the backend (web backend or deployments) and frontend (website and web frontend) labels on Apr 20, 2026

Copilot AI left a comment

Pull request overview

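This PR adds a "Prequeue" admission layer in front of Crater's Volcano scheduling path. It introduces per-queue/per-user resource limits together with quota-aware scheduling and a preemption flow for backfill jobs, along with persisted configuration, admin APIs, and frontend configuration and display support.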

Changes:

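  • Backend: adds PrequeueWatcher (always-on, database-configurable) with full-scan activation, quota-aware candidate selection, blocking by timed-out normal jobs, and single-node backfill preemption; adds job scheduling metadata (scheduleType / waitingTolerance) plus persistence, migrations, and admin APIs for the Prequeue configuration.
  • Frontend: adds schedule type display and filtering, scheduleType selection in the job submit forms (gated by the backfill switch), resource usage summaries, and the system configuration UI and i18n strings for prequeue.
  • Docs: adds multilingual documentation pages on prequeue and user resource limit management.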
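
As an illustration of the scheduling metadata mentioned in the backend item, the following Go sketch round-trips a schedule type and a waiting tolerance through an annotation map. The annotation keys, the `"normal"`/`"backfill"` string values, the default for missing keys, and the function names are all assumptions made for this example; the PR's actual keys and helpers are not shown on this page.

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical annotation keys; the real keys are not shown in the PR page.
const (
	annScheduleType     = "example.invalid/schedule-type"
	annWaitingTolerance = "example.invalid/waiting-tolerance-seconds"
)

// ScheduleMeta is the metadata a job carries in this sketch.
type ScheduleMeta struct {
	ScheduleType            string // "normal" or "backfill" (assumed values)
	WaitingToleranceSeconds int64
}

// writeMeta stores the metadata as string annotations.
func writeMeta(ann map[string]string, m ScheduleMeta) {
	ann[annScheduleType] = m.ScheduleType
	ann[annWaitingTolerance] = strconv.FormatInt(m.WaitingToleranceSeconds, 10)
}

// parseMeta reads the metadata back. Missing keys fall back to a normal job
// with zero tolerance, which is an assumption of this sketch.
func parseMeta(ann map[string]string) (ScheduleMeta, error) {
	m := ScheduleMeta{ScheduleType: "normal"}
	if v, ok := ann[annScheduleType]; ok {
		if v != "normal" && v != "backfill" {
			return ScheduleMeta{}, fmt.Errorf("unknown schedule type %q", v)
		}
		m.ScheduleType = v
	}
	if v, ok := ann[annWaitingTolerance]; ok {
		n, err := strconv.ParseInt(v, 10, 64)
		if err != nil || n < 0 {
			return ScheduleMeta{}, fmt.Errorf("invalid waiting tolerance %q", v)
		}
		m.WaitingToleranceSeconds = n
	}
	return m, nil
}
```

Writing a value with `writeMeta` and reading it back with `parseMeta` should return the same struct, which is the property a round-trip test would check.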

Reviewed changes

Copilot reviewed 97 out of 98 changed files in this pull request and generated 2 comments.

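| File | Description |
| --- | --- |
| website/content/docs/admin/more/prequeue.mdx | Adds Chinese admin documentation for prequeue and user resource limits |
| website/content/docs/admin/more/prequeue.ko.mdx | Adds Korean admin documentation |
| website/content/docs/admin/more/prequeue.jp.mdx | Adds Japanese admin documentation |
| website/content/docs/admin/more/prequeue.en.mdx | Adds English admin documentation |
| frontend/src/utils/form.ts | Adds helper functions for aggregating resources across multiple tasks |
| frontend/src/utils/accelerator.ts | Extracts accelerator vendor/style parsing helpers |
| frontend/src/services/api/vcjob.ts | Adds ScheduleType, the Prequeue phase, and a unified phase mapping |
| frontend/src/services/api/system-config.ts | Adds API definitions for managing the prequeue system configuration |
| frontend/src/services/api/queue-quota.ts | Adds the queue quota management API |
| frontend/src/services/api/context.ts | Adds context APIs for prequeue status and resource summaries |
| frontend/src/routes/portal/overview/index.tsx | Overview page shows and filters by scheduleType and unifies how phases are counted |
| frontend/src/routes/portal/jobs/new/webide-job.tsx | WebIDE submit flow adopts scheduleType and the shared submit button |
| frontend/src/routes/portal/jobs/new/tensorflow-ps-job.tsx | TensorFlow submit flow adopts the shared submit button |
| frontend/src/routes/portal/jobs/new/single-job.tsx | Single-task training submit flow adopts scheduleType and the shared submit button |
| frontend/src/routes/portal/jobs/new/seacs-job.tsx | SEACS submit flow adopts the shared submit button |
| frontend/src/routes/portal/jobs/new/pytorch-ddp-job.tsx | PyTorch DDP submit flow adopts the shared submit button |
| frontend/src/routes/portal/jobs/new/jupyter-job.tsx | Jupyter submit flow adopts scheduleType and the shared submit button |
| frontend/src/routes/portal/jobs/new/emias-jupyter-job.tsx | EMIAS Jupyter submit flow adopts the shared submit button |
| frontend/src/routes/portal/jobs/new/emias-job.tsx | EMIAS custom job submit flow adopts the shared submit button |
| frontend/src/routes/portal/jobs/inter/index.tsx | Interactive job list adds scheduleType and the resource summary component |
| frontend/src/routes/admin/more/-components/prequeue-settings.tsx | Adds the admin component for prequeue runtime settings |
| frontend/src/i18n/locales/zhCN/translation.json | Adds Chinese strings for prequeue, resource summary, and scheduleType |
| frontend/src/i18n/locales/enUS/translation.json | Adds English strings for prequeue, resource summary, and scheduleType |
| frontend/src/components/user/user-jobs.tsx | User job list adds a scheduleType column and the unified phase |
| frontend/src/components/layout/detail-page.tsx | Adjusts the detail page layout to support scrolling and filling the full height |
| frontend/src/components/job/statuses.ts | Adds scheduleType filter options and the table header mapping |
| frontend/src/components/job/overview/job-lock-sheet.tsx | Disables the lock request entry for backfill jobs and shows an explanatory message |
| frontend/src/components/job/overview/job-actions-menu.tsx | Passes scheduleType to the lock dialog |
| frontend/src/components/job/overview/emias-jobs.tsx | Replaces the brief area of the job list with the resource summary component and uses the unified phase |
| frontend/src/components/job/overview/custom-jobs.tsx | Volcano job list adds scheduleType and the resource summary component |
| frontend/src/components/job/overview/admin-jobs.tsx | Admin job list adds scheduleType and the unified phase |
| frontend/src/components/job/job-submit-button.tsx | Adds a shared "Submit job" button component (with a loading state) |
| frontend/src/components/job/detail/index.tsx | Job deletion distinguishes the admin and user APIs and refreshes the resource summary cache |
| frontend/src/components/form/schedule-type-form-field.tsx | Adds a scheduleType radio form field component |
| frontend/src/components/button/loadable-button.tsx | Supports an extra disabled prop (combined with loading) |
| frontend/src/components/badge/schedule-type-badge.tsx | Adds a schedule type badge |
| frontend/src/components/badge/job-phase-badge.tsx | Internationalizes the Pending label and displays Prequeue as Pending |
| frontend/src/components/badge/accelerator-badge.tsx | Reuses the accelerator helpers and improves display (truncation/tooltip) |
| backend/pkg/vcqueue/queue.go | Volcano Queue reads and writes now use cluster-scoped object keys; adds ResolveJobQueueName |
| backend/pkg/vcqueue/const.go | Adds a constant for the public queue name |
| backend/pkg/utils/resource.go | Adds resource subtraction and ResourceList stringification helpers |
| backend/pkg/utils/job.go | Adds single-node job detection, resource domain computation, and blocking checks |
| backend/pkg/prequeuewatcher/watcher_test.go | Adds tests for PrequeueWatcher activation transactions and rollback |
| backend/pkg/prequeuewatcher/watcher.go | Adds the watcher main loop (signal + ticker + dynamic config refresh) |
| backend/pkg/prequeuewatcher/scheduling.go | Adds node/Pod scheduling constraint matching and available resource calculation |
| backend/pkg/prequeuewatcher/scan.go | Adds full-scan round orchestration and timeout-check helpers |
| backend/pkg/prequeuewatcher/runtime_config.go | Adds watcher runtime config loading, validation, and a hot-updatable ticker |
| backend/pkg/prequeuewatcher/resources.go | Adds the resource subset search for backfill preemption and the deficit calculation |
| backend/pkg/prequeuewatcher/preemption.go | Adds planning and execution of backfill preemption triggered by timed-out single-node normal jobs |
| backend/pkg/prequeuewatcher/activation.go | Adds prequeue candidate selection, quota checks, and atomic claim+activate |
| backend/pkg/crclient/nodeclient.go | UpdateNodeunschedule now returns whether the node was unschedulable before the update, so callers can decide what to do |
| backend/pkg/aitaskctl/controller.go | Updates the gocyclo comment |
| backend/internal/util/type_test.go | Adds MapToStruct unit tests covering multiple tags, durations, pointers, and similar cases |
| backend/internal/util/type.go | Adds the MapToStruct helper (populates a struct from a string map) |
| backend/internal/service/vcjob/schedule.go | Adds scheduling metadata annotation keys and parse/write functions |
| backend/internal/service/vcjob/runtime_test.go | Adds scheduling metadata parsing and round-trip tests |
| backend/internal/service/vcjob/runtime.go | Adds resource calculation, JobRecord generation, and reusable activation/Ingress/Forwards logic |
| backend/internal/service/prequeue_service_test.go | Adds tests for queue quota parsing and resource summary construction |
| backend/internal/service/config_service_test.go | Adds tests for prequeue runtime config parsing and validation |
| backend/internal/service/config_service.go | Adds prequeue_configs initialization, read, update, and validation logic |
| backend/internal/handler/vcjob/webide.go | WebIDE creation flow adopts scheduleType and the unified submitJob |
| backend/internal/handler/vcjob/util.go | Label/annotation building supports schedule metadata and writes the forwards annotation |
| backend/internal/handler/vcjob/tensorflow.go | TensorFlow creation flow adopts scheduleType and the unified submitJob |
| backend/internal/handler/vcjob/pytorch.go | PyTorch creation flow adopts scheduleType and the unified submitJob |
| backend/internal/handler/vcjob/lifecycle.go | Adds submitJob: after the quota and blocking checks, the job either enters Prequeue or is activated directly |
| backend/internal/handler/vcjob/jupyter.go | Jupyter creation flow adopts scheduleType and the unified submitJob |
| backend/internal/handler/vcjob/custom.go | Custom/Training creation flow adopts scheduleType and the unified submitJob |
| backend/internal/handler/system_config.go | Adds admin/system-config/prequeue GET/PUT and triggers a watcher full scan |
| backend/internal/handler/queue_quota.go | Adds the queue quota Admin API and triggers a watcher full scan |
| backend/internal/handler/node.go | Triggers a watcher full scan after a node is made schedulable again |
| backend/internal/handler/interface.go | RegisterConfig injects PrequeueWatcher/PrequeueService |
| backend/internal/handler/approvalorder.go | Forbids backfill jobs from creating lock-extension approval orders (backend validation) |
| backend/internal/handler/aijob/emias.go | Updates the gocyclo comment |
| backend/go.sum | Dependency updates (including component-helpers, sqlite, and others) |
| backend/go.mod | Adds component-helpers, the sqlite driver, and other dependencies |
| backend/dao/query/jobs.gen.go | Job query layer adds the schedule_type / waiting_tolerance_seconds / queue fields |
| backend/dao/query/gen.go | Query set adds prequeue_config / queue_quota_limit |
| backend/dao/model/queue_quota.go | Adds the queue_quotas model |
| backend/dao/model/prequeue_config.go | Adds the prequeue_configs model and default-value constants |
| backend/dao/model/job.go | Job model adds the Prequeue phase, ScheduleType, Queue, and WaitingToleranceSeconds |
| backend/cmd/gorm-gen/models/migrate.go | Adds migrations for queues, quotas, prequeue config, and related tables |
| backend/cmd/gorm-gen/curd/generate.go | gorm-gen adds generation entries for the new models |
| backend/cmd/crater/helper/manager.go | Adds PrequeueWatcher to the controller-runtime manager |
| backend/cmd/crater/helper/config.go | Initializes PrequeueService and PrequeueWatcher and injects their dependencies |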
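
To make the "deficit calculation" and "resource subset search for backfill preemption" entries more concrete, here is a rough Go sketch. It is not the PR's algorithm: it uses a simple greedy pick (largest contribution first) rather than whatever search `resources.go` implements, and `Resources`, `BackfillJob`, `deficit`, and `pickVictims` are hypothetical names over a simplified integer resource model.

```go
package main

import "sort"

// Resources is a simplified resource vector (assumption of this sketch).
type Resources map[string]int64

// deficit returns, per resource, how much of the request the node's free
// capacity cannot cover. Resources that fit are omitted.
func deficit(requested, free Resources) Resources {
	d := Resources{}
	for name, want := range requested {
		if short := want - free[name]; short > 0 {
			d[name] = short
		}
	}
	return d
}

// BackfillJob is a running backfill job on the candidate node.
type BackfillJob struct {
	Name string
	Uses Resources
}

// pickVictims greedily chooses backfill jobs to preempt until the deficit
// is covered. It returns false when even all useful jobs are not enough.
func pickVictims(d Resources, jobs []BackfillJob) ([]string, bool) {
	if len(d) == 0 {
		return nil, true // nothing to free
	}
	// contribution counts only resources that are actually short.
	contribution := func(j BackfillJob) int64 {
		var c int64
		for name, short := range d {
			if u := j.Uses[name]; u < short {
				c += u
			} else {
				c += short
			}
		}
		return c
	}
	sorted := append([]BackfillJob(nil), jobs...)
	sort.SliceStable(sorted, func(a, b int) bool {
		return contribution(sorted[a]) > contribution(sorted[b])
	})
	freed := Resources{}
	var victims []string
	for _, j := range sorted {
		if contribution(j) == 0 {
			break // remaining jobs free nothing that is short
		}
		for name, n := range j.Uses {
			freed[name] += n
		}
		victims = append(victims, j.Name)
		covered := true
		for name, short := range d {
			if freed[name] < short {
				covered = false
			}
		}
		if covered {
			return victims, true
		}
	}
	return nil, false
}
```

A greedy pick like this can preempt more jobs than strictly necessary; an exhaustive subset search trades extra computation for a tighter victim set.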

Comment on lines +172 to +174
if candidate.ScheduleType != nil && *candidate.ScheduleType == model.ScheduleTypeBackfill && !backfillEnabled {
	return false
}

Copilot AI Apr 20, 2026

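[Core convention] When backfill is disabled, this returns false (meaning "not blocked") for backfill jobs, so selectActivationCandidates may still select and activate them, which is inconsistent with the semantics of the "allow submitting backfill jobs" switch. Consider treating the candidate as blocked when backfillEnabled=false and the candidate is a backfill job (or filtering backfill jobs out when the candidate list is built).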
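
The second option suggested in this comment, filtering backfill jobs out of the candidate list while backfill is disabled, might look roughly like the sketch below. The `Candidate` type, the string values of the schedule types, and `filterCandidates` are assumptions for illustration; only the nil check and the comparison against a backfill schedule type come from the reviewed snippet.

```go
package main

// ScheduleType and its values are assumed here; the reviewed code uses
// model.ScheduleTypeBackfill, whose underlying value is not shown.
type ScheduleType string

const (
	ScheduleTypeNormal   ScheduleType = "normal"
	ScheduleTypeBackfill ScheduleType = "backfill"
)

// Candidate is a hypothetical stand-in for a prequeued job record.
type Candidate struct {
	Name         string
	ScheduleType *ScheduleType // nil is not backfill, matching the nil check in the reviewed snippet
}

// isBackfill mirrors the condition used in the reviewed snippet.
func isBackfill(c Candidate) bool {
	return c.ScheduleType != nil && *c.ScheduleType == ScheduleTypeBackfill
}

// filterCandidates drops backfill jobs while backfill is disabled, so a
// later selection step can never activate them.
func filterCandidates(cs []Candidate, backfillEnabled bool) []Candidate {
	out := make([]Candidate, 0, len(cs))
	for _, c := range cs {
		if isBackfill(c) && !backfillEnabled {
			continue
		}
		out = append(out, c)
	}
	return out
}
```

Filtering up front keeps the blocking predicate focused on resource-domain blocking, at the cost of a second pass over the candidates.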

Comment on lines 89 to 93
  key: 'status',
  title: '状态',
  option: jobPhases,
- defaultValues: ['Running', 'Pending'],
+ defaultValues: ['Running', 'Pending', 'Prequeue'],
},

Copilot AI Apr 20, 2026

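[Core convention] The status column uses getUnifiedJobPhase to fold Prequeue into Pending, and the status filter options in jobPhases do not include Prequeue either; yet the default filter values here still contain 'Prequeue', which makes the filter UI and behavior inconsistent (for example, the defaults include a value that never appears among the options). Consider removing 'Prequeue' from the defaults, or explicitly supporting Prequeue in both the filter options and the accessor logic.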

@Jinghao-coding Jinghao-coding merged commit 88302bd into raids-lab:main Apr 22, 2026
8 checks passed
MOONSakura0614 added a commit to MOONSakura0614/crater that referenced this pull request Apr 28, 2026
Main brought in:
- Billing flow: new Account/UserAccount/Job/Resource fields, BillingService
  with a patrol-loop cron (biling-base-loop), visibility controls UI
- Quota-aware prequeue scheduling: PrequeueConfig/QueueQuotaLimit tables,
  PrequeueService, admin config + frontend prequeue page (raids-lab#393)
- CPU pinning hint (raids-lab#384) - job-new forms gain a pin toggle
- Heterogeneous GPU / accelerator helpers in frontend/src/utils/accelerator.ts
- Billing visibility utils, queue-quota/billing API clients
- Adjusted signatures:
    aitaskctl.CheckResourcesBeforeCreateJob  → (resources, error)
    nodeClient.UpdateNodeunschedule          → (bool, error)
    vcjob/util.go getLabelAndAnnotations     → (*CreateJobCommon, *jobScheduleMetadata)

Conflict resolutions:
- backend/cmd/crater/helper/config.go - wire both adminOpsReportService
  AND registerConfig.BillingService into NewCronJobManager; keep both
  SetCronJobManager + EnsureBuiltinCronJobs calls.
- backend/pkg/cronjob/manger.go - NewCronJobManager takes both patrol
  services; Clients struct keeps both fields.
- backend/pkg/patrol/patrol.go - keep all four cron constants (GPU
  analysis, admin ops report, storage daily audit, billing base loop)
  plus all three interfaces; dispatch switch handles every case.
- backend/cmd/gorm-gen/models/migrate.go - concat agent migrations
  (202604220001..202604230002) then main's billing/prequeue migrations
  (202603311930..202604171200). Removed a stray duplicated billing
  block that sneaked in during conflict editing.
- backend/internal/handler/approvalorder.go - keep both the job-validation
  pre-flight check from main AND the auto-approval branch from HEAD;
  imports merged (context + encoding/json + errors).
- backend/internal/service/config_service.go - keep HEAD's multi-line
  UpdateLLMConfig signature.
- backend/internal/handler/agent/tools_cluster_write.go - adapt cordon /
  uncordon to the new (bool, error) return from UpdateNodeunschedule.
- backend/internal/handler/vcjob/agent_submit.go - adapt
  CheckResourcesBeforeCreateJob destructuring and the new
  getLabelAndAnnotations(jobType, token, baseURL, *CreateJobCommon, nil)
  signature (scheduleMetadata is nil; the agent submission path uses
  default scheduling for now).
- frontend/src/routes/admin/cronjobs/index.tsx - registry includes
  trigger-admin-ops-report-job AND biling-base-loop patrol rows.
- backend/docs/docs.go, swagger.json, swagger.yaml - take main's
  regenerated swagger; will be re-generated next time we run swag.
- .gitignore - union of HEAD block and main's .codex entry.

Both builds pass:
  backend:  go build ./...         (clean)
  frontend: pnpm build, tsc --noEmit (only pre-existing orders/$id.tsx error)