38 commits
c345426
support cel filter
wangqianqianjun Aug 25, 2025
10251e4
Merge branch 'NexusGPU:main' into main
wangqianqianjun Aug 27, 2025
7be8e25
covert allocator request to cel filter
wangqianqianjun Aug 30, 2025
fc26511
support annotaion cel
wangqianqianjun Aug 31, 2025
6978807
remove deperate config
wangqianqianjun Aug 31, 2025
d3c112a
remove docs
wangqianqianjun Aug 31, 2025
d466cda
chore(deps): bump golang from 1.24 to 1.25 in /dockerfile (#325)
dependabot[bot] Sep 3, 2025
8bd5e89
chore(deps): bump cycjimmy/semantic-release-action from 4 to 5 (#338)
dependabot[bot] Sep 3, 2025
67b1c64
fix: helm chart issue (#346)
Code2Life Sep 3, 2025
5dc9c79
Merge branch 'NexusGPU:main' into main
wangqianqianjun Sep 3, 2025
dbc088c
chore(deps): bump k8s.io/kubernetes (#347)
dependabot[bot] Sep 4, 2025
865bdf5
fix: Potential fix for code scanning alert no. 36: Workflow does not …
0x5457 Sep 4, 2025
9006e96
support dedicated-gpu (#345)
wangqianqianjun Sep 4, 2025
0389852
fix: skip gpu limiter not working issue, observability optimize (#350)
Code2Life Sep 4, 2025
c0a3500
fix: init pricing overwrite vram to 0 (#351)
wangqianqianjun Sep 7, 2025
f25c65d
fix: add node hash for gpu k8s node, owner ref for hypervisor, isolat…
Code2Life Sep 8, 2025
e628187
fix: upgrade k8s 1.34, fix shm path, helm chart issues (#355)
Code2Life Sep 9, 2025
52d4fd2
cel fliter enhancement
wangqianqianjun Sep 9, 2025
e55e53d
fix: dedicated gpu annotation causing webhook failure issue (#356)
Code2Life Sep 10, 2025
0d77024
fix: extract GPU map update logic into separate method and fix webhoo…
Code2Life Sep 11, 2025
52dc0a4
cel fix phase filter
wangqianqianjun Sep 14, 2025
cd1d7dd
disable predicate fast path
wangqianqianjun Sep 14, 2025
f700eac
fix lint issue
wangqianqianjun Sep 14, 2025
8503585
Merge branch 'main' into dra
wangqianqianjun Sep 14, 2025
de5b0c1
chore(deps): bump github.com/aws/aws-sdk-go-v2 from 1.38.3 to 1.39.0 …
dependabot[bot] Sep 15, 2025
3d9b2c4
chore(deps): bump gorm.io/gorm from 1.30.3 to 1.31.0 (#361)
dependabot[bot] Sep 15, 2025
ec36d4a
chore(deps): bump k8s.io/client-go from 0.34.0 to 0.34.1 (#364)
dependabot[bot] Sep 15, 2025
40b98a8
chore(deps): bump k8s.io/component-helpers from 0.34.0 to 0.34.1 (#360)
dependabot[bot] Sep 15, 2025
a45ba60
chore(deps): bump sigs.k8s.io/controller-runtime from 0.22.0 to 0.22.…
dependabot[bot] Sep 15, 2025
5867f3c
feat: preempt support for GPU workers (#366)
Code2Life Sep 17, 2025
4fc9dc9
fix: add resource validation in Bind to prevent GPU over-allocation (…
0x5457 Sep 17, 2025
5f25794
webhook & gpu resource fit dra support
wangqianqianjun Sep 22, 2025
4959c61
resource template support
wangqianqianjun Sep 23, 2025
ff9efd2
support resource claim cel builder
wangqianqianjun Sep 24, 2025
f48f00a
fix conflict
wangqianqianjun Sep 24, 2025
1afc62d
fix conflict for gpuresources.go
wangqianqianjun Sep 28, 2025
efbce3f
1. support resource slice build and destory 2. make resource slice bu…
wangqianqianjun Sep 28, 2025
7d95fef
feat: Added DRA CEL filter support
wangqianqianjun Oct 4, 2025
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ jobs:
uses: actions/checkout@v5

- name: Setup Go
uses: actions/setup-go@v5
uses: actions/setup-go@v6
with:
go-version: '~1.24'

3 changes: 3 additions & 0 deletions .github/workflows/test-e2e.yml
@@ -1,5 +1,8 @@
name: E2E Tests

permissions:
contents: read

on:
workflow_dispatch:

4 changes: 2 additions & 2 deletions .github/workflows/test.yml
Expand Up @@ -28,13 +28,13 @@ jobs:
strategy:
matrix:
# from https://github.com/kubernetes-sigs/controller-tools/blob/main/envtest-releases.yaml
envtest_k8s_version: [1.23.5, 1.33.0]
envtest_k8s_version: [1.23.5, 1.34.0]
steps:
- name: Clone the code
uses: actions/checkout@v5

- name: Setup Go
uses: actions/setup-go@v5
uses: actions/setup-go@v6
with:
go-version: '~1.24'

3 changes: 2 additions & 1 deletion .vscode/launch.json
Expand Up @@ -61,7 +61,8 @@
"KUBECONFIG": "~/.kube/config-local-studio",
"ENABLE_WEBHOOKS": "false",
"ENABLE_SCHEDULER": "true",
"ENABLE_CR_CONTROLLER": "true"
"ENABLE_CR_CONTROLLER": "true",
"NVIDIA_OPERATOR_PROGRESSIVE_MIGRATION": "true"
},
"args": [
"--metrics-path", "${workspaceFolder}/logs/metrics.log",
6 changes: 6 additions & 0 deletions .vscode/settings.json
Expand Up @@ -25,6 +25,7 @@
"clientcmdapi",
"clientgoscheme",
"clientset",
"clientsetfake",
"cloudnative",
"cloudprovider",
"clusterissuers",
Expand All @@ -46,6 +47,8 @@
"envtest",
"essd",
"Eventf",
"evictable",
"featuregate",
"finalizer",
"Finalizers",
"frameworkruntime",
Expand Down Expand Up @@ -78,6 +81,8 @@
"iface",
"imageutils",
"influxdata",
"internalcache",
"internalqueue",
"jsonpatch",
"karpenter",
"karpv",
Expand Down Expand Up @@ -129,6 +134,7 @@
"schedulingconfigtemplate",
"schedulingconfigtemplates",
"schedulingcorev",
"schedv",
"serviceaccount",
"shirou",
"shortuuid",
6 changes: 6 additions & 0 deletions api/v1/gpupool_types.go
Expand Up @@ -238,6 +238,12 @@ type QosConfig struct {
Definitions []QosDefinition `json:"definitions,omitempty"`
DefaultQoS QoSLevel `json:"defaultQoS,omitempty"`
Pricing []QosPricing `json:"pricing,omitempty"`

// Eviction protection price ratio applied to cost calculation during protection period
// This multiplier increases pricing for protected workloads to discourage preemption
// +optional
// +kubebuilder:default="1.2"
EvictionProtectionPriceRatio string `json:"evictionProtectionPriceRatio,omitempty"`
}

type QosDefinition struct {
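
The new `EvictionProtectionPriceRatio` field is a string with a kubebuilder default of "1.2", so any consumer has to parse it before using it as a multiplier. The following is a minimal sketch of that step, not the PR's actual code: `parsePriceRatio` and `protectedCost` are hypothetical helpers, and the fallback-to-default behavior for empty or invalid values is an assumption.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// defaultPriceRatio mirrors the kubebuilder default ("1.2") shown in the diff.
const defaultPriceRatio = 1.2

// parsePriceRatio (hypothetical helper) converts the string-typed ratio into a
// float64. Empty, malformed, non-finite, or non-positive values fall back to
// the default; that fallback policy is an assumption of this sketch.
func parsePriceRatio(raw string) float64 {
	ratio, err := strconv.ParseFloat(raw, 64)
	if err != nil || math.IsNaN(ratio) || math.IsInf(ratio, 0) || ratio <= 0 {
		return defaultPriceRatio
	}
	return ratio
}

// protectedCost (hypothetical) scales a base hourly cost by the parsed ratio.
func protectedCost(baseCostPerHour float64, rawRatio string) float64 {
	return baseCostPerHour * parsePriceRatio(rawRatio)
}

func main() {
	fmt.Printf("%.3f\n", protectedCost(1.64, ""))    // field unset: default ratio applies
	fmt.Printf("%.3f\n", protectedCost(1.64, "1.5")) // explicit ratio
}
```

Keeping the field as a string avoids floating-point values in the CRD schema, at the cost of requiring validation like this wherever the ratio is read.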
10 changes: 8 additions & 2 deletions api/v1/gpuresourcequota_types.go
Expand Up @@ -19,7 +19,7 @@ package v1
import (
v1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/kubernetes/pkg/scheduler/framework"
fwk "k8s.io/kube-scheduler/framework"
)

// GPUResourceQuotaSpec defines the desired state of GPUResourceQuota
Expand Down Expand Up @@ -192,6 +192,12 @@ type AllocRequest struct {

// cel filter expression
CELFilterExpression string

QoS QoSLevel
}

func (p *AllocRequest) Clone() fwk.StateData {
return p
}

type GPUAllocationInfo struct {
Expand All @@ -209,7 +215,7 @@ type AdjustRequest struct {
NewLimit Resource
}

func (ar *AllocRequest) Clone() framework.StateData {
func (ar *AdjustRequest) Clone() fwk.StateData {
return ar
}
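
Both `Clone` methods in this hunk return the receiver itself, so the "clone" and the original share the same memory. The sketch below contrasts that with a copying clone. It assumes a local `StateData` interface as a stand-in for the scheduler framework's type, and `request`, `sharedRequest`, and `copiedRequest` are hypothetical simplifications, not the real `AllocRequest`.

```go
package main

import "fmt"

// StateData is a local stand-in for the scheduler framework's StateData
// interface, so the sketch compiles without Kubernetes dependencies.
type StateData interface {
	Clone() StateData
}

// request is a simplified, hypothetical payload (not the real AllocRequest).
type request struct {
	Pool     string
	GPUNames []string
}

// sharedRequest mirrors the pattern in the diff: Clone returns the receiver.
type sharedRequest struct{ request }

func (r *sharedRequest) Clone() StateData { return r }

// copiedRequest copies the struct and its slice so the clone is independent.
type copiedRequest struct{ request }

func (r *copiedRequest) Clone() StateData {
	out := *r
	out.GPUNames = append([]string(nil), r.GPUNames...)
	return &out
}

func main() {
	s := &sharedRequest{request{Pool: "pool-a", GPUNames: []string{"gpu-0"}}}
	s.Clone().(*sharedRequest).GPUNames[0] = "changed"
	fmt.Println(s.GPUNames[0]) // the original sees the mutation

	c := &copiedRequest{request{Pool: "pool-a", GPUNames: []string{"gpu-0"}}}
	c.Clone().(*copiedRequest).GPUNames[0] = "changed"
	fmt.Println(c.GPUNames[0]) // the original is unaffected
}
```

Returning the receiver is cheap and is safe as long as the stored state is treated as read-only after it is written; the copying variant only matters if a plugin mutates the state it reads back.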

15 changes: 15 additions & 0 deletions api/v1/schedulingconfigtemplate_types.go
Expand Up @@ -39,6 +39,10 @@ type SchedulingConfigTemplateSpec struct {
// single GPU device multi-process queuing and fair scheduling with QoS constraint
// +optional
Hypervisor *HypervisorScheduling `json:"hypervisor,omitempty"`

// enable Dynamic Resource Allocation (DRA) for GPU resource management
// +optional
DRA *DRAConfig `json:"dra,omitempty"`
}

type PlacementConfig struct {
Expand Down Expand Up @@ -206,6 +210,17 @@ type MultiProcessQueuing struct {
QueueLevelTimeSlices []string `json:"queueLevelTimeSlices,omitempty"`
}

// DRAConfig configures Dynamic Resource Allocation support
type DRAConfig struct {
// Enable DRA mode for all workloads in this configuration template
// +optional
Enable *bool `json:"enable,omitempty"`

// ResourceClaimTemplateName specifies the ResourceClaim template name to use
// +optional
ResourceClaimTemplateName string `json:"resourceClaimTemplateName,omitempty"`
}

// SchedulingConfigTemplateStatus defines the observed state of SchedulingConfigTemplate.
type SchedulingConfigTemplateStatus struct {
// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
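
Because `DRA` is a pointer and `Enable` is a `*bool`, there are three states to handle: no `dra` block, a `dra` block without `enable`, and an explicit `true` or `false`. The following is a minimal sketch of a nil-safe check. `draEnabled` is a hypothetical helper, and treating an unset value as disabled is an assumption, not something the diff states.

```go
package main

import "fmt"

// DRAConfig reproduces the struct added in the diff above.
type DRAConfig struct {
	Enable                    *bool  `json:"enable,omitempty"`
	ResourceClaimTemplateName string `json:"resourceClaimTemplateName,omitempty"`
}

// draEnabled (hypothetical helper) reports whether DRA mode is switched on.
// A nil config or a nil Enable pointer is treated as "disabled" here; that
// default is an assumption of this sketch.
func draEnabled(cfg *DRAConfig) bool {
	return cfg != nil && cfg.Enable != nil && *cfg.Enable
}

func main() {
	on, off := true, false
	fmt.Println(draEnabled(nil))                      // no dra block at all
	fmt.Println(draEnabled(&DRAConfig{}))             // dra block without enable
	fmt.Println(draEnabled(&DRAConfig{Enable: &off})) // explicitly disabled
	fmt.Println(draEnabled(&DRAConfig{Enable: &on}))  // explicitly enabled
}
```

Using `*bool` rather than `bool` keeps "not set" distinguishable from "set to false", which matters if a default is applied later.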
25 changes: 25 additions & 0 deletions api/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion charts/tensor-fusion/Chart.yaml
Expand Up @@ -15,7 +15,7 @@ type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 1.5.5
version: 1.5.9

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
6 changes: 6 additions & 0 deletions charts/tensor-fusion/crds/tensor-fusion.ai_gpupools.yaml
Expand Up @@ -562,6 +562,12 @@ spec:
type: integer
type: object
type: array
evictionProtectionPriceRatio:
default: "1.2"
description: |-
Eviction protection price ratio applied to cost calculation during protection period
This multiplier increases pricing for protected workloads to discourage preemption
type: string
pricing:
items:
properties:
Original file line number Diff line number Diff line change
Expand Up @@ -143,6 +143,20 @@ spec:
type: string
type: object
type: object
dra:
description: enable Dynamic Resource Allocation (DRA) for GPU resource
management
properties:
enable:
description: Enable DRA mode for all workloads in this configuration
template
type: boolean
resourceClass:
default: tensorfusion.ai/gpu
description: ResourceClass specifies the DRA resource class name
to use
type: string
type: object
hypervisor:
description: single GPU device multi-process queuing and fair scheduling
with QoS constraint
Original file line number Diff line number Diff line change
Expand Up @@ -629,6 +629,12 @@ spec:
type: integer
type: object
type: array
evictionProtectionPriceRatio:
default: "1.2"
description: |-
Eviction protection price ratio applied to cost calculation during protection period
This multiplier increases pricing for protected workloads to discourage preemption
type: string
pricing:
items:
properties:
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ webhooks:
namespace: {{ include "tensor-fusion.namespace" . }}
path: /mutate-v1-pod
failurePolicy: {{ .Values.controller.admissionWebhooks.failurePolicy }}
name: mpod-v1.kb.io
name: mpod.tensor-fusion.ai
rules:
- apiGroups:
- ""
1 change: 1 addition & 0 deletions charts/tensor-fusion/templates/controller-deployment.yaml
Expand Up @@ -32,6 +32,7 @@ spec:
{{- end }}
serviceAccountName: {{ include "tensor-fusion.serviceAccountName" . }}
enableServiceLinks: false
priorityClassName: "system-cluster-critical"
containers:
- name: controller
image: "{{ .Values.controller.image.repository }}:{{ .Values.controller.image.tag | default .Chart.AppVersion }}"
18 changes: 15 additions & 3 deletions charts/tensor-fusion/templates/gpu-public-gpu-info.yaml
Expand Up @@ -45,6 +45,18 @@ data:
costPerHour: 1.64
fp16TFlops: 312

- model: A100_PCIe_40GB
fullModelName: "NVIDIA A100-PCIE-40GB"
vendor: NVIDIA
costPerHour: 1.64
fp16TFlops: 312

- model: A100_PCIe_80GB
fullModelName: "NVIDIA A100-PCIE-80GB"
vendor: NVIDIA
costPerHour: 1.64
fp16TFlops: 312

- model: A100_SXM_40G
fullModelName: "NVIDIA A100-SXM4-40GB"
vendor: NVIDIA
Expand All @@ -70,13 +82,13 @@ data:
fp16TFlops: 312

- model: A800_PCIe_80G
fullModelName: "NVIDIA A800 80GB PCIe"
fullModelName: "NVIDIA A800-PCIE-80GB"
vendor: NVIDIA
costPerHour: 1.64
fp16TFlops: 312

- model: A800_PCIe_40G
fullModelName: "NVIDIA A800 40GB PCIe"
fullModelName: "NVIDIA A800-PCIE-40GB"
vendor: NVIDIA
costPerHour: 1.64
fp16TFlops: 312
Expand All @@ -95,7 +107,7 @@ data:
fp16TFlops: 125

- model: A40
fullModelName: "NVIDIA A40 48GB PCIe"
fullModelName: "NVIDIA A40-PCIE-48GB"
vendor: NVIDIA
costPerHour: 0.4
fp16TFlops: 149.7
23 changes: 23 additions & 0 deletions charts/tensor-fusion/templates/priorityclass.yaml
@@ -0,0 +1,23 @@
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: tensor-fusion-critical
value: 100000
globalDefault: false
description: "TensorFusion critical priority"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: tensor-fusion-high
value: 10000
globalDefault: false
description: "TensorFusion high priority"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: tensor-fusion-medium
value: 0
globalDefault: false
description: "TensorFusion medium priority"
8 changes: 4 additions & 4 deletions charts/tensor-fusion/values.yaml
Expand Up @@ -31,7 +31,7 @@ controller:
image:
repository: tensorfusion/tensor-fusion-operator
# Overrides the image tag whose default is the chart appVersion.
tag: "latest"
tag: "1.43.4"
# This is for setting Kubernetes Annotations to a Pod.
# For more information checkout: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/

Expand Down Expand Up @@ -120,7 +120,7 @@ agent:

image:
repository: tensorfusion/tensor-fusion-agent
tag: "latest"
tag: "1.0.0"

resources:
requests:
Expand Down Expand Up @@ -169,8 +169,8 @@ schedulerConfig:
kind: KubeSchedulerConfiguration
clientConnection:
kubeconfig: ""
qps: 50
burst: 100
qps: 1000
burst: 2000
profiles:
# Refer: https://kubernetes.io/docs/reference/scheduling/config/
- schedulerName: tensor-fusion-scheduler