Commit 6a943cd
fix: autoscale refactor, support multi rules and external scaler (#470)
* fix: dynamic auto scale eval interval
* fix: autoscale refactor, support multi rules and external scaler
* fix: autoscale unit test issue
* fix: autoscaler refactor
* fix: autoscale unit test issues
* fix: unit test issue
* fix: unit test issue
* fix: simplify tests
* fix: lint issue
1 parent 630675a commit 6a943cd

38 files changed: +2661 additions, -1134 deletions

.cursor/rules/requirement.mdc

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+---
+alwaysApply: true
+---
+
+# Project Goals
+TensorFusion is building large-scale heterogeneous GPU pooling and scheduling AI infrastructure on cloud-native ecosystem projects and libraries, helping enterprises save GPU costs, simplify O&M, increase observability, and boost elasticity.
+
+Underlying tech in this repo: Kubebuilder, Scheduler, CDI. Not in this repo: user-space time-division-sharing fractional GPU, API-forwarding-based GPU-over-IP.
+
+Critical Modules:
+- pod mutating webhook to augment user pods, adding needed inputs and outputs
+- advanced scheduler with allocator / GPU-resource vertical scaler / bin-packing / rebalancer / quotas
+- custom resource operator: GPU cluster -> pool -> gpunode -> gpu, gpunodeclaim -> node -> gpunode; maintains resource and TensorFusion component status, evaluates alerts, etc.
+- hypervisor: works like kubelet, reconciles TensorFusion workers on each GPU node, discovers and binds devices, handles multi-process priority and autoFreeze, produces metrics, etc.
+- server: offers an API to assign remote vGPU workers and exposes system debug endpoints
+- cloud provider integration (direct integration or with Karpenter)
+- indexallocator: a special module that works around the CDI device plugin's Allocate interface not receiving Pod info; without CDI container -> Pod matching it is impossible to get advanced allocation info (a hack until k8s DRA is deployed). It composes a special index from a dummy resource name and count and passes it to the hypervisor. This is not the general device plugin pattern; keep this context in mind only when changing device allocation and device-plugin-related functions.
+
+# Requirements
+
+You are a professional cloud-native and AI infra engineer. Write high-quality, robust code following Golang and k8s best practices.
+Confirm the plan, then write code.
+Always be user-centric: for every task, think through the whole user workflow and scenario and how an AI inference/training app runs on this system; no hidden logic; concise, strongly-typed definitions.
+Field definitions live in the @api/v1 package; always choose the best data structure when CRD changes are needed.
+Don't over-abstract or under-abstract: extract interfaces based on business understanding, and don't extract an interface when it is not needed.
+Extract a function when it grows beyond 50-80 lines; otherwise prefer a single simple function with one responsibility.
+Use modern Golang features, e.g. any rather than interface{}, and generics where needed.
+Never reinvent wheels: consider how Kubernetes source code and Kubernetes SIGs do it, and leverage the utils and constants packages and existing dependencies.
+Always prioritize security, scalability, and maintainability.
+Think in terms of the reconcile loop, memory consistency patterns, and the kubebuilder framework.
+Watch for tricky k8s issues like resource conflicts and finalizers; use DeepCopy rather than field-by-field assignment, and equality.Semantic.DeepEqual rather than hand-coded comparison.
+Never implement a large task at once; break it into smaller ones.
+Only write necessary comments, e.g. for complex algorithms and background info; never write redundant comments.
+Always remember to add events via the Kubernetes event recorder, and logs, for KEY code paths; these matter for user observability and troubleshooting, but don't emit too many events.
+Always be test-driven: write Ginkgo-based test cases, keep running go/ginkgo test commands, and review and refactor until the tests pass.
+When a task introduces new in-memory state, consider exposing it through the server module for troubleshooting.

.vscode/launch.json

Lines changed: 2 additions & 2 deletions
@@ -61,7 +61,7 @@
         "KUBECONFIG": "~/.kube/config-local-studio",
         "ENABLE_WEBHOOKS": "false",
         "ENABLE_SCHEDULER": "true",
-        "ENABLE_CR_CONTROLLER": "true",
+        "ENABLE_CR_CONTROLLER": "false",
         "NVIDIA_OPERATOR_PROGRESSIVE_MIGRATION": "true"
       },
       "args": [
@@ -70,7 +70,7 @@
         "--dynamic-config", "${workspaceFolder}/config/samples/dynamic-config.yaml",
         "--scheduler-config", "${workspaceFolder}/config/samples/scheduler-config.yaml",
         // "--enable-alert",
-        // "--enable-auto-scale",
+        "--enable-auto-scale",
         "--enable-auto-expander",
         "-v", "4"
       ],

api/v1/schedulingconfigtemplate_types.go

Lines changed: 82 additions & 73 deletions
@@ -17,6 +17,7 @@ limitations under the License.
 package v1
 
 import (
+	v1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/runtime"
 )
@@ -29,10 +30,12 @@ type SchedulingConfigTemplateSpec struct {
 
 	// scale the workload based on the usage and traffic
 	// +optional
-	AutoScaling *AutoScalingConfig `json:"autoScaling,omitempty"`
+	VerticalScalingRules []VerticalScalingRule `json:"verticalScalingRules,omitempty"`
 
 	// avoid hot GPU devices and continuously balance the workload
-	// implemented by trigger a simulation scheduling and advise better GPU nodes for scheduler
+	// implemented by mark GPU as hot and trigger evict for re-scheduling
+	// The hot GPUs will get lower priority for scheduling
+	// TODO: not implemented yet
 	// +optional
 	ReBalancer *ReBalancerConfig `json:"reBalancer,omitempty"`
 
@@ -41,6 +44,14 @@ type SchedulingConfigTemplateSpec struct {
 	Hypervisor *HypervisorScheduling `json:"hypervisor,omitempty"`
 }
 
+type VerticalScalingRule struct {
+	Name string `json:"name,omitempty"`
+
+	// Rule auto applied in webhook, when pod matches the selector,
+	// the rule will be added into workload profile's autoScalingConfig and annotation
+	Selector metav1.LabelSelector `json:"selector,omitempty"`
+	Rule *AutoScalingConfig `json:"autoScaling,omitempty"`
+}
 type PlacementConfig struct {
 	// +kubebuilder:default=NodeCompactGPULowLoad
 	Mode PlacementMode `json:"mode"`
@@ -89,16 +100,13 @@ type GPUFilter struct {
 }
 
 type AutoScalingConfig struct {
-	// layer 1 adjusting, to match the actual usage in the long run, only for N:M remote vGPU mode
-	// Adjust baseline requests to match the actual usage in longer period, such as 1day - 2weeks
-	AutoSetResources AutoSetResources `json:"autoSetResources,omitempty"`
-
-	// layer 2 horizontal auto-scaling, scale up to more GPU cards if max limits threshold hit
-	// HPA-like, aggregate metrics data 1m-1h (when tf-worker scaled-up, should also trigger client pod's owner[Deployment etc.]'s replica increasing, check if KNative works)
-	AutoSetReplicas AutoSetReplicas `json:"autoSetReplicas,omitempty"`
+	// Adjust baseline requests and limits to match the actual usage using recent metrics
+	AutoSetResources *AutoSetResources `json:"autoSetResources,omitempty"`
 
 	// CronScalingRules defines a list of CronScaling rules used to schedule scaling actions based on cron expressions.
 	CronScalingRules []CronScalingRule `json:"cronScalingRules,omitempty"`
+
+	ExternalScaler *ExternalScalerConfig `json:"externalScaler,omitempty"`
 }
 
 // CronScalingRule defines the rule for scaling resources based on a cron schedule.
@@ -115,102 +123,103 @@ type CronScalingRule struct {
 	End string `json:"end,omitempty"`
 	// DesiredResources specifies the target resources to scale to during the schedule.
 	DesiredResources Resources `json:"desiredResources,omitempty"`
-	// DesiredReplicas is the target number of replicas during the schedule.
-	DesiredReplicas *int32 `json:"desiredReplicas,omitempty"`
 }
 
 type AutoSetResources struct {
 	Enable bool `json:"enable,omitempty"`
 
-	// Target resource to scale, such as "tflops", "vram", or "all" by default
-	TargetResource string `json:"targetResource,omitempty"`
+	// Target resource to scale, such as "compute", "vram", or "all" by default
+	TargetResource ScalingTargetResource `json:"targetResource,omitempty"`
 
-	// Tflops usage percentile that will be used as a base for tflops target recommendation. Default: 0.9
-	TargetTflopsPercentile string `json:"targettflopspercentile,omitempty"`
+	// Tflops usage percentile that will be used as a base for tflops target recommendation. Default: 0.95
+	TargetComputePercentile string `json:"targetComputePercentile,omitempty"`
 
 	// Tflops usage percentile that will be used for the lower bound on tflops recommendation. Default: 0.5
-	LowerBoundTflopsPercentile string `json:"lowerboundtflopspercentile,omitempty"`
+	// When QoS is low or medium, request set to lower bound
+	LowerBoundComputePercentile string `json:"lowerBoundComputePercentile,omitempty"`
 
-	// Tflops usage percentile that will be used for the upper bound on tflops recommendation. Default: 0.95
-	UpperBoundTflopsPercentile string `json:"upperboundtflopspercentile,omitempty"`
+	// Tflops usage percentile that will be used for the upper bound on tflops recommendation. Default: 0.99
+	// Limit will be set to upper bound, when QoS is critical, also set limit request to upper bound
+	UpperBoundComputePercentile string `json:"upperBoundComputePercentile,omitempty"`
 
-	// Vram usage percentile that will be used as a base for vram target recommendation. Default: 0.9
-	TargetVramPercentile string `json:"targetvrampercentile,omitempty"`
+	// Vram usage percentile that will be used as a base for vram target recommendation. Default: 0.95
+	// The requests will be set to match this percentile of the actual usage, but won't change when current requests is in lower and upper bounds
+	// When QoS is high, set request to target
+	TargetVRAMPercentile string `json:"targetVRAMPercentile,omitempty"`
 
 	// Vram usage percentile that will be used for the lower bound on vram recommendation. Default: 0.5
-	LowerBoundVramPercentile string `json:"lowerboundvrampercentile,omitempty"`
+	LowerBoundVRAMPercentile string `json:"lowerBoundVRAMPercentile,omitempty"`
 
-	// Vram usage percentile that will be used for the upper bound on vram recommendation. Default: 0.95
-	UpperBoundVramPercentile string `json:"upperboundvrampercentile,omitempty"`
+	// Vram usage percentile that will be used for the upper bound on vram recommendation. Default: 0.99
+	UpperBoundVRAMPercentile string `json:"upperBoundVRAMPercentile,omitempty"`
 
 	// Fraction of usage added as the safety margin to the recommended request. Default: 0.15
-	RequestMarginFraction string `json:"requestMarginFraction,omitempty"`
+	MarginFraction string `json:"marginFraction,omitempty"`
 
-	// The time interval used for computing the confidence multiplier for the lower and upper bound. Default: 24h
-	ConfidenceInterval string `json:"confidenceInterval,omitempty"`
+	// Only when the difference between the recommended request and the current request is greater than this threshold, the request will be updated. Default: 0.1
+	// This value can't greater than MarginFraction, otherwise no update will be made since always inside the threshold after multiplying MarginFraction.
+	UpdateThreshold string `json:"updateThreshold,omitempty"`
 
-	// How much time back TSDB have to be queried to get historical metrics. Default: 1d
-	HistoryLength string `json:"historyLength,omitempty"`
+	// How much time back TSDB have to be queried to get historical metrics. Default: 2h
+	HistoryDataPeriod string `json:"historyDataPeriod,omitempty"`
 
-	// Resolution at which TSDB is queried for historical metrics. Default: 1m
-	HistoryResolution string `json:"historyResolution,omitempty"`
-}
+	// Min scaling ratio to original resources, e.g. request 10Gi, ratio 0.5, scale down limit to 5Gi, default: 0.2
+	MinVRAMResourcesRatio string `json:"minVRAMResourcesRatio,omitempty"`
 
-// A typical autoLimits algorithm could be checking every 5m, look back 1 day data,
-// select 99% of actual usage as preferredLimits,
-// calculate finalPreferredLimits, which is preferredLimits*(1+extraBufferRatio)
-// if they are equal with each other within a range (eg. 5%), do nothing
-// if finalPreferredLimits is less than current limits and exceeded error range,
-// set current limits to finalPreferredLimits
-// if finalPreferredLimits > current limits and exceeded error range,
-// set current limits to max(finalPreferredLimits, current limits * scaleUpStep)
-// if AI prediction enabled, it helps to detect history pattern, and set more reasonable, explainable limit value
-// the final set limits should be max(finalPreferredLimits, last(predict_value * (1 + extraTFlopsBufferRatio)))
-type AutoSetLimits struct {
-	Enable bool `json:"enable,omitempty"`
+	// Max scaling ratio to original resources, e.g. request 10Gi, ratio 2.0, scale up limit to 20Gi, default: 5.0
	MaxVRAMResourcesRatio string `json:"maxVRAMResourcesRatio,omitempty"`
 
-	// target resource to scale limits, such as "tflops", "vram", or "all" by default
-	TargetResource string `json:"targetResource,omitempty"`
+	// Min scaling ratio to original resources, e.g. request 10Gi, ratio 0.5, scale down limit to 5Gi, default: 0.1
+	// This ratio only apply to tflops/compute request rather than limit, to avoid performance downgrade when not used for a long time
+	MinComputeResourcesRatio string `json:"minComputeResourcesRatio,omitempty"`
 
-	EvaluationPeriod string `json:"evaluationPeriod,omitempty"`
+	// Max scaling ratio to original resources, e.g. request 10Gi, ratio 2.0, scale up limit to 20Gi, default: 10.0
+	MaxComputeResourcesRatio string `json:"maxComputeResourcesRatio,omitempty"`
 
-	ExtraTFlopsBufferRatio string `json:"extraTFlopsBufferRatio,omitempty"`
+	// When workload is created, wait for this period to collect enough metrics before scaling, default: 30m
+	InitialDelayPeriod string `json:"initialDelayPeriod,omitempty"`
 
-	IgnoredDeltaRange string `json:"ignoredDeltaRange,omitempty"`
+	// How often to evaluate the scaling operation, default: same as global config's auto scaling interval
+	Interval string `json:"interval,omitempty"`
+}
 
-	ScaleUpStep string `json:"scaleUpStep,omitempty"`
+type ScalingTargetResource string
 
-	// the multiplier of requests, to avoid limit set too high, like 5.0
-	MaxRatioToRequests string `json:"maxRatioToRequests,omitempty"`
+const (
+	ScalingTargetResourceCompute ScalingTargetResource = "compute"
+	ScalingTargetResourceVRAM    ScalingTargetResource = "vram"
+	ScalingTargetResourceAll     ScalingTargetResource = "all"
+)
 
-	Prediction *SmartSchedulerModelInput `json:"prediction,omitempty"`
-}
+type ExternalScalerConfig struct {
+	Enable bool `json:"enable,omitempty"`
 
-// To handle burst traffic, scale up in short time (this feature requires GPU context migration & replication, not available yet)
-type AutoSetReplicas struct {
-	Enable bool `json:"enable,omitempty"`
-	TargetTFlopsOfLimits string `json:"targetTFlopsOfLimits,omitempty"`
-	EvaluationPeriod string `json:"evaluationPeriod,omitempty"`
-	ScaleUpStep string `json:"scaleUpStep,omitempty"`
-	ScaleDownStep string `json:"scaleDownStep,omitempty"`
-	ScaleUpCoolDownTime string `json:"scaleUpCoolDownTime,omitempty"`
-	ScaleDownCoolDownTime string `json:"scaleDownCoolDownTime,omitempty"`
-}
+	URL string `json:"url,omitempty"`
 
-type AutoSetRequests struct {
-	Enable bool `json:"enable,omitempty"`
+	// API key will be set into the request header as "Authorization: Bearer <api key>"
+	APIKeySecretRef *v1.SecretReference `json:"apiKeySecretRef,omitempty"`
 
-	// target resource to scale requests, such as "tflops", "vram", or "all" by default
-	TargetResource string `json:"targetResource,omitempty"`
+	InitialDelayPeriod string `json:"initialDelayPeriod,omitempty"`
+
+	// How often to evaluate the scaling operation, default: same as global config's auto scaling interval
+	Interval string `json:"interval,omitempty"`
+}
+
+type ExternalScalerRequest struct {
+	WorkloadName string `json:"workloadName,omitempty"`
	Namespace string `json:"namespace,omitempty"`
+	CurrentResources Resources `json:"currentResources,omitempty"`
+}
 
-	PercentileForAutoRequests string `json:"percentileForAutoRequests,omitempty"`
+type ExternalScalerResponse struct {
+	NeedScaleUp bool `json:"needScaleUp,omitempty"`
+	NeedScaleDown bool `json:"needScaleDown,omitempty"`
 
-	// the request buffer ratio, for example actual usage is 1.0, 10% buffer will be 1.1 as final preferred requests
-	ExtraBufferRatio string `json:"extraBufferRatio,omitempty"`
+	// Explain why the scaling operation is needed or not needed, recorded to event and workload status
+	Reason string `json:"reason,omitempty"`
 
-	EvaluationPeriod string `json:"evaluationPeriod,omitempty"`
-	AggregationPeriod string `json:"aggregationPeriod,omitempty"`
-	Prediction SmartSchedulerModelInput `json:"prediction,omitempty"`
+	// If no scaling operation needed, this could be zero value
+	RecommendedResources Resources `json:"recommendedResources,omitempty"`
 }
 
 type AutoFreezeAndResume struct {

api/v1/workloadprofile_types.go

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ type WorkloadProfileSpec struct {
 	// +optional
 	// AutoScalingConfig configured here will override Pool's schedulingConfig
 	// This field can not be fully supported in annotation, if user want to enable auto-scaling in annotation,
-	// user can set tensor-fusion.ai/auto-resources|replicas: 'true'
+	// user can set tensor-fusion.ai/autoscale: 'true'
 	AutoScalingConfig AutoScalingConfig `json:"autoScalingConfig,omitempty"`
 
 	// +optional
