fix: autoscale refactor, support multi rules and external scaler (#470)
* fix: dynamic auto scale eval interval
* fix: autoscale refactor, support multi rules and external scaler
* fix: autoscale unit test issue
* fix: autoscaler refactor
* fix: autoscale unit test issues
* fix: unit test issue
* fix: unit test issue
* fix: simplify tests
* fix: lint issue
TensorFusion is building large-scale heterogeneous GPU pooling and scheduling AI infrastructure using cloud-native ecosystem projects and libraries. It helps enterprises save GPU costs, simplify O&M, increase observability, and boost elasticity.

Underlying tech in this repo: Kubebuilder, Scheduler, CDI. Not in this repo: fractional GPU based on user-space time-division sharing, and GPU-over-IP based on API forwarding.

Critical Modules:
- pod mutating webhook that augments user pods and adds the needed inputs and outputs
- advanced scheduler with allocator/GPU-resource vertical scaler/bin-packing/rebalancer/quotas
- custom resource operator: GPU cluster -> pool -> gpunode -> gpu, and gpunodeclaim -> node -> gpunode; maintains resources and TensorFusion component status, evaluates alerts, etc.
- hypervisor: works like kubelet, reconciles TensorFusion workers on each GPU node, discovers and binds devices, handles multi-process priority and autoFreeze, produces metrics, etc.
- server: offers an API to assign remote vGPU workers and exposes system debug endpoints
- cloud provider integration (direct integration or with karpenter).
- indexallocator: a special module that works around the CDI device plugin Allocate interface not being able to get Pod info. Without CDI container -> Pod matching, it is not possible to get advanced allocation info (a hack until k8s DRA is deployed). It uses a dummy resource name and number to compose a special index that is passed to the hypervisor. This is not a general device plugin pattern; keep this context in mind only when changing device allocation and device plugin related functions.

# Requirements

You are a professional cloud-native and AI infra engineer. Write high-quality, robust code following Golang and k8s best practices.
Confirm the plan, then write code.
Always be user-centric: for every task, think through the whole user workflow and scenario, and how an AI inference/training app runs on this system. No hidden logic; use concise, strong type definitions.
Field definitions are in the @api/v1 package; always think about the best data structure when CRD changes are needed.
Don't abstract too much or too little: extract interfaces based on business understanding, and don't extract an interface when it is not needed.
Extract a function when it's larger than 50-80 lines; otherwise prefer a simple single function for one responsibility.
Use modern Golang features, e.g. any rather than interface{}, generics where needed, etc.
Never reinvent the wheel: think about how the Kubernetes source code and Kubernetes SIGs do it, and leverage utils and constants packages and already-introduced dependencies.
Always prioritize security, scalability, and maintainability.
Think about tricky k8s issues like resource conflicts and finalizers; use DeepCopy rather than field-by-field assignment, and equality.Semantic.DeepEqual rather than hard-coded comparison.
Never write a large task at once; break it into smaller ones.
Only write necessary comments, e.g. for complex algorithms and background info; never write stupid comments.
Always remember to add events via the Kubernetes event recorder and logs for KEY code paths, which are important for user observability and troubleshooting, but don't emit too many events.
Always be test-driven: write Ginkgo-based test cases, keep running go/ginkgo test commands, and review and refactor the code until the tests pass; if the tests fail or perform poorly, continue.
When a task introduces new in-memory state, consider exposing it to the server module for troubleshooting.
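
As an illustration of the guideline on modern Golang features (any and generics), here is a minimal sketch. The helpers `Ptr` and `ValueOr` are hypothetical names chosen for this example and are not taken from the repository; they show how one generic function can replace several per-type helpers for optional pointer fields, which are common in CRD types.

```go
package main

import "fmt"

// Ptr returns a pointer to v.
// Hypothetical helper for illustration only, not from the repository.
func Ptr[T any](v T) *T {
	return &v
}

// ValueOr dereferences p, or returns fallback when p is nil.
// Hypothetical helper for illustration only, not from the repository.
func ValueOr[T any](p *T, fallback T) T {
	if p == nil {
		return fallback
	}
	return *p
}

func main() {
	// An unset optional field falls back to the default.
	var replicas *int32
	fmt.Println(ValueOr(replicas, 1))

	// A set optional field returns its own value.
	replicas = Ptr(int32(3))
	fmt.Println(ValueOr(replicas, 1))
}
```

The type parameter is constrained with `any` rather than `interface{}`, and the compiler infers `T` from the arguments, so call sites stay as short as with a non-generic helper.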
// Only when the difference between the recommended request and the current request is greater than this threshold, the request will be updated. Default: 0.1
// This value can't be greater than MarginFraction; otherwise no update will ever be made, since the difference always stays inside the threshold after multiplying by MarginFraction.
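
The threshold rule described in these comments could be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function name `shouldUpdate` is hypothetical, and the difference is assumed to be measured relative to the current request.

```go
package main

import (
	"fmt"
	"math"
)

// shouldUpdate reports whether the recommended request differs from the
// current request by more than threshold.
// Assumption: the difference is relative to the current request; the
// function name and signature are hypothetical.
func shouldUpdate(current, recommended, threshold float64) bool {
	if current == 0 {
		// No baseline to compare against: update whenever a non-zero
		// recommendation exists.
		return recommended != 0
	}
	return math.Abs(recommended-current)/current > threshold
}

func main() {
	// 5% change with the default 0.1 threshold: no update.
	fmt.Println(shouldUpdate(10, 10.5, 0.1))
	// 20% change with the default 0.1 threshold: update.
	fmt.Println(shouldUpdate(10, 12, 0.1))
}
```

Skipping small changes like this avoids churning the workload's requests on every evaluation when the recommendation only fluctuates slightly.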