Open
32 commits
4a29728
fix: extract limiter and accelerator to c ABI
Code2Life Nov 17, 2025
30cc96e
fix: bump deps, inject-container convert from limits
Code2Life Nov 18, 2025
94c3576
Revert "fix: bump deps, inject-container convert from limits"
Code2Life Nov 18, 2025
d434888
feat: add device controller
Code2Life Nov 18, 2025
11acd70
fix: refactor hypervisor
Code2Life Nov 19, 2025
9d95e57
feat: partitioned scheduling
Code2Life Nov 19, 2025
450858d
fix: support partition allocation in scheduler
Code2Life Nov 20, 2025
ec58d18
fix: lint issues
Code2Life Nov 20, 2025
421a93e
fix: unit test issues
Code2Life Nov 20, 2025
02f4359
chore: lint
Code2Life Nov 20, 2025
b2b8a7b
fix: optimize wording
Code2Life Nov 20, 2025
c6f73cc
fix: update cr info
0x5457 Nov 20, 2025
16e51a0
fix: unit test issues
0x5457 Nov 20, 2025
99c485b
fix: update readme
Code2Life Nov 20, 2025
2777027
fix: hypervisor debug and public manifests
Code2Life Nov 20, 2025
219bae5
fix: optimize hypervisor pod watcher
Code2Life Nov 21, 2025
2fda5a9
fix: partition mode issues, refactor hypervisor
Code2Life Nov 23, 2025
923676b
fix: compile issues
Code2Life Nov 23, 2025
3622772
fix: tui issue
Code2Life Nov 23, 2025
878e6aa
fix: hypervisor refactor
Code2Life Nov 23, 2025
d3342ef
fix: lint issue
Code2Life Nov 23, 2025
c4c0cde
fix: optimize typing
Code2Life Nov 26, 2025
1032869
fix: optimize hypervisor
Code2Life Nov 27, 2025
f6a0539
fix: bump deps
Code2Life Nov 28, 2025
57be41f
fix: hypervisor name mismatch and test case issue
Code2Life Nov 28, 2025
5bcf96a
fix: karpenter permission issue
Code2Life Dec 1, 2025
12750de
fix: pod index split
Code2Life Dec 2, 2025
e669edd
fix: refactor hypervisor
Code2Life Dec 3, 2025
22b3c17
fix: support heterogeneous devices, add telemetry
Code2Life Dec 5, 2025
40df300
fix: index queue issue
Code2Life Dec 5, 2025
7ad96fc
fix: unit test
Code2Life Dec 5, 2025
c3ce8bb
fix: unit test
Code2Life Dec 5, 2025
10 changes: 6 additions & 4 deletions .github/workflows/release.yml
@@ -77,7 +77,7 @@ jobs:
build-args: |
GO_LDFLAGS=-X 'github.com/NexusGPU/tensor-fusion/internal/version.BuildVersion=${{ needs.release.outputs.version }}'

publish_node_discovery_image:
publish_hypervisor_image:
needs:
- release
if: needs.release.outputs.published == 'true' || github.event_name == 'workflow_dispatch'
@@ -95,7 +95,7 @@ jobs:
- id: meta
uses: docker/metadata-action@v5
with:
images: tensorfusion/tensor-fusion-node-discovery
images: tensorfusion/tensor-fusion-hypervisor
tags: ${{ github.event_name == 'workflow_dispatch' && steps.set_tag.outputs.tag || format('type=semver,pattern={{{{version}}}},value={0}', needs.release.outputs.version) }}

- name: Login to DockerHub
@@ -104,12 +104,14 @@ jobs:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }}

- name: Build and push node discovery
- name: Build and push hypervisor
uses: docker/build-push-action@v6
with:
context: .
push: true
file: dockerfile/node-discovery.Dockerfile
file: dockerfile/hypervisor.Dockerfile
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
no-cache: true
build-args: |
GO_LDFLAGS=-X 'github.com/NexusGPU/tensor-fusion/internal/version.BuildVersion=${{ needs.release.outputs.version }}'
11 changes: 10 additions & 1 deletion .gitignore
@@ -40,4 +40,13 @@ __debug*
vendor
logs

*.prof
*.prof

provider/build

cmd/hypervisor/hypervisor
*.o

_obj

metrics.log
14 changes: 9 additions & 5 deletions .vscode/launch.json
@@ -21,15 +21,18 @@
]
},
{
"name": "Debug Discovery",
"name": "Debug Hypervisor",
"type": "go",
"request": "launch",
"mode": "auto",
"console": "integratedTerminal",
"env": {
"HOSTNAME": "mocknode",
"KUBECONFIG": "~/.kube/config",
"KUBECONFIG": "~/.kube/config-local-studio",
"HYPERVISOR_PORT": "8042",
"GPU_NODE_NAME": "ubuntu",
},
"program": "${workspaceFolder}/cmd/nodediscovery/main.go",
"cwd": "${workspaceFolder}",
"program": "${workspaceFolder}/cmd/hypervisor/main.go",
},
{
"name": "Debug Dev Env Operator",
@@ -62,7 +65,8 @@
"ENABLE_WEBHOOKS": "false",
"ENABLE_SCHEDULER": "true",
"ENABLE_CR_CONTROLLER": "true",
"NVIDIA_OPERATOR_PROGRESSIVE_MIGRATION": "true"
"NVIDIA_OPERATOR_PROGRESSIVE_MIGRATION": "true",
"IMPERSONATE_SERVICE_ACCOUNT": "system:serviceaccount:tensor-fusion-sys:tensor-fusion-sys"
},
"args": [
"--metrics-path", "${workspaceFolder}/logs/metrics.log",
55 changes: 53 additions & 2 deletions .vscode/settings.json
@@ -9,21 +9,29 @@
"AMDCDNA",
"AMDRDNA",
"apierrors",
"apiextensions",
"apimachinery",
"apimachineryruntime",
"apiruntime",
"apiserver",
"apiutil",
"appsv",
"automount",
"AWSGPU",
"batchv",
"Biren",
"bubbletea",
"BUILDPLATFORM",
"buildx",
"burstable",
"Cambricon",
"CDNA",
"Cerebras",
"certgen",
"certificaterequests",
"certmanager",
"CFLAGS",
"charmbracelet",
"clientcmd",
"clientcmdapi",
"clientgoscheme",
@@ -45,27 +53,36 @@
"datanode",
"deepcopy",
"defaultbinder",
"deviceplugin",
"dylib",
"eastus",
"envtest",
"essd",
"Eventf",
"eventhandlers",
"evictable",
"featuregate",
"finalizer",
"Finalizers",
"frameworkruntime",
"fsnotify",
"FULLTEXT",
"GOARCH",
"GOBIN",
"goconst",
"gocyclo",
"goerrors",
"golangci",
"golint",
"Gomega",
"gonic",
"GOPATH",
"gopsutil",
"gorm",
"gosec",
"GPGPU",
"gpuallocator",
"GPUIDs",
"gpunode",
"gpunodeclaim",
"gpunodeclaims",
@@ -86,8 +103,11 @@
"imageutils",
"indexallocator",
"influxdata",
"Infof",
"internalcache",
"internalqueue",
"intstr",
"IVSHMEM",
"jsonpatch",
"karpenter",
"karpv",
@@ -99,9 +119,14 @@
"kubescheduler",
"kubeschedulerconfig",
"kustomization",
"LDFLAGS",
"libaccelerator",
"libcuda",
"libnvidia",
"lineprotocol",
"lipgloss",
"LOCALBIN",
"logr",
"mapstructure",
"metav",
"metricsserver",
@@ -113,26 +138,36 @@
"nindent",
"nodeclaim",
"nodeclassref",
"nodelist",
"noderesources",
"nolint",
"NUMA",
"nvdp",
"Nvlink",
"NVML",
"objs",
"omitempty",
"onsi",
"pids",
"pluginapi",
"podname",
"portallocator",
"Postable",
"posthog",
"pprof",
"printcolumn",
"prometheusagents",
"prometheuses",
"prometheusrules",
"Ptrs",
"queuesort",
"Radeon",
"RDNA",
"readyz",
"replicaset",
"replicasets",
"rolebinding",
"RTXA",
"runbook",
"runpod",
"samber",
@@ -145,12 +180,19 @@
"schedv",
"serviceaccount",
"shirou",
"shmem",
"shortuuid",
"sqlmock",
"statefulset",
"statefulsets",
"stdbool",
"stddef",
"stdint",
"stdlib",
"strategicpatch",
"strategicpatches",
"stretchr",
"strncpy",
"subresource",
"Tabler",
"tensorfusion",
@@ -165,6 +207,8 @@
"testutil",
"tflops",
"timberio",
"Timeslicing",
"tmpfs",
"Tmpl",
"tokenreviews",
"Tolerations",
@@ -173,9 +217,16 @@
"utilerrors",
"utilruntime",
"vgpu",
"Warningf",
"webhookcorev",
"workerstate",
"workloadprofiles",
"workqueue",
"Xlarge"
]
"Xlarge",
"zapr"
],
"files.associations": {
"__locale": "cpp",
"bitset": "cpp"
}
}
20 changes: 20 additions & 0 deletions Makefile
@@ -110,6 +110,26 @@ build: manifests generate fmt vet ## Build manager binary.
run: manifests generate fmt vet ## Run a controller from your host.
go run ./cmd/main.go

.PHONY: build-provider
build-provider: ## Build accelerator stub library.
$(MAKE) -C provider stub

.PHONY: build-hypervisor
build-hypervisor: build-provider ## Build hypervisor binary with CGO enabled.
@PROVIDER_DIR=$$(pwd)/provider; \
CGO_ENABLED=1 \
CGO_CFLAGS="-I$$PROVIDER_DIR" \
go build -o bin/hypervisor ./cmd/hypervisor

.PHONY: build-hypervisor-tui
build-hypervisor-tui:
go build -o bin/hypervisor-tui ./cmd/hypervisor-tui


.PHONY: clean-cache
clean-cache: ## Clean Go build cache.
go clean -cache -testcache

# If you wish to build the manager image targeting other platforms you can use the --platform flag.
# (i.e. docker build --platform linux/arm64). However, you must enable docker buildKit for it.
# More info: https://docs.docker.com/develop/develop-images/build_enhancements/
20 changes: 12 additions & 8 deletions README.md
@@ -57,30 +57,34 @@ Tensor Fusion is a state-of-the-art **GPU virtualization and pooling solution**

- [x] Fractional GPU and flexible oversubscription
- [x] Remote GPU sharing with SOTA GPU-over-IP technology, less than 4% performance loss
- [x] GPU VRAM expansion and hot/warm/cold tiering
- [ ] None NVIDIA GPU/NPU vendor support
- [x] GPU VRAM expansion and hot/cold tiering
- [x] None NVIDIA GPU/NPU vendor support

### Pooling & Scheduling & Management

- [x] GPU/NPU pool management in Kubernetes
- [x] GPU-first scheduling and allocation, with single TFlops/MB precision
- [x] GPU node auto provisioning/termination
- [x] GPU-first scheduling and allocation, with 1 TFLOPs, 1% Computing, 1 MB precision
- [x] GPU node auto provisioning/termination, Karpenter integration
- [x] GPU compaction/bin-packing
- [x] Take full control of GPU allocation with precision targeting by vendor, model, device index, and more
- [x] Seamless onboarding experience for Pytorch, TensorFlow, llama.cpp, vLLM, Tensor-RT, SGlang and all popular AI training/serving frameworks
- [x] Seamless migration from existing NVIDIA operator and device-plugin stack
- [x] Centralized Dashboard & Control Plane
- [x] GPU-first autoscaling policies, auto set requests/limits/replicas
- [x] Request multiple vGPUs with group scheduling for large models
- [x] Support different QoS levels
- [x] Hardware partitioned mode isolation like NVIDIA Dynamic MIG
- [x] Support Kubernetes dynamic resource allocation (DRA) API

### Enterprise Features

- [x] GPU live-migration, snapshot and restore GPU context cross cluster
- [ ] AI model registry and preloading, build your own private MaaS(Model-as-a-Service)
- [ ] Advanced auto-scaling policies, scale to zero, rebalance of hot GPUs
- [x] Advanced auto-scaling policies, scale to zero, rebalance of hot GPUs
- [ ] Advanced observability features, detailed metrics & tracing/profiling of CUDA calls
- [ ] Monetize your GPU cluster by multi-tenancy usage measurement & billing report
- [ ] Enterprise level high availability and resilience, support topology aware scheduling, GPU node auto failover etc.
- [ ] Enterprise level security, complete on-premise deployment support
- [x] Monetize your GPU cluster by multi-tenancy usage measurement & billing report
- [x] Enterprise level high availability and resilience, support topology aware scheduling, GPU node auto failover etc.
- [x] Enterprise level security, complete on-premise deployment support
- [ ] Enterprise level compliance, SSO/SAML support, advanced audit, ReBAC control, SOC2 and other compliance reports available

### 🗳️ Platform Support