Description
Project summary
llm-d is a Kubernetes-native high-performance distributed LLM inference framework
Project description
llm-d is a high-performance distributed inference serving stack optimized for production deployments on Kubernetes. It helps organizations enable the fastest time to state-of-the-art (SOTA) performance for key open-source large language models across major hardware accelerators and infrastructure providers through well-tested guides, reproducible workflows, and real-world benchmarks.
Modern LLM inference introduces new workload challenges: managing KV cache state at scale, balancing prefill and decode phases, multi-node configurations, maintaining low latency under multi-tenant traffic, and efficiently utilizing heterogeneous accelerators. llm-d builds on vLLM and Kubernetes Gateway API Inference Extension (GAIE), adding intelligent inference scheduling, prefix-cache-aware routing, prefill/decode disaggregation, hierarchical KV offloading across GPU/CPU/storage tiers, and traffic- and hardware-aware autoscaling. Together, these capabilities deliver a modular architecture for high-performance serving across diverse hardware, including NVIDIA, AMD, Intel, and Google TPU accelerators.
At its core, llm-d follows a “well-lit paths” philosophy: validated, production-ready deployment patterns that consistently deliver sustained performance on real accelerator topologies. Benchmarked recipes—such as prefill/decode disaggregation and wide expert parallelism—are tested end-to-end under realistic load and packaged with reproducible workflows, ensuring measurable improvements in TTFT, TPOT, and sustained throughput while accelerating time to production-grade distributed inference.
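For concreteness, the two latency metrics named above can be defined from per-request timestamps. This is an illustrative sketch of the standard definitions, not llm-d benchmark code:

```python
# Illustrative definitions of the latency metrics referenced above
# (standard formulations; not taken from the llm-d codebase).

def ttft(request_start: float, first_token_time: float) -> float:
    """Time To First Token: latency from request arrival to the first generated token."""
    return first_token_time - request_start

def tpot(first_token_time: float, last_token_time: float, num_output_tokens: int) -> float:
    """Time Per Output Token: mean inter-token latency after the first token."""
    if num_output_tokens <= 1:
        return 0.0
    return (last_token_time - first_token_time) / (num_output_tokens - 1)

# Example: request arrives at t=0.0s, first token at 0.25s,
# and 101 total output tokens finish at 2.25s.
print(ttft(0.0, 0.25))        # 0.25
print(tpot(0.25, 2.25, 101))  # 0.02
```

Prefill time dominates TTFT, while decode throughput governs TPOT, which is why the prefill/decode disaggregation recipes above can improve each metric independently.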
As a vendor-neutral, Apache 2.0 project built on cloud-native primitives, llm-d treats distributed inference as a first-class cloud-native workload and fosters an open ecosystem for scalable, portable, and production-grade generative AI infrastructure.
Org repo URL (provide if all repos under the org are in scope of the application)
Project repo URL in scope of application
https://github.com/llm-d/llm-d
Additional repos in scope of the application
All repos under both the llm-d and llm-d-incubation organizations are listed here for clarity.
llm-d
https://github.com/llm-d/llm-d-workload-variant-autoscaler
https://github.com/llm-d/llm-d-kv-cache
https://github.com/llm-d/llm-d-benchmark
https://github.com/llm-d/llm-d-inference-scheduler
https://github.com/llm-d/llm-d-inference-sim
https://github.com/llm-d/llm-d.github.io
https://github.com/llm-d/llm-d-pd-utils
https://github.com/llm-d/llm-d-prism
https://github.com/llm-d/llm-d-infra
https://github.com/llm-d/llm-d-python-template
https://github.com/llm-d/llm-d-go-template
https://github.com/llm-d/llm-d-deployer
https://github.com/llm-d/llm-d-model-service
llm-d-incubation
https://github.com/llm-d-incubation/llm-d-fast-model-actuation
https://github.com/llm-d-incubation/batch-gateway
https://github.com/llm-d-incubation/llm-d-async
https://github.com/llm-d-incubation/llm-d-modelservice
https://github.com/llm-d-incubation/secure-inference
https://github.com/llm-d-incubation/llm-d-ci
https://github.com/llm-d-incubation/llm-d-infra
https://github.com/llm-d-incubation/ig-wva
https://github.com/llm-d-incubation/hermes
Website URL
Roadmap
Roadmap context
The roadmap and overall direction of the project are contained in the original project proposal.
Contributing guide
https://github.com/llm-d/llm-d/blob/main/CONTRIBUTING.md
Code of Conduct (CoC)
https://github.com/llm-d/llm-d/blob/main/CODE_OF_CONDUCT.md
Adopters
https://github.com/llm-d/llm-d/blob/main/ADOPTERS.md
Maintainers file
https://github.com/llm-d/llm-d/blob/main/MAINTAINERS.md
Security policy file
https://github.com/llm-d/llm-d/blob/main/SECURITY.md
Standard or specification?
Full details can be found in the project proposal, including links to the design and north star documents
Business product or service to project separation
Multiple commercial offerings (e.g., Red Hat AI, Google Cloud) include llm-d as one of several downstream components, not as their sole product. To ensure clear separation between the community project and commercial products, we are committed to the following:
Neutral Governance: The project's direction is determined by a diverse set of maintainers and contributors from across the ecosystem, including Red Hat, IBM, Google, and others.
Open Development: All development occurs transparently on GitHub. The roadmap is community-driven and focused on upstream alignment.
Standardization: The project’s primary goal is to establish "well-lit paths" for distributed inference that are vendor-neutral and hardware-agnostic. This ensures that the core technology remains a portable capability of the cloud-native stack, independent of any specific commercial product lifecycle or proprietary feature set.
Distinct Branding and Community: llm-d maintains its own identity, documentation, and community engagement channels (such as the llm-d.ai community site and the llm-d Slack workspace) to differentiate the open-source project from downstream enterprise distributions.
Why CNCF?
Why do you want to contribute the project to the CNCF? What value does being part of the CNCF provide the project?
We seek to contribute llm-d to the CNCF to further a vendor-neutral foundation that fosters broad industry collaboration and standardization for distributed AI inference. While Kubernetes has become the de facto operating system for AI, serving Large Language Models (LLMs) at scale introduces unique infrastructure challenges that require a community-driven approach rather than a vendor-siloed one. By moving to a "project-centric" governance model, we aim to reassure adopters and attract contributors from across the ecosystem, building on the multi-vendor coalition that already supports the project, including Red Hat, IBM, Google, CoreWeave, NVIDIA, and AMD. Ultimately, our goal is to make production-grade generative AI as ubiquitous and standardized as Linux by providing a "well-lit path" for high-performance serving on any cloud or hardware accelerator.
What value does being part of the CNCF provide the project?
Joining the CNCF provides llm-d with the institutional trust and visibility necessary to scale from an experimental framework to a standard used by leading AI research organizations, large foundation model builders, AI native companies, and traditional enterprises. The foundation’s rigorous standards for legal hygiene, IP management, and security, including alignment with OpenSSF Best Practices, provide the assurance required for deployment in mission-critical environments.
Furthermore, being part of the CNCF enables deep technical interoperability and alignment with foundational projects like the Gateway API Inference Extension (GAIE), KServe, and Envoy. This community-backed integration allows the project to address the conflicting computational demands of inference phases—specifically the bottlenecks between prompt processing and token generation—through standard, portable APIs.
Access to the CNCF’s diverse community will accelerate our mission to be a well-lit path for anyone to serve LLMs at scale, offering the fastest time-to-value and competitive performance per dollar for most models across a diverse and comprehensive set of hardware accelerators.
Benefit to the landscape
Adding llm-d to the CNCF landscape provides a standardized, production-ready framework for managing the unique computational demands of distributed Large Language Model (LLM) inference. While Kubernetes has become the industry standard for orchestration, it was not originally designed for stateful, latency-sensitive AI workloads where request cost varies dramatically based on prompt length, cache locality, and model phase. Traditional service routing and autoscaling mechanisms are unaware of inference state, leading to inefficient placement, cache fragmentation, and unpredictable latency under load. By disaggregating inference phases into independently scalable pods and utilizing the Endpoint Picker (EPP) for programmable, inference-aware endpoint selection, llm-d introduces model- and state-aware routing policies that better align request placement with accelerator characteristics and workload dynamics. This enables improved hardware utilization across diverse accelerators (NVIDIA, AMD, TPU, and XPU) and supports measurable gains in Time to First Token (TTFT) and sustained throughput under strict service level objectives. By providing a pre-integrated, Kubernetes-native stack, llm-d significantly lowers the bar for the majority of CNCF users to achieve efficient and reliable production serving.
The project further benefits the ecosystem by establishing "well-lit paths": proven, replicable blueprints that transform AI infrastructure from bespoke and fragile "black boxes" into manageable, cloud-native microservices. As a primary implementation aligned with the Kubernetes Gateway API Inference Extension (GAIE), llm-d incorporates the Endpoint Picker (EPP), which extends GAIE with programmable, inference-aware endpoint selection, and drives upstream API alignment. This approach standardizes critical serving capabilities such as cache-aware routing and policy-driven scheduling while remaining vendor-neutral. By building on open APIs and extensible gateway primitives, llm-d helps prevent infrastructure lock-in and ensures that high-performance AI serving remains a core, portable capability of the cloud-native stack.
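To illustrate the idea behind cache-aware, inference-aware endpoint selection, the following is a hypothetical, simplified sketch; the function names, pod fields, and scoring weights are invented for illustration and do not reflect the actual llm-d or GAIE EPP implementation:

```python
# Hypothetical sketch of prefix-cache-aware endpoint scoring, in the
# spirit of the EPP described above. All names and weights here are
# illustrative assumptions, not the real llm-d / GAIE API.

def shared_prefix_len(prompt_tokens: list[int], cached_tokens: list[int]) -> int:
    """Length of the common token prefix between a prompt and a pod's KV cache."""
    n = 0
    for a, b in zip(prompt_tokens, cached_tokens):
        if a != b:
            break
        n += 1
    return n

def pick_endpoint(prompt_tokens: list[int], pods: list[dict]) -> str:
    """Score each pod by expected cache reuse minus queue pressure; return the best.

    `pods` is a list of dicts: {"name", "cached_tokens", "queue_depth"}.
    """
    def score(pod: dict) -> int:
        reuse = shared_prefix_len(prompt_tokens, pod["cached_tokens"])
        return reuse - 10 * pod["queue_depth"]  # illustrative weighting
    return max(pods, key=score)["name"]

pods = [
    {"name": "pod-a", "cached_tokens": [1, 2, 3, 4], "queue_depth": 0},
    {"name": "pod-b", "cached_tokens": [1, 2, 9], "queue_depth": 0},
]
print(pick_endpoint([1, 2, 3, 4, 5], pods))  # pod-a
```

A production scheduler balances many more signals (load, SLO class, phase, hardware), but the core trade-off is the same: route toward cached prefixes without overloading any single replica.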
Cloud native 'fit'
llm-d functions as a specialized orchestration layer for Large Language Model workloads, bridging the gap between high-level control planes like KServe and low-level inference engines like vLLM. By utilizing Kubernetes-native primitives like LeaderWorkerSet (LWS), the project transforms complex distributed inference into a manageable, scalable, and observable cloud-native workload. This allows organizations to move away from monolithic serving toward a modular architecture that separates request scheduling, model replication, and hierarchical cache management.
Within the CNCF landscape, the project provides the technical foundation for "inference-aware" traffic management via the Endpoint Picker (EPP), aligning with recent developments in the Kubernetes Gateway API ecosystem to optimize AI-specific routing. It specifically addresses the resource-utilization asymmetry between prompt processing and token generation, enabling independent scaling and optimization for each phase across diverse hardware accelerators. By driving alignment with upstream standards, llm-d ensures that high-performance AI capacity remains a core, vendor-neutral capability of the cloud-native stack.
Cloud native 'integration'
llm-d is designed as a modular orchestration layer that depends on several foundational CNCF projects, most notably Kubernetes, which serves as its primary infrastructure orchestrator and workload control plane. A critical technical dependency is the Gateway API Inference Extension (GAIE); llm-d acts as a primary implementation for this standard, leveraging Envoy’s external processing (ext-proc) capabilities to enable inference-aware request scheduling. While it utilizes the vLLM engine for model execution, the project transforms these engines into manageable cloud-native workloads by using LeaderWorkerSet (LWS) to orchestrate complex multi-node replicas and expert parallelism.
The project further complements high-level AI control planes like KServe by acting as a specialized data-plane extension, integrating via the LLMInferenceService custom resource to support advanced features like disaggregated serving and prefix caching. It provides the technical foundation for observability in AI workloads, exporting specialized metrics such as Time to First Token (TTFT) and KV cache hit rates to Prometheus and Grafana. Additionally, llm-d has partnered with LeaderWorkerSet and Kueue, making them among the first ecosystem projects to support scheduling of multi-node replicas and topology-aware scheduling for optimal high-performance networking in expert-parallel serving and disaggregation. As a result, Kueue is now better able to support advanced accelerator pooling between training and inference, creating a cohesive platform that allows Kubernetes to serve as a unified substrate for demanding generative AI tasks.
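Exporting inference-specific metrics of the kind described above can be sketched with the standard `prometheus_client` library. The metric names and bucket boundaries below are illustrative assumptions, not llm-d's actual metric names:

```python
# Minimal sketch of exposing inference-specific metrics (TTFT, KV cache
# hit rate) in Prometheus format. Metric names are illustrative, not the
# actual names exported by llm-d.
from prometheus_client import Histogram, Gauge, generate_latest

TTFT_SECONDS = Histogram(
    "llm_ttft_seconds",  # illustrative name
    "Time to first token, in seconds",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
KV_CACHE_HIT_RATE = Gauge(
    "llm_kv_cache_hit_rate",  # illustrative name
    "Fraction of KV-cache blocks served from cache",
)

# Record one request's observations.
TTFT_SECONDS.observe(0.18)
KV_CACHE_HIT_RATE.set(0.72)

# generate_latest() renders the Prometheus text exposition format that a
# /metrics endpoint would serve for scraping into Grafana dashboards.
print(generate_latest().decode())
```

Histograms (rather than gauges) for TTFT allow SLO queries such as the p95 over a window, which is what "strict service level objectives" monitoring typically requires.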
Cloud native overlap
llm-d overlaps with some foundational and emerging CNCF projects that address model serving and resource orchestration, most notably KServe, Volcano, Kueue, and KAITO.
The llm-d project overlaps with Volcano (Incubating) through its sub-project, Kthena, which is also designed for high-performance Large Language Model inference routing and scheduling. While Kthena leverages Volcano’s established expertise in gang scheduling and serving groups, llm-d utilizes standard Kubernetes primitives like LeaderWorkerSet (LWS) to orchestrate multi-node scalability and disaggregated serving.
llm-d overlaps with KAITO (Sandbox), a toolchain operator that automates the deployment of model inference services and their underlying compute. However, llm-d is unopinionated about how model servers are deployed and instead provides clear examples and guidelines to enable organizations to properly configure their own deployments. Finally, while generic benchmarking tools exist, the Prism component of llm-d provides a specialized performance analysis framework specifically for the visualization and comparison of domain-specific inference benchmarks of distributed disaggregated inference stacks.
Similar projects
Some external projects address similar distributed inference challenges, most notably AIBrix and NVIDIA Dynamo. AIBrix is an open source platform from ByteDance that provides a comprehensive control plane for vLLM and other engines. It features specialized container lifecycle management through its StormService for prefill and decode disaggregation, as well as high-density LoRA management for long-tail scenarios. While AIBrix offers an opinionated and integrated serving platform, llm-d differentiates itself by adapting to a user's existing infrastructure and prioritizing the standardization of upstream Kubernetes APIs. Similarly, NVIDIA Dynamo is a high-performance framework optimized for hyperscale deployments on NVIDIA infrastructure. It supports multi-node disaggregated serving and offers distinct deployment modes, including a Kubernetes-native option using LeaderWorkerSet. Unlike Dynamo, llm-d focuses on achieving dynamic efficiency at runtime based on changing traffic patterns and aligns with the vendor-neutral standards of the Gateway API.
Other relevant projects include SGLang and the vLLM Production Stack. SGLang provides components with functionality similar to parts of llm-d, such as the SGLang router's prefix-cache awareness, though it lacks multi-workload fairness and policy controls alongside that awareness. The vLLM Production Stack serves as a reference implementation for integrating vLLM in production and provides foundational components for request routing and KV cache offloading via LMCache. Additionally, projects like DeepSpeed-MII and SkyPilot offer complementary capabilities: DeepSpeed-MII focuses on low-latency model execution through optimized system libraries, while SkyPilot provides multi-cloud scheduling and GPU provisioning. While these projects share the goal of optimizing Large Language Model serving, llm-d is unique in its focus on creating a modular, unopinionated orchestration layer that bridges the gap between these specialized engines and standard cloud-native infrastructure.
Landscape
No
Trademark and accounts
- If the project is accepted, I agree to donate all project trademarks and accounts to the CNCF
IP policy
- If the project is accepted, I agree the project will follow the CNCF IP Policy
Will the project require a license exception?
N/A
Project "Domain Technical Review"
No
Application contact email(s)
Carlos Costa, chcost@us.ibm.com
Robert Shaw, robshaw@redhat.com
Clayton Coleman, claytoncoleman@google.com
Contributing or sponsoring entity signatory information
| Name | Country | Email address |
|---|---|---|
| Cara Delia | USA | cdelia@redhat.com |
| Pete Cheslock | USA | pchesloc@redhat.com |
CNCF contacts
Karena Angell, kangell@redhat.com
Additional information
No response