llm-d is a Kubernetes-native distributed inference serving stack – a well-lit path for anyone to serve large language models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

With llm-d, users can operationalize GenAI deployments with a modular solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW).

Built by leaders in the Kubernetes and vLLM projects, llm-d is a community-driven, Apache-2 licensed project with an open development model.

🧱 Architecture

llm-d adopts a layered architecture on top of industry-standard open technologies: vLLM, Kubernetes, and Inference Gateway.

[Figure: llm-d architecture]

Key features of llm-d include:

  • vLLM-Optimized Inference Scheduler: llm-d builds on IGW’s pattern for customizable “smart” load-balancing via the Endpoint Picker Protocol (EPP) to define vLLM-optimized scheduling. Leveraging operational telemetry, the Inference Scheduler implements the filtering and scoring algorithms needed to make decisions with P/D-, KV-cache-, SLA-, and load-awareness. Advanced teams can implement their own scorers to customize further, while still benefiting from other IGW features such as flow control and latency-aware balancing; a conceptual scoring sketch follows this list. See our Northstar design.
  • Disaggregated Serving with vLLM: llm-d leverages vLLM’s support for disaggregated serving to run prefill and decode on independent instances, using high-performance transport libraries like NIXL. In llm-d, we plan to support a latency-optimized implementation using fast interconnects (IB, RDMA, ICI) and a throughput-optimized implementation using data-center networking; a prefill/decode request-flow sketch follows this list. See our Northstar design.
  • Disaggregated Prefix Caching with vLLM: llm-d uses vLLM’s KVConnector to provide a pluggable KV cache hierarchy, including offloading KVs to host memory, remote storage, and systems like LMCache. We plan to support two KV caching schemes (a cache-lookup sketch follows this list). See our Northstar design.
    • Independent (N/S) caching with offloading to local memory and disk, providing a zero operational cost mechanism for offloading.
    • Shared (E/W) caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.
  • Variant Autoscaling over Hardware, Workload, and Traffic (🚧): We plan to implement a traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes into account different request shapes and QoS, and (c) assesses the recent traffic mix (QPS, QoS, and shapes) to calculate the optimal mix of instances for handling prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency; a capacity-mix sketch follows this list.
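
To make the scheduler’s filter-and-score pattern concrete, here is a minimal Python sketch of KV-cache- and load-aware endpoint picking. The metric fields, weights, and `pick_endpoint` helper are hypothetical illustrations of the idea, not the EPP or Inference Scheduler API.

```python
# Hypothetical sketch of filter-and-score endpoint picking; field names and
# weights are illustrative, not the Endpoint Picker Protocol (EPP) API.
from dataclasses import dataclass

@dataclass
class PodMetrics:
    name: str
    queue_depth: int              # requests waiting on this vLLM instance
    kv_cache_utilization: float   # fraction of KV-cache blocks in use (0.0-1.0)
    prefix_hit_blocks: int        # prompt blocks already cached on this pod

def score(pod: PodMetrics, prompt_blocks: int) -> float:
    """Higher is better: prefer pods that already hold the prompt's KV blocks
    (KV-cache awareness) and have spare capacity (load awareness)."""
    cache_score = pod.prefix_hit_blocks / max(prompt_blocks, 1)
    load_score = 1.0 - pod.kv_cache_utilization
    queue_score = 1.0 / (1 + pod.queue_depth)
    return 0.5 * cache_score + 0.3 * load_score + 0.2 * queue_score

def pick_endpoint(pods: list[PodMetrics], prompt_blocks: int) -> PodMetrics:
    # Filter out saturated pods, then pick the best-scoring remaining one.
    candidates = [p for p in pods if p.kv_cache_utilization < 0.95] or pods
    return max(candidates, key=lambda p: score(p, prompt_blocks))

pods = [
    PodMetrics("decode-0", queue_depth=2, kv_cache_utilization=0.6, prefix_hit_blocks=40),
    PodMetrics("decode-1", queue_depth=0, kv_cache_utilization=0.3, prefix_hit_blocks=0),
]
print(pick_endpoint(pods, prompt_blocks=48).name)  # "decode-0": cache hit outweighs load
```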
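
The prefill/decode split can be pictured as a two-step request flow: compute the prompt’s KV cache on a prefill instance, then generate tokens on a decode instance that pulls that cache over the fast interconnect. The sketch below assumes two OpenAI-compatible vLLM services (the prefill-svc and decode-svc names and the model name are made up) and shows only the control flow, not llm-d’s actual routing components.

```python
# Minimal sketch of a prefill/decode disaggregation flow, assuming two
# OpenAI-compatible vLLM servers whose configured KV connectors (e.g. NIXL)
# handle the cache transfer. URLs and model name are illustrative only.
import requests

PREFILL_URL = "http://prefill-svc:8000/v1/completions"  # assumed service name
DECODE_URL = "http://decode-svc:8000/v1/completions"    # assumed service name

def generate(prompt: str, max_tokens: int = 128) -> str:
    # 1) Prefill step: request a single token so the prefill instance computes
    #    the prompt's KV cache and makes it available for transfer.
    prefill_req = {"model": "demo-model", "prompt": prompt, "max_tokens": 1}
    requests.post(PREFILL_URL, json=prefill_req, timeout=60).raise_for_status()

    # 2) Decode step: a separate instance reuses the transferred KV cache
    #    instead of recomputing the prompt, then generates the output tokens.
    decode_req = {"model": "demo-model", "prompt": prompt, "max_tokens": max_tokens}
    resp = requests.post(DECODE_URL, json=decode_req, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

print(generate("Explain disaggregated serving in one sentence."))
```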
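
The two caching schemes compose into a natural lookup order: try the zero-cost local (N/S) tier first, then fall back to the shared (E/W) tier with its global index before recomputing. The classes below are hypothetical and sketch only that lookup order; they are not the KVConnector interface.

```python
# Hypothetical two-tier KV-block lookup; these classes are illustrative and do
# not mirror vLLM's KVConnector API.
class LocalKVCache:
    """Independent (N/S) tier: KV blocks offloaded to this pod's host memory/disk."""
    def __init__(self):
        self.blocks = {}

    def get(self, block_hash):
        return self.blocks.get(block_hash)

    def put(self, block_hash, kv):
        self.blocks[block_hash] = kv

class SharedKVCache(LocalKVCache):
    """Shared (E/W) tier: a globally indexed store reachable from any instance."""

def lookup(block_hash, local: LocalKVCache, shared: SharedKVCache):
    # Check the zero-cost local tier first, then the shared tier; only a miss
    # in both forces the block to be recomputed by the model server.
    kv = local.get(block_hash)
    if kv is None:
        kv = shared.get(block_hash)
        if kv is not None:
            local.put(block_hash, kv)  # promote for future local hits
    return kv
```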
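
As a back-of-the-envelope illustration of the autoscaler’s (a)-(c) steps, the sketch below derives a prefill/decode replica mix from an assumed per-instance capacity measurement and an assumed recent traffic mix; every number here is made up for illustration.

```python
# Illustrative capacity-mix calculation; capacities and traffic numbers are
# assumptions, not measurements from any real deployment.
import math

# (a) measured per-instance capacity of each variant
PREFILL_TOKENS_PER_SEC = 20_000  # prompt tokens/s one prefill instance sustains
DECODE_TOKENS_PER_SEC = 2_000    # output tokens/s one decode instance sustains

# (c) recent traffic mix: request rate and average request "shape"
qps = 50
avg_prompt_tokens = 1_500
avg_output_tokens = 300

# (b) load function: demand per phase, expressed in tokens/s
prefill_demand = qps * avg_prompt_tokens   # 75,000 prompt tokens/s
decode_demand = qps * avg_output_tokens    # 15,000 output tokens/s

# optimal mix of instances to feed an HPA target
prefill_replicas = math.ceil(prefill_demand / PREFILL_TOKENS_PER_SEC)  # 4
decode_replicas = math.ceil(decode_demand / DECODE_TOKENS_PER_SEC)     # 8
print(f"prefill={prefill_replicas}, decode={decode_replicas}")
```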