A March 2026 post on the Azure Kubernetes Service blog describes a demonstration of NVIDIA Dynamo running on AKS, with a KV-cache-aware router delivering roughly 20 times faster Time-To-First-Token compared with standard routing. For teams serving large language models in production, that is not a marginal improvement. It is the kind of result that can change how an application feels to users and how much it costs to run.
What KV-cache-aware routing actually does
When a user sends a prompt to an LLM, the model first processes the whole prompt (the prefill phase) and stores the attention keys and values for each token in a Key-Value (KV) cache. Recomputing that cache for every request is expensive, especially for long prompts or multi-turn conversations that resend the earlier turns each time. Routing a new request to a server that already holds the KV cache for a matching prompt prefix avoids redundant work and cuts the time before the first token is returned.
A naive router sends traffic to the least loaded instance or round-robins across replicas. A KV-aware router considers which instance already has the context needed to answer. The effect is most visible in applications with repeated or extended prompts: coding assistants, customer support agents, document analysis tools and conversational interfaces.
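To make the contrast concrete, here is a minimal Python sketch of prefix-aware routing. It is a toy model, not Dynamo's algorithm or API: the `Worker` class, the `route` function, the four-token block size and the `load_weight` parameter are all hypothetical names and values chosen for illustration. The idea is to hash the prompt in fixed-size blocks, count how many leading blocks each worker already holds, and pick the worker with the least uncached work after a penalty for current load.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block (assumed; real systems typically use larger blocks)


def block_hashes(tokens, block=BLOCK):
    """Hash each full block chained with its predecessor, so equal hashes imply equal prefixes."""
    hashes, prev = [], 0
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        prev = hash((prev, tuple(tokens[i:i + block])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    name: str
    active: int = 0                           # in-flight requests (never decremented in this toy)
    cached: set = field(default_factory=set)  # block hashes this worker is assumed to hold

    def overlap(self, hashes):
        """Number of leading prompt blocks already cached on this worker."""
        n = 0
        for h in hashes:
            if h not in self.cached:
                break
            n += 1
        return n


def route(workers, tokens, load_weight=1.0):
    """Pick the worker with the fewest uncached blocks, penalised by its current load."""
    hashes = block_hashes(tokens)

    def cost(w):
        missing = len(hashes) - w.overlap(hashes)
        return missing + load_weight * w.active

    best = min(workers, key=cost)
    best.cached.update(hashes)  # assume the worker keeps these blocks after serving
    best.active += 1            # a real router would decrement this on completion
    return best
```

With two idle workers, a first 16-token prompt goes to the first worker, a follow-up that extends the same prompt returns to it, and an unrelated prompt goes to the other worker because the load penalty outweighs the tie in cache misses. A production router would also have to handle cache eviction and request completion, which this sketch ignores.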
Why AKS matters here
Microsoft and NVIDIA chose AKS for the demonstration because Kubernetes is where many enterprises already run their AI serving stacks. The result is not theoretical; it is designed to be reproducible on a platform that enterprise teams operate today. That also means the configuration, observability and scaling behaviours are familiar.
The integration points are important. Dynamo needs to know about pod health, GPU availability, cache state and request characteristics. AKS provides the orchestration layer; Dynamo provides the inference-aware routing. Together they suggest a pattern that could become standard for Kubernetes-based LLM serving.
Implications for cost, not just latency
Faster Time-To-First-Token is good for user experience, but the cost impact may be larger. Less redundant computation means less GPU time per request, so the same throughput needs fewer GPUs, or the same hardware serves more users. For high-volume AI applications, that changes the unit economics materially.
There are caveats. The 20x figure is from a specific demonstration, and real workloads vary. Applications with short, independent prompts will see less benefit than those with long shared contexts. The routing layer itself adds complexity and must be monitored. But the underlying principle is sound: inference serving should be as cache-aware as any other high-performance system.
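A rough way to see why the benefit depends on the workload is to count how many prompt tokens still need prefill once cached prefixes are reused. The Python sketch below does that. The function names and the request figures in the usage note are invented for illustration and are not taken from the demonstration. Token counts are also only a proxy for GPU time, since prefill cost does not scale exactly linearly with prompt length.

```python
def prefill_tokens_computed(prompt_tokens, cached_prefix_tokens):
    """Tokens that still need prefill after reusing a cached prefix."""
    reused = min(cached_prefix_tokens, prompt_tokens)
    return prompt_tokens - reused


def prefill_savings(requests):
    """Fraction of prompt tokens skipped across (prompt_tokens, cached_prefix_tokens) pairs."""
    total = sum(prompt for prompt, _ in requests)
    if total == 0:
        return 0.0
    computed = sum(prefill_tokens_computed(p, c) for p, c in requests)
    return 1 - computed / total
```

For example, ten hypothetical requests of 8,000 prompt tokens that each reuse a 7,500-token cached prefix skip about 94% of prefill tokens, while ten independent 200-token prompts with nothing cached skip none.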
What to try next
If you are running LLM inference on Kubernetes, this is a good time to ask:
- Are we routing requests intelligently, or just load-balancing?
- How much of our GPU time is spent recomputing KV caches we could have reused?
- Do our latency metrics distinguish Time-To-First-Token from overall response time?
- Can our serving layer expose the state needed for cache-aware routing?
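On the metrics question, one way to separate Time-To-First-Token from overall response time is to timestamp the first item that arrives on a streamed response. The sketch below assumes the serving client exposes the response as a Python iterable of tokens or chunks. `measure_stream` is a hypothetical helper, not part of any particular SDK, and the `clock` parameter exists only so the timing can be tested deterministically.

```python
import time


def measure_stream(stream, clock=time.perf_counter):
    """Consume a streamed response and report TTFT separately from total time."""
    start = clock()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            first = clock()  # timestamp of the first token or chunk
        count += 1
    end = clock()
    return {
        "ttft_s": None if first is None else first - start,
        "total_s": end - start,
        "tokens": count,
    }
```

Recording both numbers per request makes it possible to check whether a routing change improves the time to the first token, the generation time that follows it, or both.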
The bottom line
Cache-aware inference routing is moving from research idea to production pattern. The Microsoft and NVIDIA demonstration on AKS is a useful proof point for teams already invested in Kubernetes. It suggests that the next round of inference cost savings will come not from cheaper GPUs alone, but from smarter use of the ones already installed.