
Disaggregated inference: lowering token cost for agentic AI.


DigitalOcean announced the availability of NVIDIA Dynamo 1.0 in March 2026, claiming 67 per cent higher GPU throughput and 79 per cent lower latency. The headline numbers are impressive, but the more important shift is the architectural idea behind them: disaggregated inference. For organisations building agentic AI systems, this approach is worth understanding.

What disaggregated inference means

Traditional inference servers typically run both phases of a request on the same GPU: prefill, which processes the input prompt and builds the KV cache, and decode, which generates output tokens one by one. The two phases have different characteristics. Prefill is compute-intensive and finishes in a short burst. Decode is memory-bandwidth-bound and runs for longer.
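A rough way to see why the two phases differ is to compare how much arithmetic each one does per byte of model weights it reads. The sketch below is a simplified back-of-envelope model, not a description of how Dynamo or any real server measures this. It assumes a dense model costing about 2 FLOPs per parameter per token, 2-byte weights, and it ignores KV-cache and activation traffic. The hardware figures are illustrative round numbers, not a specific GPU.

```python
def arithmetic_intensity(tokens_per_weight_pass: int,
                         bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read (simplified dense model).

    Assumes ~2 FLOPs per parameter per token and that the weights are read
    once per pass. KV-cache and activation traffic are ignored.
    """
    flops_per_param = 2.0 * tokens_per_weight_pass
    return flops_per_param / bytes_per_param


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload against the hardware's roofline ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the limits meet
    return "compute" if intensity >= ridge else "memory-bandwidth"


# Illustrative hardware figures (assumptions, not a real GPU spec).
PEAK_FLOPS = 1.0e15      # 1,000 TFLOP/s
MEM_BANDWIDTH = 3.0e12   # 3 TB/s

# Prefill: a 4,096-token prompt is processed in one pass over the weights.
prefill_intensity = arithmetic_intensity(4096)
# Decode: one new token per sequence per step, with 8 sequences batched.
decode_intensity = arithmetic_intensity(8)

prefill_limit = bound_by(prefill_intensity, PEAK_FLOPS, MEM_BANDWIDTH)
decode_limit = bound_by(decode_intensity, PEAK_FLOPS, MEM_BANDWIDTH)
print(f"prefill: {prefill_limit}-bound, decode: {decode_limit}-bound")
```

Under these assumptions prefill lands well above the ridge point and decode well below it, which is the mismatch the rest of this briefing is about.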

Disaggregated inference separates these phases and places them on hardware suited to each. Prefill can run on compute-optimised configurations, while decode runs on memory-bandwidth-optimised configurations. The result is better overall utilisation, lower latency and higher throughput. Dynamo is NVIDIA’s open-source inference serving library built around this idea.

Why it suits agentic AI

Agentic AI systems do not usually send one prompt and wait for one answer. They loop: plan, call tools, receive observations, refine, generate. Each loop involves multiple model calls, often with growing context. That behaviour amplifies the inefficiency of mixed prefill-and-decode serving.
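To make the effect of growing context concrete, the sketch below counts how many prompt tokens go through prefill over one agent run. It is a toy model under stated assumptions: every loop step adds a fixed number of tokens to the context, and the `reuse_prefix` flag stands for a hypothetical server that keeps the KV cache for the shared prefix between calls. The function name and the numbers are illustrative only.

```python
def total_prefill_tokens(initial_context: int,
                         tokens_added_per_step: int,
                         steps: int,
                         reuse_prefix: bool) -> int:
    """Count prompt tokens processed by prefill across one agent run."""
    total = 0
    context = initial_context
    cached = 0  # tokens whose KV cache is assumed to be still available
    for _ in range(steps):
        # Without reuse the whole context is prefilled again on every call.
        total += (context - cached) if reuse_prefix else context
        if reuse_prefix:
            cached = context
        context += tokens_added_per_step  # tool output, observations, etc.
    return total


# A 10-step loop starting from 2,000 tokens and growing by 500 per step.
without_reuse = total_prefill_tokens(2000, 500, 10, reuse_prefix=False)
with_reuse = total_prefill_tokens(2000, 500, 10, reuse_prefix=True)
print(without_reuse, with_reuse)  # 42500 6500
```

In this toy case the run without reuse pushes roughly six and a half times as many tokens through prefill. Real ratios depend on how much of the prefix actually stays identical between calls.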

Disaggregated inference fits this pattern because it treats inference as a pipeline rather than a single operation. Agentic workloads generate many small, context-dependent requests. Separating the phases lets the infrastructure handle bursty prefill and sustained decode without either starving the other.

Reading the claims carefully

DigitalOcean’s 67 per cent throughput and 79 per cent latency improvements are specific to its setup and benchmarks. Real-world results depend on model size, batching behaviour, prompt length, concurrency and hardware configuration. The figures should be treated as directional, not guaranteed.
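It also helps to translate the headline percentages into the quantities a budget actually tracks. The sketch below is plain arithmetic, not a reproduction of DigitalOcean's benchmark. It assumes the hourly GPU price is unchanged and the throughput gain holds for your workload, neither of which is guaranteed. The 1,000 ms baseline is a made-up figure for illustration.

```python
def cost_per_token_change(throughput_gain: float) -> float:
    """Relative change in cost per token for a given throughput gain.

    Assumes the hourly GPU price is unchanged, so cost per token scales as
    1 / throughput. A gain of 0.67 means 67 per cent more tokens per GPU-hour.
    """
    return 1.0 / (1.0 + throughput_gain) - 1.0


def latency_after(baseline_ms: float, reduction: float) -> float:
    """Latency after a fractional reduction (0.79 means 79 per cent lower)."""
    return baseline_ms * (1.0 - reduction)


cost_change = cost_per_token_change(0.67)  # roughly -0.40
new_latency = latency_after(1000.0, 0.79)  # roughly 210 ms
print(f"cost per token: {cost_change:+.0%}, latency: {new_latency:.0f} ms")
```

A 67 per cent throughput gain therefore corresponds to about 40 per cent lower cost per token, not 67 per cent, and only if the gain carries over to your own traffic.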

That said, the direction is consistent with what we are seeing elsewhere. As inference becomes the dominant AI cost, optimising how requests are served becomes more valuable than squeezing the last point of accuracy from a model.

Implications for smaller teams

DigitalOcean’s positioning matters because it brings disaggregated inference within reach of teams that do not operate their own GPU clusters. Managed platforms can offer the architectural benefits without requiring customers to tune low-level inference engines. That lowers the barrier to entry for start-ups and mid-sized businesses running agentic applications.

The trade-off is opacity. Managed services hide the knobs, which is fine until a workload behaves unexpectedly. Teams should still understand enough about prefill, decode and KV caching to diagnose latency spikes and cost anomalies.

What to do next

If you are running or planning agentic AI workloads, consider:

  • Are your inference servers treating all requests the same, or are they aware of prefill and decode differences?
  • Do your latency metrics capture per-phase behaviour, or only end-to-end response time?
  • Is your infrastructure cost scaling linearly with agent loops, or can it reuse context, such as cached KV state, across calls?
  • Would a managed disaggregated inference platform reduce operational load without sacrificing visibility?
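For the question about per-phase metrics, one common approach is to record when each streamed token arrives and derive two numbers: time to first token, which mostly reflects queueing plus prefill, and mean time per output token, which reflects decode. The sketch below assumes you can capture those timestamps on the client side. The `PhaseLatency` type and `phase_latency` function are hypothetical names, not part of any particular serving library.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PhaseLatency:
    time_to_first_token: float         # seconds; proxy for queueing + prefill
    mean_time_per_output_token: float  # seconds; proxy for decode speed
    end_to_end: float                  # seconds; what most dashboards show


def phase_latency(request_sent: float,
                  token_times: Sequence[float]) -> PhaseLatency:
    """Split end-to-end latency into prefill-like and decode-like parts.

    `token_times` holds the arrival time of each streamed output token, on
    the same clock as `request_sent`.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    gaps = len(token_times) - 1
    decode_span = token_times[-1] - token_times[0]
    per_token = decode_span / gaps if gaps else 0.0
    return PhaseLatency(ttft, per_token, token_times[-1] - request_sent)


# Example: first token after 0.5 s, then one token every 0.25 s.
example = phase_latency(0.0, [0.5, 0.75, 1.0, 1.25])
print(example)
```

Tracking the two numbers separately makes it easier to tell whether a latency spike comes from long prompts (first token slows down) or from decode contention (per-token time slows down).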

The bottom line

Disaggregated inference is moving from niche research to commercial availability. DigitalOcean’s rollout of NVIDIA Dynamo 1.0 is one signal among many that the economics of inference serving are changing. For agentic AI, where many small model calls add up quickly, that change could be the difference between a profitable product and an expensive prototype.

