From KV Cache to AI Memory System: The Evolution of Large Language Model Inference Architecture

Abstract This article addresses a seemingly dispersed but fundamentally unified question: Why has systems innovation around large language model inference over the past two years increasingly looked less like “optimizing a neural network” and more like “designing a memory system”?
If we characterize pre-2023 LLM engineering as “racing for FLOPS, Tensor Core utilization, and training throughput,” then the 2024–2026 narrative of inference has clearly shifted to another main thread:
KV cache, TTFT/TPOT, continuous batching, prefix reuse, prefill/decode disaggregation, hierarchical caches, CXL memory pools. This is not a coincidence. Autoregressive inference in Transformers naturally turns “history” into a continuously growing state that must be accessed repeatedly; and RLHF/RLVR, reasoning, agent workflows, long contexts, and multi-turn interactions have pushed this KV cache state right into the center of system design. OpenRLHF even points out that in PPO-style RLHF/RLVR, the inference phase often accounts for over 90% of total runtime; meanwhile, systems like Mooncake, LMCache, Strata, KVFlow, Beluga, and TraCT have elevated the KV cache to “first-class citizen” status from different perspectives. 1
Looking back today, the real change is not that “one kernel got faster,” but that the focus of inference systems has shifted from computation graphs to memory graphs. This article connects Mac/UMA, NVIDIA/NVLink, FlashAttention, vLLM, RL rollout, Mooncake, LMCache, CXL, and recent 2025–2026 research papers into a coherent narrative. 2
1. The Nature of LLM Inference: Prefill vs Decoding
Modern mainstream LLMs mostly follow the causal decoder-only Transformer route pioneered by GPT: given the context up to the current point, they output a probability distribution over the next token. This fundamentally differs from bidirectional encoder routes like BERT: GPT-style models naturally support autoregressive generation token by token, while BERT's pre-training objective is masked language modeling, which inherently depends on both left and right context for representation calculation. In other words, KV cache is naturally a first-class mechanism only for causal decoders, not for bidirectional encoders like BERT. 3
When you decompose the inference process by execution phases, you'll find that LLM serving is not a uniform workload, but rather two phases with very different physical characteristics stacked together.
The first phase is prefill. It reads the entire input prompt, performs a single forward pass over all input tokens, and generates K/V states for each layer to be reused in subsequent steps. This phase has high parallelism and a large fraction of matrix multiplications, and can fully utilize the GPU's Tensor Cores, so it resembles the familiar “training forward pass” shape. Both Sarathi-Serve and DistServe explicitly characterize prefill as a high-latency phase that can fully saturate GPU compute resources. 4
The second phase is decoding. The model generates only one token at a time, appends it to the history, then generates the next token. Each step in this phase has relatively small operator size, so the GPU cannot easily be filled with pure compute; the real cost is that you must read the entire historical KV cache back to participate in attention calculation. Sarathi-Serve directly points out that although decode iterations have low single-iteration latency, they also have low compute utilization, so the system must rely on batching to achieve good throughput. 4
This is the first key to understanding all subsequent system design: prefill is more compute-bound, decode is more memory-bound. Of course, this is not absolute—it depends on the workload. In long-input scenarios like document summarization, RAG, and large code analysis, prefill can be extremely heavy; while in real-time chat, long-output reasoning, RL rollout, and multi-step agent execution, decode and KV cache problems become much more prominent. DistServe explicitly distinguishes different workloads such as chatbot, programming assistant, and document summary in its experiments, and emphasizes that TTFT and TPOT have different objectives. 5
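A back-of-envelope sketch makes the asymmetry concrete. All numbers below (a 7B-class model, 1 PFLOP/s of compute, 2 TB/s of HBM bandwidth) are illustrative assumptions, not measurements of any specific GPU:

```python
# Back-of-envelope: why prefill is compute-bound and decode memory-bound.
# All numbers are illustrative assumptions for a 7B-class decoder model.

def phase_costs(n_params=7e9, prompt_len=2048, out_len=256,
                bytes_per_param=2, flops=1e15, hbm_bw=2e12):
    """Rough time estimates (seconds) for each phase on a hypothetical GPU."""
    # Prefill: ~2 * params * tokens FLOPs, processed in one parallel pass.
    prefill_s = 2 * n_params * prompt_len / flops
    # Decode: every step must at least re-read the weights from HBM,
    # so a lower bound is weight-bytes / bandwidth per generated token.
    decode_s = out_len * (n_params * bytes_per_param) / hbm_bw
    return prefill_s, decode_s

prefill_s, decode_s = phase_costs()
print(f"prefill ~{prefill_s * 1e3:.0f} ms, decode ~{decode_s * 1e3:.0f} ms")
```

Even with these crude assumptions, a short-prompt, long-output request spends far longer in decode than in prefill, which is exactly the regime the rest of this article is concerned with.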
Therefore, whenever someone claims “LLM inference is a compute problem” or “LLM inference is a memory problem,” you should first ask: are you talking about prefill or decode? And what kind of workload is this?
2. KV Cache: The True State Variable of Transformer Inference
Looking only at the algorithmic formula, self-attention seems to be just
Attention(Q, K, V) = softmax(QK^T / √d_k) V.
But in an inference system, the most important byproduct of this formula isn't the output—it's how historical state is stored.
In a causal decoder, every time a new token arrives, the model computes the corresponding K and V for this token at each layer and saves them; then when generating a new token, the current token's Q only needs to interact with all historical K/V. Thus, KV cache essentially stores: the Key and Value representations for every layer and every historical token. It doesn't store Q because Q only belongs to “the current step”; it also doesn't store the full attention matrix because that changes with the current Q and is even larger. This mechanism relies on the causal mask: history isn't rewritten by the future, so K/V of past tokens can be safely reused. 6
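The bookkeeping above can be sketched in a few lines. The single-head toy below (random placeholder weights, one layer, no multi-head structure) only illustrates the mechanism: K/V are appended to the cache once per token, while Q exists only for the current step:

```python
import numpy as np

# Minimal single-head causal attention with a KV cache (illustrative only:
# one layer, random placeholder weights, no multi-head structure).
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x, k_cache, v_cache):
    """Append this token's K/V to the cache; Q exists only for this step."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # (t, d): the whole history
    attn = softmax(q @ K.T / np.sqrt(d))          # a 1 x t query, never t x t
    return attn @ V

k_cache, v_cache = [], []
for _ in range(5):                                # five decode steps
    out = decode_step(rng.standard_normal(d), k_cache, v_cache)
assert len(k_cache) == 5                          # cache grows by one per token
```

Note what is and isn't stored: the cache holds only K and V per historical token; Q is recomputed each step, and no attention matrix is ever materialized across steps.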
The significance of this at the system level is enormous.
Without KV cache, when generating the t-th step you would need to recompute the forward pass for the previous t-1 tokens, which means recomputing the entire history; with KV cache, you only need to do an incremental forward pass for the new token, but you still need to read the historical K/V for attention. Thus, the problem changes from “repeatedly recomputing history” to “repeatedly reading history.” Shazeer put it very directly in the Multi-Query Attention paper: the core reason incremental decoding is slow is the memory-bandwidth cost of repeatedly loading large key and value tensors. 7
Therefore, KV cache is not just an implementation detail—it's the true state variable of Transformer autoregressive inference. For a model with L layers, H_kv KV heads, dimension d per head, context length T, and batch size B, KV cache size grows roughly as
O(L * H_kv * d * T * B).
Once you stretch T, increase B, or split a single request into multiple trajectories, this state quickly expands. Papers from 2025 like CAKE, R-KV, and FlexiCache essentially all start by acknowledging this fact: KV cache has already become so large that it must be compressed, layered, evicted, and predicted. 8
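Plugging typical numbers into the growth formula shows how quickly this state dominates. The configuration below is an assumption roughly shaped like a Llama-2-7B-class model (32 layers, 32 KV heads, head dimension 128, fp16); the factor of 2 accounts for storing both K and V:

```python
# KV cache size per the growth formula L * H_kv * d * T * B,
# times 2 for K and V, times bytes per element.
# The config below is an assumption shaped like a Llama-2-7B-class model.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

gib = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                     seq_len=4096, batch=8, dtype_bytes=2) / 2**30
print(f"~{gib:.0f} GiB of KV cache")   # grows linearly in both T and B
```

At 4k context and batch 8 this already rivals the fp16 weights of the model itself, which is why capacity alone forces compression, eviction, and tiering.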
There's also a frequently overlooked difference: model weights are shared and static, while KV cache grows per request and is not shared. The former is “how knowledge is stored in the model”; the latter is more like “how each inference stores its thinking process.” This directly creates a system-level watershed: for the same GPU memory footprint, weights cost primarily capacity, while KV costs capacity + bandwidth + latency. This is why I'll argue later that in many future scenarios, KV cache compression may have greater strategic value than parameter compression. 9
3. Why the Bottleneck Shifts from Compute to Memory
Many debates about LLM inference fail because they conflate different workloads. A more precise statement than “inference is increasingly skewed toward decode” is:
As workloads move toward multi-turn, long-history, long-output, long CoT, and reusable context, the system bottleneck gradually shifts from prefill computation to KV cache access during the decode phase.
This is not the same as “prompts are getting longer.”
Consider the simplest chat scenario: if the user input is short, the system prompt is stable, but the output is long, then prefill happens once while decode repeats dozens or hundreds of times; in this case, the cumulative cost of decode can easily exceed prefill. Sarathi-Serve, DistServe, and OpenRLHF all base their system designs on this observation that prefill and decode have fundamentally different characteristics. 4
But this doesn't mean all workloads are this way. In tasks like RAG, long document summarization, and code repository analysis, the input may have thousands to hundreds of thousands of tokens but the output is short; the first major bottleneck for this workload is TTFT, because the model must fully prefill the long input first. The theoretical paper on prefix reuse scheduling from LinkedIn explicitly identifies “long-prompt, short-output” as a prefill-dominant regime, and points out that prefix reuse is critical for TTFT in this scenario. 10
The really interesting case is agent / reasoning / RL rollout. In these scenarios, the “context” logically grows longer: the model thinks, calls tools, reads back tool outputs, and continues thinking; history keeps getting appended. But if the system can effectively reuse KV cache, what grows isn't “the prefill computation per round,” but rather “the amount of historical state that needs to be maintained and read.” In other words, long history doesn't automatically mean heavier prefill; it can mean larger KV cache and higher decode memory pressure. OpenRLHF explicitly identifies long CoT as a key bottleneck for training efficiency, and integrates vLLM into the rollout engine to alleviate the inference burden from long reasoning chains. R-KV goes further, taking “KV cache explosion from extremely long outputs” as its starting point. 1
This is why I say: “Long prompt” and “long history” are not the same thing.
- A long prompt, if you have to feed it in again every time, is a prefill problem.
- A long history, if it's retained as KV cache, is a decode / memory problem.
Most innovations in modern APIs and serving systems revolve precisely around this distinction: prompt caching, prefix reuse, PD disaggregation, and hierarchical cache all essentially try to convert “recomputed long prompts” into “reusable long history.” 11
4. RL / Agent: Why Post-Training Exacerbates the Decoding Bottleneck
If you only think about the SFT era, training still centers on standard forward/backward: the dataset is given, the model consumes it, computes loss, and backpropagates gradients. That's the typical era where “training is core, inference is just auxiliary.”
But post-training, especially RLHF, RLVR, reasoning-oriented RL, and agentic RL, is no longer like this. A typical PPO/GRPO-style cycle today looks more like:
- Given a prompt;
- The model rolls out one or more answer trajectories;
- Reward model, verifier, tool executor, or environment provides feedback;
- Update parameters based on these rollouts.
In this pipeline, training doesn't directly optimize over fixed samples—it first generates samples, then trains. Thus, the system's center of gravity naturally tilts toward rollout. The OpenRLHF paper makes a strong statement: in PPO-style RLHF/RLVR, inference phase often accounts for over 90% of total runtime, because the model needs to generate thousands of tokens at each inference step. 1
RLVR (Reinforcement Learning with Verifiable Rewards), as a next-generation agentic RL direction, further amplifies this problem: the model constantly reflects on and revises its trajectory during rollout, which usually produces longer trajectories than traditional RLHF, so the KV cache must retain even more state. Recent engineering experience shows that in complex agent scenarios like terminal environments, rollout trajectory lengths can easily reach thousands of tokens, and each policy update requires sampling multiple trajectories, so the memory pressure on the KV cache is an order of magnitude higher than in traditional inference scenarios. 40
Once you accept this, many otherwise puzzling system designs become straightforward. ECHO-2 explicitly proposes separating centralized learning from distributed rollout inference, allowing rollout generation to spill out of data center GPU clusters; AgentRL proposes a fully-asynchronous generation-training pipeline to support multi-turn, multi-task agent RL; OpenRLHF explicitly separates the rollout engine from the actor/training engine, and emphasizes that asynchronous data flow is especially important in the era of long CoT. 12
Why is rollout particularly “hard”? Because it simultaneously has several bad characteristics:
First, it's decode-heavy. It doesn't consume the entire input in one go—it moves forward token-by-token. 1
Second, its length is highly irregular. Within the same batch of prompts, some trajectories finish in a few steps while others unfold into very long chains-of-thought; in an agent setting, different tool calls further cause trajectory lengths to diverge. Both AgentRL and OpenRLHF treat asynchronous pipelines as a necessary design choice, not a nice-to-have. 13
Third, it naturally amplifies KV cache.
If a prompt isn't just sampling one output trajectory but k trajectories, and each trajectory keeps growing, and you need to retain verifier, tool use, and multi-turn state in between, then the system doesn't just get k more computations—it gets k continuously expanding historical states. R-KV treats the verbosity of reasoning output as a core problem precisely because “output length” in reasoning models directly maps to KV cache cost. 9
My earlier conjecture—“post-training needs Mac probably because rollout takes more time than training”—was directionally correct, but a more accurate statement would be:
It's not simply that rollout takes longer—it's that rollout transforms the problem from backward-dominated to decode/KV-dominated.
This is why more and more post-training frameworks are embedding serving engines like vLLM and SGLang into the training framework: OpenRLHF explicitly identifies vLLM as key infrastructure for long CoT RLHF/RLVR; their argument isn't “let's just use an inference engine for generation”—it's that “inference itself is already the core bottleneck for training efficiency.” 1
5. Why Mac / UMA Is Gaining Relevance Again
When people talk about running large models on Mac, it's often misunderstood as “Apple GPU can compete with NVIDIA on training throughput.” This is almost never the point. The core reason Mac is becoming important again isn't that it wins on raw compute—it's that it took a different path in memory system organization.
Apple repeatedly emphasizes unified memory architecture in official materials: the M3 family describes it as “a single memory pool where all technologies on the chip can access the same data without copying between multiple memory pools”; the M2 Ultra explicitly provides 800GB/s system memory bandwidth and supports 192GB unified memory; by 2025, Apple M3 Ultra further offers up to 512GB unified memory and over 800GB/s memory bandwidth; and in March 2026, the M5 Max reaches 614GB/s unified memory bandwidth in official specifications. Apple itself even markets “running extremely large LLMs directly in device memory” as one of the key AI selling points for Mac Studio. 14
What does this mean for LLM inference? It's not “Mac's GPU is faster than H100”—it's:
- Model weights, KV cache, and CPU-side orchestration can all share the same large memory pool;
- Many CPU/GPU collaboration scenarios no longer require explicit copying;
- When you have enough capacity, a single machine can keep a larger working set in a unified address space.
Apple's description of MLX is also straightforward: MLX leverages the unified memory architecture of Apple silicon, so no data copying is needed when CPU and GPU operate on data. 15
This directly targets the weak spot of decode-heavy / rollout-heavy workloads. Because the real difficulty of decode isn't doing a huge GEMM—it's how to cheaply read and maintain a large block of historical state. If your model, KV cache, tool runtime, and CPU/GPU orchestration all share a large memory pool, the “friction loss” at the system level becomes significantly lower. Apple's messaging has consistently focused on high bandwidth, low latency, and single-pool sharing, rather than discrete VRAM + PCIe copying. 14
But this doesn't mean “Mac is universally better than NVIDIA.” NVIDIA still dominates with overwhelming ecosystem and compute advantage for prefill, training, and high-concurrency serving. An H100-class product already has HBM bandwidth in the 3TB/s range, and the Grace Hopper GH200 further provides a CPU+GPU coherent memory model via NVLink-C2C's 900GB/s coherent interface. NVIDIA's official statement is that GH200 provides hardware-level memory coherency via NVLink-C2C. 16
In other words, NVIDIA hasn't missed this path—it's approaching it along another, more scalable route: from traditional “discrete GPU + PCIe” to “NVLink interconnect” to “Grace Hopper coherent memory model” to today's series of papers around CXL, shared memory pools, and PD disaggregation. Apple's contribution is more that it made many people experience for the first time that unified memory isn't just a mobile selling point—it's a structural advantage for LLM inference. 17
So if you ask “why would someone use Mac for post-training,” I'd offer a more cautious judgment:
At cluster-scale post-training, NVIDIA still dominates; but for local experiments, low-concurrency rollout, long-context inference, agent prototyping, and long CoT debugging—these memory-centric scenarios—Mac's UMA really does hit the key bottleneck.
This isn't “Mac is stronger than NVIDIA”—it's that when the workload shifts from FLOPS to working set, memory architecture determines the user experience. 18
6. Why FlashAttention Cannot Solve Decoding
FlashAttention is one of the most important attention kernel innovations of recent years, but it's often misused as evidence that “attention has been fully optimized.” In reality, FlashAttention solves a core pain point of prefill, not the root problem of decode.
The original FlashAttention paper states the problem very clearly: the key bottleneck for standard attention isn't theoretical FLOPs, it's IO between HBM and on-chip SRAM. It avoids materializing the large intermediate attention matrix through tiling, thereby significantly reducing HBM reads and writes. This optimization is extremely effective for long-sequence prefill, because prefill faces a large-scale N x N attention structure. 19
But decode is a completely different shape.
In the t-th step of decoding, there's only one new token, so the sequence length of Q is exactly 1, while the length of K/V is the historical length t-1. In this case, attention is more like a 1 x N query process, rather than requiring explicit construction of an N x N intermediate matrix. In other words, the problem for decode isn't “the intermediate attention matrix is too big”; it's that the entire historical K/V must be read once per step. Shazeer's MQA paper directly attributes the bottleneck of incremental decoding to repeated loading of large key and value tensors. 19
This is why you see this very counterintuitive conclusion:
- FlashAttention is often critical for prefill;
- But for decode, it mostly adds incremental improvements to local kernel efficiency without changing the fundamental bottleneck.
Because decode truly can't avoid O(N) KV reads. You can reduce kernel overhead and better fuse some computations, but as long as the model still uses standard causal attention, historical state must be accessed. The reason Sarathi-Serve emphasizes batching, chunked-prefill, and stall-free scheduling instead of betting everything on the attention kernel is this reality: the system bottleneck for decode is no longer the individual operator itself. 4
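A quick estimate shows why kernel speedups alone can't rescue decode. Assuming a 32-layer, 32-KV-head, fp16 model with 32k tokens of history and roughly 3 TB/s of HBM bandwidth (all hypothetical numbers), the latency floor set by KV reads alone is:

```python
# Lower bound on per-token decode latency from KV reads alone: every step
# must stream the full historical K/V through HBM, regardless of how fast
# the attention kernel is. Model shape and bandwidth are assumptions.

def kv_read_ms(layers=32, kv_heads=32, head_dim=128, history=32768,
               dtype_bytes=2, hbm_bw=3e12):
    kv_bytes = 2 * layers * kv_heads * head_dim * history * dtype_bytes
    return kv_bytes / hbm_bw * 1e3

# At 32k of history this is already several milliseconds per token
# before any computation happens at all.
print(f"{kv_read_ms():.2f} ms per decode step just to read KV")
```

Under these assumptions the floor is around 5.7 ms per token; a 10x faster kernel changes the compute term, not this bandwidth term.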
In March 2026, FlashAttention-4 further optimized heterogeneous computing and on-chip memory management, continuing to deliver significant improvements for long-sequence prefill, but the core observation still holds: for incremental decoding, the bottleneck is fundamentally the repeated reading of historical KV, not computation scheduling within the kernel. Even if kernel efficiency improves by an order of magnitude, as long as you need to read the full history every step, memory bandwidth will remain the core constraint.
This is the second key to understanding all subsequent system design: When the bottleneck is “must read history,” the optimization direction sinks from kernel to memory layout, cache reuse, scheduling, and architecture.
7. Four Layers of Inference Optimization: Kernel, Engine, Model, Hardware
To avoid mixing all technologies together in one pot, I prefer to divide LLM inference optimization into four layers:
Layer 1: Kernel layer
FlashAttention is the classic example here. It asks: how can a single attention / decode kernel move less data between HBM and keep more data on on-chip SRAM, to execute more efficiently? This layer is important, but it only answers “how to compute this step faster.” 19
Layer 2: Engine layer
Classic examples are Orca, vLLM, SGLang, Sarathi-Serve. This layer cares about: how are requests dynamically scheduled, how is KV cache paged/reused, how do prefill and decode cooperate, how do you improve GPU utilization and goodput? Orca changed request-level scheduling to iteration-level scheduling; vLLM uses PagedAttention to solve KV memory fragmentation; SGLang uses RadixAttention for prefix reuse; Sarathi-Serve balances throughput and latency through chunked-prefill. 20
Layer 3: Model layer
Classic examples are MQA / GQA, and the more radical Mamba / RWKV direction. This layer asks: if decode is slow because we need to read too much K/V, can we reduce or even eliminate K/V? MQA shares K/V directly; GQA is an engineering compromise between MHA and MQA; Mamba/RWKV try to compress history into a recurrent state. 7
Layer 4: Hardware layer
Classic examples are Apple UMA, Grace Hopper coherent memory, CXL memory pool. This layer asks: since history must be read, can we make “reading” more like accessing local memory instead of frequently copying across buses? Apple went with unified memory; NVIDIA brought coherent memory to CPU+GPU with GH200; Beluga, TraCT, and CXL-SpecKV attempt to introduce larger shared memory pools for cluster-scale inference. 14
If you only focus on the first layer, you'll think the problem is “attention isn't fast enough.” If you see the second layer, you'll realize the problem is “GPUs are often waiting and not fully utilized.” If you see the third layer, you'll discover that the model architecture itself determines the system bandwidth pressure. If you see the fourth layer, you'll understand: today many so-called AI infrastructure innovations are essentially memory system papers, not neural network papers.
8. Continuous Batching: From Request-Level to Token-Level Scheduling
A fundamental contradiction in LLM serving is that each decode step has little computation, so without batching it's hard to fill the GPU; but once you batch, sequence lengths and completion times are highly inconsistent, and traditional static batching creates a lot of waste.
Orca is one of the origins of this line. It proposed iteration-level scheduling: instead of running a whole batch of requests to completion before switching to the next batch, the system schedules at the iteration granularity—each time, it only lets the engine execute a single iteration, then immediately allows new requests to enter and completed requests to exit. The Orca paper clearly states the benefit: for Transformer-based generative models, this iteration-level scheduling is significantly better than traditional request-granularity serving. 20
vLLM further engineered this idea. The PagedAttention paper starts from the observation that KV cache is large and dynamically growing, so traditional contiguous allocation wastes GPU memory and makes high-throughput serving difficult. PagedAttention manages KV cache like operating-system paging, enabling more flexible request assembly and memory reuse. 21
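The core bookkeeping can be sketched as a toy page table. The block size, pool size, and class names below are arbitrary illustrative choices, not vLLM's actual implementation:

```python
# A toy page-table view of PagedAttention-style KV management: logical token
# positions map to fixed-size physical blocks, so a sequence never needs
# contiguous memory. Block and pool sizes are arbitrary assumptions.

BLOCK = 16                            # tokens per physical KV block

class KVPagePool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}              # request id -> list of physical blocks

    def append_token(self, req_id, pos):
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.tables.setdefault(req_id, [])
        if pos % BLOCK == 0:
            table.append(self.free.pop())
        return table[pos // BLOCK], pos % BLOCK   # (physical block, offset)

    def release(self, req_id):
        self.free.extend(self.tables.pop(req_id)) # blocks return to the pool

pool = KVPagePool(num_blocks=64)
for pos in range(40):                 # a 40-token sequence: ceil(40/16) = 3 blocks
    pool.append_token("req-A", pos)
assert len(pool.tables["req-A"]) == 3
pool.release("req-A")                 # freed blocks are immediately reusable
```

The payoff is exactly the operating-system analogy: internal fragmentation is bounded by one block per sequence, and freed blocks can be handed to any waiting request without compaction.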
Once KV cache can be managed via paging, continuous batching truly becomes feasible. Its core isn't “accumulate more requests for one batch”—it's: rebuild the batch every decode step. Whoever finishes exits, whoever arrives new inserts as soon as possible; the batch lifetime shrinks from “the full generation of one request” to “one token step.” This is fundamentally different from static batching not in batch size, but in the finer granularity of batch reorganization. Iteration-level scheduling in Orca, continuous batching in vLLM, and in-flight batching in TensorRT-LLM all essentially do this. 20
Why does this significantly improve GPU utilization? Because single-request decode computation is so small that you need to pack the “next token” from multiple requests together to keep the hardware busy; static batching gets held up by the longest request, and the “empty seats” left after shorter requests complete can't be immediately refilled. Continuous batching changes the serial model of waiting for the longest request into a token-level pipelined model. Sarathi-Serve further points out that decode batching is particularly effective for overall throughput, but mixing prefill and decode causes stalls, so chunked-prefill is needed to reduce this interference. 4
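A toy simulation illustrates the difference in scheduling granularity. The request lengths and batch capacity below are arbitrary; the point is only that slots are refilled at every iteration rather than at batch boundaries:

```python
# Toy continuous batching: the batch is rebuilt at every decode iteration;
# finished requests leave immediately and waiting requests join immediately.
# Request lengths and batch capacity are arbitrary assumptions.
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of (req_id, tokens_to_generate). Returns iteration count."""
    waiting = deque(requests)
    running = {}                          # req_id -> tokens still needed
    iterations = 0
    while waiting or running:
        # Admit new requests into free slots before every iteration.
        while waiting and len(running) < max_batch:
            rid, need = waiting.popleft()
            running[rid] = need
        # One decode iteration: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]          # the slot frees this very iteration
        iterations += 1
    return iterations

# Static batching would run wave 1 (a,b,c,d) for max(10,2,2,2)=10 iterations,
# then wave 2 (e,f) for 3 more: 13 total. Continuous batching backfills the
# freed slots and finishes the same work in 10.
print(continuous_batching([("a", 10), ("b", 2), ("c", 2), ("d", 2),
                           ("e", 3), ("f", 3)]))
```

The simulation is deliberately naive (no prefill cost, no KV capacity limit), but it captures why the batch's lifetime shrinking to a single token step recovers the “empty seats.”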
Of course, continuous batching doesn't mean “bigger is always better.” Revisiting SLO and Goodput Metrics in LLM Serving, DistServe, and a series of subsequent works all remind us: the goal of online serving isn't simply maximum throughput—it's goodput that satisfies SLO constraints like TTFT / TPOT / tail latency. If the batch gets too large, throughput might increase slightly, but TPOT and P99 latency can degrade, and ultimately goodput actually decreases. 22
Therefore, the real goal of continuous batching isn't “make the batch infinitely large”—it's:
Within the given constraints of KV bandwidth, tail latency, and SLO, let the GPU do useful token computation every time slice.
This is already very much like online scheduling in an operating system, not like “fixed batch training” in classical deep learning.
9. How Model Structure Directly Determines System Limits: MQA / GQA
If you understand that the core cost of decode is “repeatedly reading historical K/V,” then the significance of MQA / GQA becomes very clear.
When Shazeer proposed MQA in 2019, the problem statement was already the common language of the entire industry today: incremental decoding is slow because of the memory-bandwidth cost from repeated loading of large keys and values tensors. MQA's solution is simple and brutal: let all query heads share the same set of K/V heads. This doesn't reduce the number of query heads, but it significantly reduces the size of K/V, thereby lowering bandwidth pressure during decode. 7
GQA is an engineering compromise along this line. Google's GQA paper clearly points out that MQA gives fast inference but may hurt quality; so they proposed grouped-query attention, taking a middle ground between “each head has independent K/V” and “all heads share K/V.” The paper's conclusion is also clear: well-trained GQA achieves quality close to MHA while keeping speed close to MQA. 23
The importance of this at the system level goes far beyond “a variant of attention mechanism.”
Because when you reduce H_kv, you're not just reducing KV cache occupancy—you're also reducing the amount of data that must be read back per decode step. In other words:
- MHA pushes the system toward the memory wall;
- MQA/GQA actively reduces the per-request memory footprint;
- This directly increases the achievable optimal batch size and improves the TPOT/throughput curve.
This is a classic example of “model design serving system efficiency.” It's not just for academic elegance—it trades system bandwidth through structural design. 7
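The arithmetic behind this trade is simple enough to sketch. The shapes below are assumptions loosely modeled on a 70B-class config; reducing the number of KV heads shrinks cache size and per-step read traffic by the same factor:

```python
# KV bytes per token under MHA vs GQA vs MQA: cutting kv_heads reduces both
# cache capacity and per-step decode read traffic proportionally.
# Shapes are assumptions loosely modeled on a 70B-class config.

def kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(kv_heads=64)   # every query head has its own K/V
gqa = kv_bytes_per_token(kv_heads=8)    # 8 query heads share each K/V head
mqa = kv_bytes_per_token(kv_heads=1)    # all query heads share one K/V head
print(mha // gqa, mha // mqa)           # → 8 64
```

An 8x reduction in KV bytes means roughly 8x more concurrent sequences fit in the same memory budget, which is where the batch-size and throughput gains come from.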
From this perspective, the common separation of “model architecture” and “system optimization” into two separate worlds is already outdated. GQA isn't a pure model innovation disconnected from deployment—it's explicitly responding to the physical bottleneck of the serving era. When you look at later approaches like CAKE, R-KV, and FlexiCache, you'll see this line going even further: it's not just about making K/V smaller—it's about making K/V exist more intelligently. 8
From late 2024 to early 2025, this evolutionary path has taken several more critical steps:
MLA: Low-Rank Compressed KV in DeepSeek Models
Multi-Head Latent Attention (MLA), introduced by DeepSeek-V2 and carried forward through the V3 line, pushes the idea of MQA/GQA to a new level. Instead of simply sharing K/V heads, it applies low-rank projection compression to K/V: it projects the original K/V into a lower-dimensional latent space for storage, and recovers it when needed. This approach maintains model quality while substantially compressing KV cache volume, significantly reducing bandwidth pressure during the decode phase. The significance of MLA is that it proves model architecture can actively serve system-level memory efficiency through representation compression, rather than passively waiting for system-level optimization.
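The low-rank principle can be sketched in a few lines of NumPy. This is only an illustration with arbitrary dimensions and random weights, not DeepSeek's actual architecture (which, among other details, treats positional encoding separately):

```python
import numpy as np

# Sketch of the low-rank idea behind MLA-style KV compression: store a small
# latent c = x @ W_down per token, and reconstruct K/V on demand via
# up-projections. Dimensions and weights are illustrative assumptions.
rng = np.random.default_rng(0)
d_model, d_latent, d_head = 512, 64, 128

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_uk = rng.standard_normal((d_latent, d_head))    # latent -> K
W_uv = rng.standard_normal((d_latent, d_head))    # latent -> V

x = rng.standard_normal((1000, d_model))          # 1000 tokens of history
latent_cache = x @ W_down                         # the ONLY thing stored

K = latent_cache @ W_uk                           # reconstructed when needed
V = latent_cache @ W_uv

full = 1000 * 2 * d_head                          # floats if K and V were cached
compressed = latent_cache.size                    # floats actually cached
print(f"compression: {full / compressed:.1f}x")
```

With these toy dimensions the cache shrinks 4x; the attainable ratio in practice is a design choice governed by how small the latent can be made without hurting quality.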
Lightning Indexer: Incremental KV Optimization in DeepSeek V3.2
DeepSeek V3.2 further introduced Lightning Indexer, which optimizes incremental inference specifically for MLA. It changes the index structure of KV cache from global compression to incremental update—new incoming token KV can be written directly to the compressed cache without recompressing the entire sequence, thus maintaining compression ratio without adding latency overhead. This demonstrates again that model structure and memory system design must co-evolve—compression gives capacity benefits, but the engineering problem of incremental update must be solved together for it to be practical.
Attention Residuals: Hierarchical KV Retention by Kimi / MiniMax
Both Kimi and MiniMax have recently adopted similar Attention Residuals ideas in their inference optimizations: instead of keeping full-precision KV for all layers and all tokens, they only keep full KV in lower layers, and only keep residuals or incremental information in upper layers. This hierarchical compression further reduces overall KV volume, and since lower layers attention focuses more on local positions while upper layers focus more on abstract semantics, this non-uniform compression has very limited impact on model quality. It represents another direction: leveraging the hierarchical properties of the attention mechanism itself for non-uniform KV compression.
These new developments continue to validate the same direction: model architecture design is increasingly actively responding to the memory bottleneck during inference, rather than throwing all problems over to the system side. Only when the model defines the redundant structure of KV can the system do more fine-grained management on top of that structure.
10. KV Cache Is Actually Highly Redundant External Memory
If before 2023, many systems still defaulted to “KV cache is something that should be stored, we just need to fit it somewhere,” then after 2025, a growing consensus is:
KV cache is not the optimal representation—it's just the most conservative representation.
It retains all the historical state that Transformer saves to avoid information loss, but there is significant redundancy across time, layers, heads, and token dimensions.
R-KV's problem formulation is representative: reasoning models often generate extremely long chains-of-thought, which leads to “prohibitively large KV caches during inference.” The authors further point out that traditional KV compression methods often fail on reasoning models because reasoning tokens contain both truly critical reasoning states and lots of redundant tokens; through redundancy-aware compression, R-KV achieves near-full performance with only 10% KV, and can even outperform the full KV baseline with just 16% KV. 9
CAKE, from another angle, shows that redundancy is not uniformly distributed. It models KV eviction as a layer-aware, time-aware resource allocation problem: different layers have different attention patterns, and token importance changes over time. The result is quite aggressive: on LongBench and NeedleBench, CAKE maintains performance with only 3.2% KV cache, and significantly reduces decode latency under long contexts. 8
FlexiCache offers a third perspective: attention heads differ in temporal stability. Some heads consistently focus on similar top-K pages, while others change more frequently. The system therefore doesn't need to treat all heads equally: for stable heads, only the top-K pages need to stay on GPU while the rest can be offloaded to host memory; unstable heads retain more GPU-resident pages. 24
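A toy illustration of the head-stability signal FlexiCache exploits: track each head's top-K page set across decode steps and mark heads whose sets barely change as stable. (The threshold, data, and function names here are invented; the real system works over paged KV with its own stability metric.)

```python
def topk_overlap(prev_topk, curr_topk):
    """Jaccard overlap between a head's top-K page sets at two decode steps."""
    a, b = set(prev_topk), set(curr_topk)
    return len(a & b) / len(a | b)

def classify_heads(topk_history, threshold=0.8):
    """Toy FlexiCache-style split: a head whose top-K page set barely changes
    between steps is 'stable'; only its top-K pages need to stay on GPU."""
    stable, unstable = [], []
    for head, steps in topk_history.items():
        overlaps = [topk_overlap(steps[i], steps[i + 1]) for i in range(len(steps) - 1)]
        (stable if min(overlaps) >= threshold else unstable).append(head)
    return stable, unstable

history = {
    0: [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 5]],   # mostly the same pages
    1: [[1, 2, 3, 4], [9, 8, 7, 6], [5, 4, 3, 2]],   # jumps around every step
}
print(classify_heads(history, threshold=0.6))  # ([0], [1])
```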
When you put these three lines of work together, they jointly point to a conclusion:
- Many tokens contribute sparsely;
- Many layers and many heads do not contribute equally;
- A lot of historical state can be compressed, evicted, lazy-loaded, or partially recomputed.
At this point, KV cache increasingly behaves like an “external memory system” rather than “fixed intermediate results.” The traditional Transformer approach essentially archives history token-by-token; but a more mature memory system asks: which should stay in the hot tier? Which can be moved to colder tiers? Which are just redundant copies? Which should be predictively prefetched? Which can simply be forgotten? 8
Therefore, I argue that in many future inference systems, KV compression may have greater strategic value than parameter quantization. Parameter compression solves “can we fit the model”; KV compression solves “can the system actually run well—fast, stable, consistently.” The results from R-KV and CAKE already prove this to a large extent. 8
11. Why Many Production Systems Still Recompute History
At this point, a natural question arises: since KV cache is so important, why haven't many production systems fully leveraged it?
The answer is: the optimal solution for systems engineering is not equal to the optimal solution for single-request computation.
At the API level, many cloud services have long adopted an almost stateless interaction model.
OpenAI's Chat Completions documentation is clear: you need to provide messages in the request, which means “the conversation so far”; this requires the client to send the entire historical conversation back. Later, the Responses API added state mechanisms like previous_response_id and Conversations, allowing you to “store and retrieve conversation state across Response API calls,” and Prompt Caching enables automatic caching for exactly identical prefixes. But from a system perspective, this is still not the same as pinning the full KV cache of a user session to a specific GPU long-term. 25
Why don't people do strongly stateful KV serving directly?
Because that would bind sessions strongly to specific GPU/nodes, hurting load balancing, fault tolerance, and multi-tenancy isolation. As long as any request can go to any replica, the replica must be able to handle the request without prior context; so the most conservative design requires the client or upper-level service to resend the necessary context. Prompt Caching is a clever compromise: OpenAI documentation explicitly says cache hits only occur on exact prefix matches, and now you can improve hit rate via prompt_cache_key and extended prompt caching up to 24 hours. It does reduce prefill cost, but it's still an engineering compromise around prefix reuse, not a general cross-request KV memory system. 11
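The mechanics of exact-prefix caching can be sketched as a hash chain over fixed-size token blocks: because each block's key also covers everything before it, a single differing token invalidates every later block. (The block size and hashing scheme below are illustrative, not OpenAI's or any engine's actual implementation.)

```python
import hashlib

BLOCK = 16  # tokens per cache block, as in paged allocators

def block_keys(token_ids):
    """Exact-prefix cache keys: each block's key hashes the block AND everything
    before it, so a hit requires an identical prefix up to that block."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode())
        keys.append(h.copy().hexdigest())
    return keys

def cached_prefix_blocks(cache, token_ids):
    """Count leading blocks already cached; reuse stops at the first miss."""
    hits = 0
    for key in block_keys(token_ids):
        if key not in cache:
            break
        hits += 1
    return hits

a = list(range(64))                # an earlier request: 4 full blocks
b = list(range(48)) + [999] * 16   # same first 48 tokens, then diverges
print(cached_prefix_blocks(set(block_keys(a)), b))  # prints 3
```

This also makes the limitation visible: shared content that is not prefix-aligned hashes to different keys and gets zero reuse, which is exactly the gap CacheBlend-style approximate reuse targets.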
This is the essence of many “pseudo-agent” systems. Logically, they look like multi-turn agents: user asks a question, tool runs a round, model thinks another round. But physically, in many cases it's actually: every round reconstructs the prompt and does a full prefill. As the context grows longer, with more chains and more tools, what you see isn't “steady state retention”—it's “history repeated over and over.” This makes prefill cost very high, and explains why prompt caching / prefix reuse / PD disaggregation have become so important. 11
Therefore, the real contradiction for online systems isn't “engineers don't know KV cache is useful”—it's:
How to recover the benefits of KV cache reuse within a scalable, fault-tolerant, schedulable cloud architecture.
This is the contradiction that first-generation KV-centric architectures set out to resolve.
12. First-Generation KV-Centric Architectures: DistServe, Mooncake, LMCache
If vLLM and SGLang primarily solve “how to run better within a single engine or single node,” then starting in 2024, a wave of systems began to upgrade the question to:
Across requests, nodes, and phases—how should KV cache become a system-level resource?
In this line, DistServe, Mooncake, and LMCache are three key milestones.
DistServe: Disentangle prefill and decode first
DistServe starts from goodput: existing serving systems mix prefill and decode on the same GPUs, which causes prefill-decode interference and resource coupling. The former makes the two phases bottleneck each other; the latter means resource allocation cannot be tuned separately for TTFT versus TPOT goals. So DistServe implements prefill/decode disaggregation directly: it assigns prefill to one set of GPUs and decode to another, then jointly optimizes resource allocation and parallel strategy according to the application's TTFT/TPOT objectives. The paper reports that across different models and workloads, DistServe can significantly increase the number of serviceable requests under latency constraints. 5
The importance of DistServe isn't that it's the only answer; it's that it was the first to systematically turn a piece of common sense into an architectural principle: prefill and decode don't demand the same resources.
Mooncake: Elevate KV cache to scheduling core
Mooncake goes one step further. It defines itself as a KVCache-centric disaggregated architecture. In Mooncake, prefill and decode clusters are separated, and the system leverages the otherwise underutilized CPU, DRAM, SSD, and NIC resources in GPU clusters to build a distributed KV cache; the core is a scheduler designed around KV cache, which balances scheduling under throughput, SLO, and overload conditions. The Mooncake paper reports that on Kimi workloads, it can handle about 75% more requests in real-world scenarios. 26
Mooncake's approach is noteworthy: it no longer treats KV cache as “a temporary byproduct left after model execution”—it treats it as the object of system scheduling. This is a major paradigm shift.
LMCache: Turn KV cache into a shared layer
If Mooncake leans more toward architecture and scheduling, LMCache is more like abstracting KV into a pluggable cache layer. The LMCache paper clearly states its positioning: it extracts and stores KV cache from modern LLM engines like vLLM and SGLang, then shares these KV caches across queries and across engines, supporting both prefix reuse and cross-engine KV transfer under PD disaggregation. The paper reports that when combined with vLLM, throughput can be improved up to 15x on some workloads. 27
So if you force me to summarize the three with the most concise division of labor, I'd put it this way:
- DistServe: first separates prefill and decode, solving the two-phase interference problem;
- Mooncake: makes KV cache the core of distributed scheduling;
- LMCache: abstracts KV cache into a storable, migratable, reusable system layer.
At the same time, RadixAttention in SGLang provides another critical capability: automatically discovers and reuses shared prefixes in complex LM programs. The SGLang paper emphasizes that its runtime can leverage RadixAttention for KV cache reuse, and significantly improves throughput across multiple tasks. 28
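A minimal sketch of the idea behind RadixAttention follows. The real implementation compresses runs of tokens into radix-tree edges and attaches paged KV blocks to nodes; this toy uses one node per token and a placeholder in place of actual KV storage.

```python
class RadixNode:
    def __init__(self):
        self.children = {}    # token id -> RadixNode
        self.kv_block = None  # placeholder for the KV block at this position

def insert(root, tokens):
    """Record a served request's tokens so later requests can reuse the prefix."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, RadixNode())
    return node

def shared_prefix_len(root, tokens):
    """How many leading tokens of a new request already have KV in the tree."""
    node, n = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, n = node.children[t], n + 1
    return n

root = RadixNode()
insert(root, [5, 6, 7, 8, 9])                  # an earlier request's tokens
print(shared_prefix_len(root, [5, 6, 7, 99]))  # prints 3
```

Unlike a flat hash over full prompts, the tree discovers partial prefix overlap automatically, which is what makes it effective for LM programs with many branching continuations.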
Interestingly, by early 2026, the official documentation of both Mooncake and LMCache explicitly shows integration: Mooncake can be used as a backend storage and transfer engine for LMCache, and they even directly demonstrate a PD-disaggregated demo with LMCache + Mooncake + vLLM. In other words, in practice they aren't “either-or”—they're often used together. 29
13. Why Mooncake / LMCache Are Not the End
If Mooncake / LMCache have already elevated KV to first-class citizen, why have a flurry of “next-generation” papers still emerged after 2025?
Because once you make KV a system-level resource, new bottlenecks are immediately exposed.
First, KV is too large to move cheaply
The default assumption in Mooncake/LMCache is still: KV is worth storing, moving, reusing. But once context grows longer, requests multiply, and agents get more complex, KV cache volume quickly outgrows GPU capacity. At that point the problem changes from “do we have cache” to “how can cache efficiently move back to GPU from CPU/SSD/remote memory.”
The abstract of Strata almost directly calls out this issue: under long contexts, hierarchical caching is unavoidable, but reloading large cached contexts back to GPU encounters severe bottlenecks—fragmented I/O from paged layouts can't saturate bandwidth, and existing schedulers don't account for cache-loading delay, so the system becomes loading-bound instead of compute-bound. 30
Second, prefix reuse can conflict with latency goals
Automatic prefix reuse doesn't automatically mean better online latency. The NeurIPS 2025 paper on RadixAttention scheduling from LinkedIn did something very important: after formalizing online scheduling with prefix reuse, it proved that this problem is NP-hard under TTFT constraints. More intuitively, simply greedily pursuing longest-prefix-match can cause TTFT to explode for some requests. The authors therefore propose k-LPM to balance prefix reuse with fairness/waiting time. 10
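The tension can be made concrete with a toy scheduler. Pure longest-prefix-match ordering always serves the best cache hit next, so a request with no cached prefix can wait arbitrarily long; interleaving an arrival-order pick every k steps bounds that wait. (This is only a cartoon of the k-LPM idea; the paper's algorithm and analysis are considerably more involved.)

```python
def schedule(requests, k):
    """Toy k-LPM-style ordering. requests: list of (req_id, prefix_hit_len),
    where req_id doubles as arrival order. After every k longest-prefix-match
    picks, one request is served in arrival order so cold requests can't starve."""
    lpm = sorted(requests, key=lambda r: -r[1])   # best cache hit first
    fifo = sorted(requests, key=lambda r: r[0])   # oldest arrival first
    order, served = [], set()
    while len(order) < len(requests):
        take_fifo = len(order) % (k + 1) == k
        pool = fifo if take_fifo else lpm
        nxt = next(r for r in pool if r[0] not in served)
        served.add(nxt[0])
        order.append(nxt[0])
    return order

reqs = [(0, 0), (1, 100), (2, 90), (3, 80), (4, 70)]  # request 0 has no cache hit
print(schedule(reqs, k=10**9))  # ~pure LPM: the cold request goes last
print(schedule(reqs, k=2))      # fairness pick bounds its waiting time
```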
What does this tell us? It tells us that Mooncake / LMCache solved “can we use cache,” but haven't fully solved “when should we use cache, and who gets priority.”
Third, reuse patterns for agent workloads differ from vanilla LRU
KVFlow directly targets agentic workflows: current systems do prefix caching but usually use LRU eviction, which evicts an agent's KV cache right before it's called again. KVFlow therefore introduces workflow-aware Agent Step Graph, fine-grained eviction, and proactive prefetch. Essentially, it's saying: in agent scenarios, cache management must understand workflow structure, not just recent access time. 31
Fourth, preemption and context switching have costs
FastSwitch exposes another problem: existing block-based KV cache allocation reduces memory waste but leads to insufficient context switching granularity and high switching overhead. FastSwitch therefore proposes a fairness-aware serving system that specifically optimizes the efficiency of preemption/context switching. In other words, once KV becomes state, preemption isn't free anymore. 32
Fifth, exact prefix matching is inherently too restrictive
In scenarios like RAG, two requests often don't have “exactly identical prefixes”—they share a lot of retrieved context but aren't strictly prefix-aligned. CacheBlend moves forward on this problem: it doesn't require strict prefix matches, instead allowing reuse of already cached KV and only doing selective recompute for a small number of new tokens, which significantly improves TTFT and throughput for RAG. 33
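The selective-recompute idea can be sketched as a budgeted selection problem: reuse cached KV everywhere except for the tokens whose cached values are estimated to deviate most from what a full prefill would produce. (CacheBlend derives this estimate from early layers; the deviation scores below are random placeholders.)

```python
import random

def select_recompute(kv_deviation, recompute_ratio=0.15):
    """CacheBlend-flavored toy: reuse cached KV for most tokens and recompute
    only the tokens with the largest estimated KV deviation."""
    n = len(kv_deviation)
    r = max(1, int(n * recompute_ratio))
    by_dev = sorted(range(n), key=lambda i: kv_deviation[i], reverse=True)
    recompute = sorted(by_dev[:r])
    reuse = sorted(set(range(n)) - set(recompute))
    return recompute, reuse

random.seed(1)
dev = [random.expovariate(1.0) for _ in range(200)]  # stand-in deviation scores
recompute, reuse = select_recompute(dev, 0.15)
print(len(recompute), len(reuse))  # prints: 30 170
```

The point is that "cache hit" stops being binary: 85% of the prefill cost is saved even though the prefixes never matched exactly.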
So the reason Mooncake / LMCache “aren't the latest” isn't that they're obsolete—it's that they pushed the problem to the next stage. They solved: getting KV into the system picture. And the next generation of work solves: how to handle I/O, scheduling, agent reuse, hierarchical caching, fairness, and partial recompute after KV becomes a system resource.
14. Next-Generation Memory-Centric Architectures: Strata, CAKE, R-KV, KVFlow, FastSwitch, CacheBlend
I prefer to call work after 2025 memory-centric, not just KV-centric. Because by now the research focus isn't just “is cache being reused”—it's:
How to manage KV cache as a true hierarchical memory system.
Strata: Hierarchical caching + GPU-assisted I/O + cache-aware scheduling
Strata can be seen as the most systematic upgrade after Mooncake/LMCache. Its focus isn't "how to do prefix reuse" but rather: once long-context cache is stored across hierarchical tiers, how can it be moved back to the GPU efficiently? The paper proposes GPU-assisted I/O, GPU/CPU layout decoupling, and cache-aware scheduling, and reports up to 5x lower TTFT compared to vLLM + LMCache on long-context benchmarks. 30
This work is critical because it turns the question of “where does KV cache live” from a yes/no question into a multi-level question: HBM, CPU DRAM, SSD, and even remote memory pools can all be part of the cache hierarchy.
CAKE: Layer-aware eviction
CAKE's contribution is that it makes the question of "who to evict" global and structured. Instead of treating eviction as simple LRU, it does cascading allocation that combines layer-specific preferences with temporal dynamics. What's most memorable isn't the specific algorithm but the signal in its results: at many layers, for many tokens, at many time steps, KV is simply not valuable. 8
R-KV: Reasoning-specific compression
R-KV deserves special mention because it directly targets today's hottest workload: reasoning. The authors point out that reasoning models often generate excessively long outputs and that existing compression approaches fail on reasoning tasks, so they compress redundancy specifically for reasoning. The result is striking: 10% KV achieves near-full performance, and 16% KV can even outperform the full-KV baseline. 9
This tells us that reasoning scenarios aren't just “longer”—they also mean the redundancy structure differs from ordinary chat output.
FlexiCache: Hierarchical management by head stability
FlexiCache pushes memory policy down to the attention head level: stable heads keep only top-K pages on GPU; unstable heads keep more hot pages. This represents another important trend: cache management is increasingly getting closer to the internal structure of the model at finer granularity. 24
KVFlow: workflow-aware cache for agents
KVFlow takes a clear stand on the side of agent workflows. Its core idea is simple but powerful: agent workload isn't a random sequence—it's an Agent Step Graph with workflow dependencies; therefore cache policy should “know” which agent is more likely to be activated again next step. Combined with fully overlapped prefetch, it's already very similar to “predict what will be used next” in CPU cache. 31
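The difference from LRU can be sketched in a few lines: rank agents for eviction by how soon the step graph says they will run again, rather than by when they were last touched. (The data and function names are invented; KVFlow's Agent Step Graph and prefetch machinery are far richer.)

```python
def eviction_order(next_step, current_step, last_access):
    """Toy workflow-aware eviction: evict first the agent whose next scheduled
    step is farthest away, breaking ties by least-recent access (LRU)."""
    def key(agent):
        steps_away = next_step[agent] - current_step
        return (-steps_away, last_access[agent])
    return sorted(next_step, key=key)

next_step = {"planner": 9, "coder": 3, "critic": 4}   # when each agent runs again
last_used = {"planner": 7, "coder": 1, "critic": 6}   # last-access timestamps
# Plain LRU would evict 'coder' first (oldest access) right before step 3 needs
# it; workflow-aware ordering evicts 'planner', which isn't needed until step 9.
print(eviction_order(next_step, current_step=2, last_access=last_used))
```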
FastSwitch: context switching is also a cost
FastSwitch shows that when you make a large number of requests preemptible and interruptible KV-stateful processes, context switching itself becomes the bottleneck. This is increasingly like process switching in operating systems: it's not that preemption can't be done—it's that the granularity of preemption, context layout, and recovery cost must all be carefully designed. 32
CacheBlend: From exact prefix reuse to approximate reuse
CacheBlend is noteworthy because it points out that exact prefix reuse is too limited. For scenarios like RAG, shared context doesn't have to be “exactly the same prefix”—it's still worth reusing a large chunk of history. CacheBlend enables this approximate reuse with a small amount of selective recompute, which essentially turns “cache” and “compute” into a continuum rather than an either/or. 33
When you put all this work together, you'll find that the keywords for the new generation architecture aren't just simple “cache reuse”—they are:
- hierarchical tiers
- cache-aware scheduling
- workflow-aware eviction
- partial recompute
- proactive prefetch
- fairness-aware switching
This is already entirely the language of memory systems.
15. CXL: Why It Looks Like the Ultimate Answer But Remains Challenging in Practice
CXL is exciting because it superficially looks like a path that “has it all”: capacity can be expanded, memory pools can be shared, load/store semantics are more natural, and it looks more like real memory than RDMA.
Beluga's abstract is representative: it proposes using a CXL switch to let GPUs and CPUs access a shared large-scale memory pool, and emphasizes that this load/store access semantics delivers near-local memory latency while reducing synchronization and programming complexity. The paper reports significant improvements in TTFT and throughput over an RDMA baseline on vLLM. 34
But it would be too optimistic to conclude that “CXL is the final answer.”
First, CXL solves for capacity, not HBM-level bandwidth
The value of HBM isn't just that it's close—it's close and high-bandwidth. CXL can create larger shared memory pools, but it doesn't automatically give you HBM-level throughput. Beluga's improvement holds relative to more roundabout paths like RDMA; it doesn't mean CXL can replace GPU local VRAM without cost. 34
Second, CXL isn't automatically a paradise of coherence
TraCT's abstract is very instructive. It explicitly points out that to achieve rack-scale KV cache based on CXL shared memory, you must handle synchronization, consistency, and data management on non-coherent CXL memory. In other words, in practice, at least for many commercial-quality deployment configurations, CXL isn't “naturally globally coherent UMA.” You still have to implement software protocols yourself. 35
Third, data movement isn't always cheaper than recomputation
When KV is large enough, access is scattered enough, and contention is high enough, cross-tier data movement itself becomes the main cost. TraCT specifically identifies KV transfer as the fundamental bottleneck for PD disaggregation; CXL-SpecKV has to introduce speculative prefetch + FPGA compression/decompression to keep the cost of disaggregated KV cache under control. 35
Fourth, CXL solutions often require more “system patches”
Beluga reduces programming complexity through shared pool + native load/store; TraCT handles non-coherence through software-level synchronization; CXL-SpecKV compensates for bandwidth/latency issues through speculative prefetch and compression. You'll find that CXL isn't a “buy-and-use” hardware silver bullet—it's a platform that requires co-design of system/software/hardware. 34
So I'd summarize the real positioning of CXL in one sentence:
CXL is important, but it's more like “adding another memory tier” than “making all remote memory local HBM.”
Once you understand it this way, many phenomena are no longer contradictory: Why do CXL solutions always discuss prefetch, compression, shared pool, non-coherence, software sync? Because what they're essentially doing is—acknowledging that remote memory is still slower, then trying to hide that slowness as much as possible.
Recent Progress: Samsung's Hardware-Software Co-Optimization for the LBC Trilemma
A few years ago, there was a narrative that “CXL is dead in the AI era.” This was a narrow conclusion based on the static inference patterns of early LLMs. Today, as we transition to Agentic AI, the bottleneck has shifted. The demand for “memory” from agents—specifically KV cache sharing and Vector Database indexing—has made CXL memory expansion a non-negotiable architectural requirement.
Samsung recently published three cutting-edge works at IEEE that directly address the LBC (Latency-Bandwidth-Capacity) trilemma, breaking this problem through deep co-optimization of hardware, software, and system-device co-design:
- S-CHMU (CXL Hotness Monitoring Unit): Managing data placement in tiered memory traditionally incurs significant software overhead. S-CHMU provides hardware-assisted hotness tracking of memory access patterns, enabling low-latency, high-precision data migration between local DRAM and CXL-attached memory. 41
- CXL-PNM (CXL Processing Near Memory): Moving massive datasets across the CXL link creates bandwidth bottlenecks. The PNM (Processing Near Memory) approach offloads data-intensive operations—such as vector database search and KV cache lookup—directly to the memory device, drastically reducing data movement and maximizing effective bandwidth. 42
- Pangaea V2 (CXL Memory Orchestrator for K8s): For CXL to be usable in production, it must be cloud-native friendly. Pangaea V2 is an advanced Kubernetes orchestrator that abstracts CXL resources, allowing dynamic and transparent allocation of expanded memory pools across containerized workloads. 43
This further validates the core thesis of this article: the next wave of innovation in AI inference increasingly depends on co-optimization across systems, hardware, and software—not just improvements to the model itself.
16. LLM Serving Is Becoming an “Operating System Problem”
When you compare today's mainstream LLM serving systems with DNN serving systems from five years ago, the biggest change isn't larger models—it's that the abstraction layer has changed.
You can see a whole set of concepts that increasingly resemble an operating system:
- pages / blocks: PagedAttention is like virtual memory paging; 21
- radix tree: RadixAttention is like a prefix-based shared index; 28
- cache hit / miss: prompt caching, prefix reuse, KVFlow, CacheBlend all revolve around hit rate; 11
- eviction policy: CAKE, KVFlow, FastSwitch all study who should be evicted; 8
- prefetch: systems like KVFlow, CXL-SpecKV, Beluga all emphasize predictive loading; 31
- scheduling under SLO: DistServe, Revisiting SLO and Goodput, and the LinkedIn k-LPM theoretical work all treat serving as latency-constrained online scheduling. 5
Even the evaluation metrics are increasingly unlike traditional training. In the training era, people compare tokens/sec, TFLOPS utilization, samples/sec. In the serving era, what's increasingly critical is: TTFT, TPOT, P99, SLO attainment, goodput. Revisiting SLO and Goodput Metrics in LLM Serving explicitly points out that traditional goodput metrics can even encourage behaviors that contradict user experience, so we need to redefine the serving metric framework. 22
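These metrics are easy to state precisely. A minimal sketch (the SLO thresholds below are arbitrary examples):

```python
def request_metrics(t_submit, token_times):
    """Serving-era metrics for one request: TTFT is the time to the first
    output token; TPOT is the mean gap between subsequent tokens."""
    ttft = token_times[0] - t_submit
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot

def slo_attainment(metrics, ttft_slo, tpot_slo):
    """Fraction of requests meeting both latency SLOs (a goodput-style view)."""
    ok = [m for m in metrics if m[0] <= ttft_slo and m[1] <= tpot_slo]
    return len(ok) / len(metrics)

m1 = request_metrics(0.0, [0.2, 0.25, 0.30, 0.35])  # TTFT 0.2 s, TPOT ~50 ms
m2 = request_metrics(0.0, [1.5, 1.54, 1.58])        # slow prefill: TTFT 1.5 s
print(m1, m2, slo_attainment([m1, m2], ttft_slo=0.5, tpot_slo=0.1))
```

Note that raw tokens/sec would rate these two requests similarly, while the SLO view flags the second as a failure; that gap is exactly what the goodput-metric critique is about.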
This means a very fundamental change: We used to say “the core of LLM infra is distributed training systems.” Today a more accurate statement might be: the core of LLM serving is becoming an online memory operating system for KV state.
This “operating system” needs to solve:
- How to split state into pages;
- How to place it across multiple storage tiers;
- How to decide who stays resident on GPU;
- How to predict who will be used next;
- How to balance fairness, throughput, and tail latency;
- How to compress, recompute, preempt, and migrate when necessary;
- How to make prefix sharing coexist with multi-tenancy.
From this perspective, Mooncake/LMCache are just the first step to “building the file system”; Strata, KVFlow, FastSwitch, Beluga, and TraCT are moving toward a “complete memory manager.”
17. What Comes After KV Cache: Mamba, RWKV, and “Online Compressed Memory”
If we push the question all the way to the end, a more radical question emerges:
Since everyone accepts that KV cache is large, expensive, and redundant, why not just get rid of KV cache entirely?
This is the attraction of directions like Mamba and RWKV.
Mamba's core claim is clear: Transformers have a fundamental computational efficiency problem on long sequences, so it uses selective state spaces to build a linear-time sequence model. The paper emphasizes that Mamba is fully recurrent, and reports strong performance on language modeling while delivering better inference throughput than Transformers of the same scale. 36
RWKV's statement is also direct: it wants to combine the advantages of Transformer parallel training with the efficient inference of RNN, achieving linearly scalable inference. 37
Put more systematically, Transformer KV cache is a kind of external memory that archives history token-by-token; while models like Mamba/RWKV are more like trying “online compressed memory”: instead of keeping every piece of history intact, they continuously compress history into a recurrent state.
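The contrast in state accounting can be made concrete. The dimensions below are hypothetical, and the recurrent-state formula is a simplification (real Mamba layers also carry convolution state, for example), but the scaling behavior is the point: KV grows linearly with history, while the recurrent state stays fixed.

```python
def transformer_state_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Transformer external memory: K and V per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

def recurrent_state_bytes(n_layers, d_model, d_state, dtype_bytes=2):
    """Mamba/RWKV-style online compressed memory: fixed size, history-independent."""
    return n_layers * d_model * d_state * dtype_bytes

for seq_len in (1_000, 100_000, 1_000_000):
    kv = transformer_state_bytes(32, 8, 128, seq_len)
    rec = recurrent_state_bytes(32, 4096, 16)
    print(f"{seq_len:>9} tokens: KV {kv / 2**20:9.0f} MiB vs recurrent {rec / 2**20:.0f} MiB")
```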
This path is extremely attractive because it fundamentally avoids the problem of “must read full historical K/V every step.” But reality is clear: as of 2026, Transformers remain the mainstream for industrial deployment, and system innovation around KV cache is still progressing rapidly. This shows that the short-to-medium-term consensus in industry isn't “replace Transformers immediately”—it's first optimize the KV-based world to the extreme. 36
Therefore, the future after KV cache is most likely not a story of instantaneous replacement—it's two lines progressing in parallel:
- One line continues to turn Transformer serving into a truly mature memory system;
- Another line explores how state-space / recurrent / hybrid architectures reduce the cost of historical storage at the model level.
These two paths don't conflict. On the contrary, all the system understanding we've gained around KV cache today will become the foundation for understanding the next generation of sequence architectures.
18. Conclusion: Three Fundamental Trends
If I had to condense the entire article into three main themes, I'd summarize it this way.
First Trend: From Compute to Memory
In the Transformer training era, the focus was FLOPS; in the Transformer inference era, the focus is increasingly working set, bandwidth, latency, and state reuse. FlashAttention is important, but it's not the endgame; the real endgame question is: how should historical information be stored, accessed, moved, and compressed? 19
Second Trend: From Stateless to Stateful
Cloud APIs naturally lean stateless for scalability, fault tolerance, and multi-tenancy; but agents, multi-turn dialogue, long CoT, and RL rollout naturally need state. So almost all the most important serving innovations of the past two years attempt to reconcile this contradiction: prompt caching, prefix reuse, RadixAttention, LMCache, Mooncake, DistServe, KVFlow, Beluga, TraCT. Their common goal: recover the benefits of state reuse without sacrificing the elasticity of cloud architecture. 11
Third Trend: From Model to System, to Memory OS
We used to think of “model innovation” and “system optimization” as separate. But today, structural designs like GQA directly affect bandwidth; compression methods like R-KV directly affect serving cost; systems like KVFlow/FastSwitch/Strata deeply leverage temporal stability, prefix structure, and inter-layer differences within the model. At this stage, model, engine, scheduling, and memory hierarchy are already tightly coupled. 23
Therefore, this article doesn't really want to argue “whether Mac is better or NVIDIA is better,” nor “whether Mooncake and LMCache are obsolete.” What I want to emphasize more is:
The next stage of large language model inference is increasingly less a “neural network operator optimization” problem and more a “how to turn history into manageable memory” systems problem.
Once you accept this, many seemingly fragmented phenomena from the past become unified:
- Why does RL rollout make inference more expensive than training;
- Why does Mac's unified memory become relevant again;
- Why is FlashAttention not enough;
- Why have vLLM, SGLang, Mooncake, LMCache, Strata, Beluga, TraCT emerged simultaneously;
- Why is CXL both exciting and problematic;
- Why might KV cache compression be more important than parameter compression;
- Why does future LLM serving increasingly look like an operating system.
And this, perhaps, is the most important change in AI infrastructure that deserves careful understanding over the past two years.
Further Reading (grouped by topic)
Fundamentals
- Attention Is All You Need — Transformer and masked decoder. 6
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — bidirectional encoder approach. 3
- Language Models are Unsupervised Multitask Learners — representative early text in the GPT line. 38
Kernel / Engine
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. 19
- Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). 21
- Orca: A Distributed Serving System for Transformer-Based Generative Models. 20
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. 4
- SGLang: Efficient Execution of Structured Language Model Programs. 28
Model Structure and KV
- Fast Transformer Decoding: One Write-Head is All You Need (MQA). 7
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. 23
- CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences. 8
- R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration. 9
- FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management. 24
Architecture and Systems
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. 5
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. 26
- LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. 27
- Strata: Hierarchical Context Caching for Long Context Language Model Serving. 30
- KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows. 31
- FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Serving Systems. 32
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. 33
- LLM Query Scheduling with Prefix Reuse and Latency Constraints. 10
- Revisiting SLO and Goodput Metrics in LLM Serving. 22
RL / Agent / Post-training
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework. 1
- ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient RL Post-Training. 12
- AgentRL / Agent Lightning — agentic RL infrastructure work. 13
Hardware and Memory Systems
- Apple official materials on unified memory, M2 Ultra, M3 Ultra, M5 Pro/Max, MLX. 14
- NVIDIA Grace Hopper / GH200 official coherent memory materials. 39
- Beluga, TraCT, CXL-SpecKV — CXL/shared memory direction. 34
- Samsung S-CHMU, CXL-PNM, Pangaea V2 — recent hardware-software co-optimization for CXL. 41