GPU: the powerful machinery, and the memory. Everyone obsesses over how smart a model is. Almost nobody asks the question that actually crashes it in production: where does the model put what it is remembering, and what happens when that space runs out?

This is a story about a trap. A clever engineer, serving a reasoning model, watches the GPU run out of memory. So they do the obvious thing. They throw away ninety percent of what the model is holding. And the memory usage does not move. Not a little. Almost not at all. The GPU still runs out.

It looks like magic, or a bug. It is neither. It is one of the most elegant and least understood problems in modern AI infrastructure, and once you see it, you cannot unsee it. I want to draw it for you the way I wish someone had drawn it for me — no heavy mathematics, just the mental model, an analogy you already know, and the one number that quietly decides everything.

I run small and mid-sized models on my own hardware, on purpose, because in my world — payments, identity, financial-crime compliance — where the data goes is the entire game. And the moment you bring intelligence home onto your own machine, this exact problem is the first wall you hit. So this is not theory for me. It is the wall between a 32-billion-parameter reasoning model running quietly on a card you own, and a cloud bill you did not want.

Why evicting 90% of the KV cache frees almost nothing — and the one number that actually decides whether your reasoning model lives or dies.

A plain-English field guide to the KV cache: why it grows, why deleting most of it frees nothing, the two hidden walls underneath, and how a recent NVIDIA and MIT method finally walks around all of it. One garage, a thousand floors, and a single counting mistake that crashes GPUs every day.

01 – The Setup

What the model is actually remembering

When a model generates text, it does not re-read everything from scratch for each new word. That would be agonisingly slow. Instead it keeps a running set of notes about every token it has seen so far. Those notes are the KV cache — Key and Value vectors, stored at every single layer of the model.

Here is the part that matters. The cache only grows. Every new token the model writes appends a fresh set of K and V vectors, across every layer, and nothing is released while generation continues. A short answer barely touches it. But a reasoning model, the kind that “thinks out loud” through a long chain of thought before answering, can generate tens of thousands of tokens in a single turn. Every one of them leaves its notes behind.

Diagram 1. Each token a model generates appends its Key and Value vectors at every layer. The cache only grows during a turn — nothing is freed. A long chain of thought is a long, ever-growing stack of notes.

The headline number is how smart the model is. The number that crashes you is how much it is trying to remember.

Diagram 2. Model weights sit flat and fit comfortably. The KV cache climbs with every generated token until it crosses the card’s memory ceiling — here, around 24K tokens on a 24GB GPU — and the run dies mid-thought.

Put real figures on it. A 32K-token chain of thought caches roughly 32K tokens’ worth of KV vectors. Run a 32-billion-parameter model with 4-bit weights on a 24GB GPU and you will hit out-of-memory somewhere around 24K tokens, before the model has even finished thinking. The weights fit. The reasoning does not. That is the wall.
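To see why the weights fit and the reasoning does not, it helps to do the arithmetic once. The sketch below assumes a hypothetical 32B-class configuration (64 layers, 8 KV heads of dimension 128, 16-bit cache entries). Those values are illustrative assumptions, not figures from the run described above, so treat the output as an order-of-magnitude guide rather than a reproduction of the 24K-token threshold.

```python
# Rough KV-cache arithmetic. All configuration values are illustrative
# assumptions for a 32B-class model, not measurements.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes appended to the cache by ONE generated token."""
    # Factor 2: one Key vector and one Value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(tokens, **cfg):
    """Total cache size in GiB after `tokens` tokens."""
    return tokens * kv_bytes_per_token(**cfg) / 2**30


# Hypothetical configuration (assumption, see lead-in).
ASSUMED_CFG = dict(num_layers=64, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
# 32 billion parameters at 4 bits (0.5 byte) each.
WEIGHTS_GIB = 32e9 * 0.5 / 2**30

if __name__ == "__main__":
    for tokens in (1_024, 8_192, 24_576, 32_768):
        cache = kv_cache_gib(tokens, **ASSUMED_CFG)
        print(f"{tokens:>6} tokens: cache {cache:5.2f} GiB, "
              f"weights + cache {WEIGHTS_GIB + cache:5.2f} GiB")
```

Under these assumptions each token costs 256 KiB, so 24K tokens add about 6 GiB on top of roughly 15 GiB of weights. Activations, the CUDA context and allocator overhead are not modelled here, and they take up much of what remains on a 24GB card.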

02 – The Obvious Fix That Fails

The obvious fix that quietly fails

So you reach for the obvious lever. Attention is sparse: when a model writes a token, it leans heavily on a handful of earlier tokens and barely glances at the rest. If most tokens are nearly ignored, why keep them? Score every token by importance, keep the vital ones, evict the rest. Throw away ninety percent of the cache and free ninety percent of the memory.
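In its simplest form, that lever looks like the sketch below: rank tokens by an importance score and keep only the top fraction. The scores are made-up stand-ins (in practice they would come from attention weights or some proxy), and `keep_top_fraction` is a hypothetical helper, not an API from any serving engine.

```python
import heapq


def keep_top_fraction(scores, fraction=0.10):
    """Return the positions of the highest-scoring tokens, in token order.

    scores[i] is the importance of token i (hypothetical values).
    """
    k = max(1, int(len(scores) * fraction))
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


if __name__ == "__main__":
    # Toy scores: a few tokens matter a lot, most barely register.
    scores = [0.01, 0.90, 0.02, 0.03, 0.75, 0.01, 0.02, 0.60, 0.01, 0.02,
              0.01, 0.02, 0.85, 0.01, 0.03, 0.02, 0.01, 0.02, 0.01, 0.70]
    print(keep_top_fraction(scores, fraction=0.25))
```

The survivors land wherever the scores put them, with no regard for how the memory underneath is laid out.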

You run it. The cache logically holds a tenth of what it did. And the GPU’s memory usage stays almost exactly where it was. You evicted ninety percent of the tokens and freed almost nothing.

This is the trap. And the reason is not in the model at all. It is in how the memory underneath is handed out. You did not delete the wrong tokens. You deleted them in the wrong shape.

03 – The Parking Garage

The parking garage in your GPU

Modern serving engines like vLLM use something called paged attention. Instead of giving each sequence one long contiguous slab of memory, they chop GPU memory into fixed physical blocks, each one holding the KV vectors for about 16 tokens. It is the same idea as virtual memory in an operating system: lots of small, equal pages handed out on demand.

Here is the rule that makes or breaks everything: a block returns to the allocator only when every single slot inside it is empty. One surviving token in a block of sixteen, and the whole block stays locked.
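The rule is easy to state in code. The toy allocator below is my own illustration of the block-level rule just described, not vLLM's implementation; `ToyPagedCache` and its methods are hypothetical names.

```python
BLOCK_SIZE = 16  # tokens per physical block, as in the text


class ToyPagedCache:
    """Tracks which slots are live; a block is reclaimable only when empty."""

    def __init__(self, num_tokens):
        self.num_blocks = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        # live[b] is the set of occupied slot indices inside block b.
        self.live = [set() for _ in range(self.num_blocks)]
        for t in range(num_tokens):
            self.live[t // BLOCK_SIZE].add(t % BLOCK_SIZE)

    def evict(self, token):
        """Mark one token's slot as empty. No memory returns yet."""
        self.live[token // BLOCK_SIZE].discard(token % BLOCK_SIZE)

    def reclaimable_blocks(self):
        """Blocks with zero live slots: the only ones the allocator gets back."""
        return [b for b, slots in enumerate(self.live) if not slots]


if __name__ == "__main__":
    cache = ToyPagedCache(num_tokens=32)   # two blocks of 16
    for t in range(31):                    # evict all but the last token
        cache.evict(t)
    # Block 0 is empty; block 1 still holds one survivor and stays locked.
    print(cache.reclaimable_blocks())
```

Thirty-one of thirty-two tokens are gone, yet only one of the two blocks can be handed back.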

Now picture a multi-storey car park where each floor holds sixteen cars, and the rule is the same: management can only close and reclaim a floor when every space on it is empty. Your eviction logic selects tokens by importance, and important tokens are scattered everywhere — one on this floor, two on that one. So you drive ninety percent of the cars out. But because the survivors are sprinkled across nearly every floor, almost every floor still has a car or two parked on it. And a floor with even one car cannot be closed.

You removed most of the cars and freed not one floor.

Diagram 3. Top: a full cache, every block packed. Bottom: importance-based eviction removes most tokens, but survivors (navy) land scattered, leaving at least one in nearly every block — so the allocator can reclaim none of them. Evicted tokens: high. Freed blocks: zero.

Let me make it concrete with numbers, because this is the whole article in one table.

| State | Tokens in cache | Evicted | Surviving | Blocks (16 ea.) | Blocks emptied | Memory freed |
|---|---|---|---|---|---|---|
| Before eviction | 16,000 | | 16,000 | 1,000 | | |
| After 90% eviction | 16,000 | ~14,000 | ~2,000 | 1,000 | ~0 | ≈ nothing |

Two thousand survivors, sprinkled across a thousand blocks, is more than enough to leave a token in almost every block. The allocator frees almost none of them. You did the work, paid the cost, and reclaimed nothing.
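A quick simulation makes the gap between the two numbers visible. It assumes survivors are scattered uniformly at random, which is my simplification; real importance scores are not uniform, so the exact count will differ. Under that assumption the expected share of fully emptied blocks is roughly (14,000/16,000)^16, about 12%: not literally zero, but nowhere near the 87.5% of tokens evicted. A more even spread of survivors, with a car or two on every floor, returns fewer blocks still.

```python
import math
import random

BLOCK_SIZE = 16


def freed_blocks(num_tokens, num_survivors, seed=0):
    """Count blocks left with no survivor after a uniformly random eviction.

    Assumes num_tokens is a multiple of BLOCK_SIZE.
    """
    rng = random.Random(seed)
    survivors = rng.sample(range(num_tokens), num_survivors)
    occupied = {t // BLOCK_SIZE for t in survivors}
    return num_tokens // BLOCK_SIZE - len(occupied)


def expected_freed_fraction(num_tokens, num_survivors):
    """Chance that all 16 slots of one block fall among the evicted tokens."""
    evicted = num_tokens - num_survivors
    return math.comb(evicted, BLOCK_SIZE) / math.comb(num_tokens, BLOCK_SIZE)


if __name__ == "__main__":
    tokens, survivors = 16_000, 2_000
    freed = freed_blocks(tokens, survivors)
    print(f"tokens evicted: {(tokens - survivors) / tokens:.1%}")
    print(f"blocks freed:   {freed} of {tokens // BLOCK_SIZE}")
    print(f"expected share: {expected_freed_fraction(tokens, survivors):.1%}")
```

Two numbers come out, and they are far apart. The first is the one that looks like progress; the second is the one the allocator honours.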

Diagram 4. The counting mistake in one picture. Eviction rate is a token-level number that looks like progress. Freed blocks is the number the allocator actually honours — and it stays near zero.

Memory is freed per block, not per token. The number that decides whether compression works is freed blocks, not evicted tokens.

That single sentence is the thing most teams never internalise. They optimise the eviction rate — a token-level number — and wonder why the GPU still falls over. They were counting cars. The garage only cares about floors.

04 – The Two Hidden Walls

The two hidden walls underneath

Say you try to be clever and reuse the empty spaces inside half-full blocks. Now you hit two more walls, and they are why this problem stayed unsolved for so long.

Wall one: you break the order of time. The cache is meant to sit in token order: position 38, 39, 40, 41. Suppose token 40 has been evicted, token 16,001 arrives, and you drop it into the slot token 40 used to hold. The cache now reads 38, 39, 16,001, 41. Time is scrambled. Attention can still compute the right answer from a scrambled cache, but only if every slot now carries a separate note recording its true position. That bookkeeping is a real, recurring cost, paid on every step, forever.
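Here is wall one in miniature. The sketch is my own toy model of a slot array, not code from any engine: once a new token is written into a freed slot, physical order stops matching token order, and a parallel `positions` list (a hypothetical name) has to travel with the cache.

```python
def reuse_hole(slots, positions, hole_index, new_token, new_position):
    """Write a new token into a freed slot and record its true position."""
    slots = list(slots)
    positions = list(positions)
    slots[hole_index] = new_token
    positions[hole_index] = new_position
    return slots, positions


def in_token_order(positions):
    """True when physical slot order still matches token order."""
    return all(a < b for a, b in zip(positions, positions[1:]))


if __name__ == "__main__":
    slots = ["kv38", "kv39", "kv40", "kv41"]
    positions = [38, 39, 40, 41]
    print(in_token_order(positions))           # order intact before reuse

    # Token 40 is evicted and its slot is reused for token 16,001.
    slots, positions = reuse_hole(slots, positions, 2, "kv16001", 16001)
    print(positions, in_token_order(positions))
```

Every attention step then has to read `positions` alongside the keys, which is the recurring bookkeeping cost described above.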

Wall two: your fast kernel hid the very thing you need. To evict tokens by importance, you need attention scores, that is, a record of which tokens were actually attended to. But the fast attention kernels everyone uses, like FlashAttention, never store the full attention-score grid. They compute attention in small tiles and throw the grid away the instant they are done. That discarding is precisely why they are fast and memory-light. So to recover the scores your eviction needs, you must fall back to a slower, eager attention that materialises the whole grid, and you hand back the speed you came for.
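The difference between the two styles of attention can be shown for a single query in plain Python. This is a simplified illustration of the online-softmax idea that tiled kernels build on, written for clarity; it is not FlashAttention's actual kernel, and the function names are hypothetical. The eager version returns the full list of attention weights that eviction would want. The tiled version produces the same output while only ever holding one tile of scores.

```python
import math


def eager_attention(q, keys, values):
    """Materialise every score, then softmax. Returns (output, weights)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


def tiled_attention(q, keys, values, tile=2):
    """Online softmax over tiles of keys. Returns the output only."""
    dim = len(values[0])
    running_max = -math.inf
    normaliser = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        tile_scores = [sum(a * b for a, b in zip(q, k))
                       for k in keys[start:start + tile]]
        new_max = max(running_max, max(tile_scores))
        # Rescale the partial sums computed under the old maximum.
        scale = math.exp(running_max - new_max)
        normaliser *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(tile_scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            normaliser += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
        # tile_scores is overwritten next iteration: per-token scores are not kept.
    return [a / normaliser for a in acc]


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
    out_eager, weights = eager_attention(q, keys, values)
    out_tiled = tiled_attention(q, keys, values)
    print(out_eager, out_tiled)   # same output
    print(weights)                # only the eager path can hand these back
```

After the tiled pass, all that remains is a running maximum, a normaliser and the accumulated output. There is nothing left to rank tokens by.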

Diagram 5. Wall one: reusing holes scrambles token order, forcing per-slot position bookkeeping. Wall two: fast kernels never save attention scores, so importance-based eviction must fall back to slow eager attention. Two quiet taxes that cancel the saving you were chasing.

| The wall | What it costs you | Why it exists |
|---|---|---|
| Freed blocks ≈ 0 | The memory never actually returns | Allocation is per-block; survivors are scattered |
| Broken token order | Per-slot position bookkeeping, every step | Reusing holes puts time out of sequence |
| Missing attention scores | Lose FlashAttention speed | Fast kernels discard the score grid by design |

Three walls, one conclusion: naive eviction is a beautiful idea that production memory refuses to honour.

05 – How NVIDIA Walks Around It

How NVIDIA walks around the whole problem

In April 2026, researchers from MIT, NVIDIA and Zhejiang University published TriAttention (“Efficient Long Reasoning with Trigonometric KV Compression”). What I admire about it is that it does not fight the walls head-on. It changes the question so the walls are never built.

It never needs attention scores. Instead of asking “which tokens were attended to” — the question that forces you back to slow eager attention — it scores a token from the geometry of the model’s Key and Query vectors, before the rotary position encoding (RoPE) is applied. In that pre-RoPE space, Q and K vectors cluster around stable, fixed centres that barely move with position. So importance can be read from the shape of the vectors themselves, at any distance, without ever looking at an attention grid. Wall two vanishes. FlashAttention stays.

It physically tidies the garage. Every 128 decoded tokens, TriAttention runs a compaction pass: the surviving tokens slide forward and pack together into a dense prefix, closing the holes that eviction tore open. Now whole blocks at the tail empty out cleanly and return to the allocator, and the cache stays in true token order, so there is no position bookkeeping to pay. Wall one (broken token order) and the freed-blocks wall vanish together.
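The compaction idea can be sketched independently of how survivors are chosen. The code below is my own minimal illustration of "slide survivors forward, then free the tail", using the 16-token block size from earlier; it is not TriAttention's implementation, and `compact` is a hypothetical helper.

```python
BLOCK_SIZE = 16


def compact(cache, keep):
    """Pack surviving entries into a dense prefix, preserving token order.

    cache: list of per-token entries (stand-ins for K/V vectors)
    keep:  list of booleans, True where the token survives
    Returns (compacted_cache, blocks_before, blocks_after).
    """
    survivors = [entry for entry, k in zip(cache, keep) if k]
    blocks_before = -(-len(cache) // BLOCK_SIZE)       # ceiling division
    blocks_after = -(-len(survivors) // BLOCK_SIZE)
    return survivors, blocks_before, blocks_after


if __name__ == "__main__":
    cache = list(range(16_000))                 # token positions as stand-ins
    keep = [i % 8 == 0 for i in cache]          # 2,000 evenly spread survivors
    packed, before, after = compact(cache, keep)
    print(f"survivors: {len(packed)}")
    print(f"blocks: {before} -> {after} (freed {before - after})")
    print(f"order preserved: {packed == sorted(packed)}")
```

With survivors spread two to a block, no block would empty on its own. After packing, the 2,000 survivors occupy 125 blocks and the other 875 can be returned, so freed blocks finally tracks evicted tokens.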

Diagram 6. TriAttention’s two moves. Left: score tokens from stable pre-RoPE Q/K geometry, so no attention grid is needed and FlashAttention stays. Right: compact survivors into a dense prefix so tail blocks empty and return to the allocator, with token order preserved.

| Metric | Full attention | TriAttention |
|---|---|---|
| KV cache memory | baseline | ~10.7× less |
| Decode throughput | baseline | ~2.5× faster |
| Reasoning accuracy | baseline | matched |
| 32B reasoner on a 24GB card | out of memory | runs |

The breakthrough was not deleting more cleverly. It was making the deletions line up, so the memory could actually come back.

The results are not incremental. On long reasoning traces, TriAttention matches full-attention accuracy while decoding 2.5× faster and using 10.7× less KV memory — and it lets a 32-billion-parameter reasoning model run on a single 24GB RTX 4090, a workload that simply out-of-memories under full attention.

06 – Why It Matters

Bring the reasoning home

You could file this as a niche kernel curiosity. I would not. It points at something I keep arguing, from a different direction each time.

The future everyone is selling is agentic and reasoning-heavy — models that think for thousands of tokens before they act. That thinking has a physical bill, and the bill is paid in memory long before it is paid in money. KV cache growth is the hidden tax on every long chain of thought. If you do not understand where the wall is, you reach for the cloud reflexively, rent a bigger GPU, and never ask whether you needed to.

But understand the real bottleneck — freed blocks, not evicted tokens — and the picture inverts. A method like TriAttention is the difference between a serious reasoning model living on a single card you own, on your own desk, with the network unplugged, and the same model living on someone else’s meter. For my world, where a leaked record is not an inconvenience but a breach of someone’s life, that distinction is not about cost. It is about sovereignty.

Own what compounds, rent what rots. A reasoning model that runs on hardware you control, on data that never leaves the room, is exactly the kind of thing that compounds.

So the next time someone tells you AI is all about bigger, smarter models, remember the parking garage. The cleverness is in the model. The survival is in the memory. And the people who quietly win are the ones who learned to count floors, not cars.

Pick a reasoning model tonight, watch your KV cache climb as it thinks, and feel where the wall actually is. That wall, not the leaderboard, is where the real engineering lives.

Conclusion – The future isn’t only about how clever a model can be — it’s about whether you can afford to run its thinking on terms you control. Reasoning models will sense, plan, and act across longer and longer chains of thought. But every one of those thoughts leaves notes behind, and notes need somewhere to live. The teams that thrive won’t be the ones with the biggest cloud bill.

They’ll be the ones who understood the humble physics of memory: that you free space per block, not per token, and that the whole game is making your deletions line up. Master that, and intelligence you thought you had to rent quietly becomes intelligence you can own.

Points to Note

Knowing which compression trick to reach for — eviction, quantization, compaction, or some blend — is a judgement call that only experience and the problem in front of you can settle. If you got the freed-blocks-versus-evicted-tokens distinction before I spelled it out, take a bow. If not, you now hold the one idea that most of the field learns the hard way, in production, at 2am.

Books + Other readings Referred

  • TriAttention: Efficient Long Reasoning with Trigonometric KV Compression — MIT, NVIDIA & Zhejiang University, April 2026.
  • NVIDIA Research blog: KV Cache Compression and Its Infra Problems.
  • vLLM / PagedAttention and FlashAttention reference work; NVIDIA KVPress library.
  • Research through the open internet, white papers, and lab and hands-on experience of @AILabPage (self-taught learners group) members.

By V Sharma

A seasoned technology specialist with over 22 years of experience, I specialise in fintech and possess extensive expertise in integrating fintech with trust (blockchain), technology (AI and ML), and data (data science). My expertise includes advanced analytics, machine learning, and blockchain (including trust assessment, tokenization, and digital assets). I have a proven track record of delivering innovative solutions in mobile financial services (such as cross-border remittances, mobile money, mobile banking, and payments), IT service management, software engineering, and mobile telecom (including mobile data, billing, and prepaid charging services). With a successful history of launching start-ups and business units on a global scale, I offer hands-on experience in both engineering and business strategy. In my leisure time, I'm a blogger, a passionate physics enthusiast, and a self-proclaimed photography aficionado.
