Can a language model learn, end to end, to manage its own KV cache, forgetting as it reasons?
The Core Idea of NGC
"Neural Garbage Collection" (NGC) lets a language model learn, end to end, to reason and to manage its KV cache at the same time, via reinforcement learning with a single task reward and with no supervised fine-tuning (SFT), proxy objectives, or natural-language summaries. Author Michael Y. Li emphasizes deep learning's central lesson: capability emerges from end-to-end optimization, not hand-designed heuristics or strong inductive biases. Existing efficiency methods lean heavily on manual approaches; NGC instead lets the model manage its own memory.
The KV-Cache Bottleneck of Long Reasoning
When a language model thinks with longer chains of thought, it can solve problems that are otherwise out of reach, but every reasoning step inflates the KV cache, creating a bottleneck for test-time scaling. NGC trains the model with reinforcement learning (RL) to manage its cache while it reasons: the model pauses periodically, (1) scores cache entries with its own attention mechanism, (2) samples entries to evict via the Gumbel top-k trick, and (3) continues reasoning from the pruned cache, as sketched below.
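To make that three-step loop concrete, here is a minimal, self-contained PyTorch sketch of the control flow. The decoding step and the attention scoring are toy stand-ins, and every name is our illustration, not the authors' implementation:

```python
import torch

def attention_score(cache: torch.Tensor) -> torch.Tensor:
    # Toy stand-in: one relevance score per cached entry (higher = keep).
    # NGC reuses the LM's own attention weights for this; here we fake it.
    return torch.randn(cache.shape[0])

def ngc_decode_loop(n_steps: int = 512, d: int = 16,
                    evict_every: int = 128, evict_frac: float = 0.5) -> torch.Tensor:
    """Control-flow sketch: reason (cache grows), pause periodically,
    score entries, sample evictions, continue from the pruned cache."""
    cache = torch.empty(0, d)
    for t in range(n_steps):
        new_entry = torch.randn(1, d)          # stand-in for one decoding step's KV pair
        cache = torch.cat([cache, new_entry])  # the cache grows with every token
        if (t + 1) % evict_every == 0:         # pause to garbage-collect
            scores = attention_score(cache)                       # (1) score entries
            k = int(evict_frac * cache.shape[0])
            noise = -torch.log(-torch.log(torch.rand_like(scores)))
            evict_idx = (scores + noise).topk(k).indices          # (2) Gumbel top-k sample
            keep = torch.ones(cache.shape[0], dtype=torch.bool)
            keep[evict_idx] = False
            cache = cache[keep]                                   # (3) resume on pruned cache
    return cache
```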
End-to-End Joint Optimization
Language models are already trained to reason with RL: every token is a discrete action sampled from the model and shaped by a task reward. NGC casts KV-cache eviction as another discrete action, likewise sampled from the model with an exact log-probability, which makes it directly amenable to policy-gradient optimization. A single outcome-based task reward then trains everything end to end: what the model evicts shapes its memory, its memory shapes its reasoning, and the correctness of that reasoning is the reward. No extra parameters, no separate scoring module, no staged training.
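A minimal REINFORCE-style surrogate illustrates how one outcome reward can flow through both kinds of actions. The function name and the plain REINFORCE estimator are our assumptions for illustration; the paper presumably uses a modern policy-gradient variant:

```python
import torch

def ngc_surrogate_loss(token_logps: torch.Tensor,
                       eviction_logps: torch.Tensor,
                       reward: float,
                       baseline: float = 0.0) -> torch.Tensor:
    """Plain REINFORCE surrogate (a sketch, not the paper's estimator):
    one outcome reward scales the summed log-probs of *all* discrete
    actions in the trajectory, token emissions and cache evictions alike."""
    trajectory_logp = token_logps.sum() + eviction_logps.sum()
    return -(reward - baseline) * trajectory_logp  # minimize => ascend the reward
```

Because evictions shape the memory that later tokens condition on, credit flows through both action types from the same scalar reward, with no auxiliary objective.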
Scoring and Sampling Innovations
Pre-training already equips the model with a good prior over relevance, so NGC scores cache entries with the model's own attention mechanism, requiring no training beyond the RL itself. Cache eviction is ordinarily deterministic; NGC makes it a stochastic action by sampling evictions with the Gumbel top-k trick over the model's scores. Both tokens and evictions are then discrete actions with computable log-probabilities, so policy gradients (the workhorse of reasoning-model training) can optimize the two jointly.
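For reference, a self-contained PyTorch sketch of Gumbel top-k sampling with an exact log-probability (here, the Plackett-Luce probability of the ordered draw); the interface is our own illustration:

```python
import torch

def gumbel_top_k(scores: torch.Tensor, k: int):
    """Sample k indices without replacement, proportional to exp(scores),
    via the Gumbel top-k trick; also return the exact log-probability of
    the ordered draw (a Plackett-Luce chain of softmax terms)."""
    gumbels = -torch.log(-torch.log(torch.rand_like(scores)))  # Gumbel(0, 1) noise
    idx = (scores + gumbels).topk(k).indices  # descending order = sequential draw order
    logp = scores.new_zeros(())
    chosen = torch.zeros_like(scores, dtype=torch.bool)
    for i in idx:
        avail = scores.masked_fill(chosen, float("-inf"))  # items still in the pool
        logp = logp + scores[i] - torch.logsumexp(avail, dim=0)
        chosen[i] = True                                   # drop item from the pool
    return idx, logp
```

Because `logp` is differentiable with respect to `scores`, it can be added to the token log-probs in a policy-gradient objective like the surrogate loss above.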
Empirical Results and Baseline Comparisons
On AIME 2025, Countdown, and AMC, NGC substantially outperforms hand-designed cache-management baselines at 2-4x reductions in peak KV-cache size, maintaining strong reasoning accuracy relative to the full-cache upper bound; Countdown and AMC retain high accuracy even at 2-3x compression. The author argues that existing methods are limited by managing the cache from outside the model, and that NGC shows end-to-end learning can replace such hand design.
Budget-Aware Interoception
NGC can also make the model aware of its own memory budget: before the model reasons, the eviction rate is injected into the prompt as <eviction_rate>50%</eviction_rate>, so the model "senses" its memory constraint before it thinks. The authors call this budget-aware interoception; it is reported to give an 8-13% boost.
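A trivial sketch of that prompt injection, using the exact tag format from the thread's example (the helper name is ours, not the authors' API):

```python
def inject_budget(prompt: str, eviction_rate: float) -> str:
    """Prepend the budget tag so the model 'senses' its memory
    constraint before it reasons."""
    return f"<eviction_rate>{round(eviction_rate * 100)}%</eviction_rate>\n{prompt}"

# inject_budget("Solve the following competition problem ...", 0.5)
# -> "<eviction_rate>50%</eviction_rate>\nSolve the following competition problem ..."
```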
Broader Vision and Paper Details
This paper is a first step toward a broader vision: the same end-to-end optimization that drives capability can also drive efficiency, opening an additional dimension of self-improvement in which language models become both more capable and more efficient. Michael Y. Li co-authored the paper with Jubayer Ibn Hamid, Emily B. Fox, and Noah D. Goodman; it was published on April 20, 2026 (https://arxiv.org/abs/2604.18002). Its guiding question, if a model can learn to reason, why can't it learn to forget, challenges the scalability limits of today's hand-managed caches.
Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason?
— Michael Y. Li (@michaelyli__) April 22, 2026
Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for… pic.twitter.com/nf4jpYE3g0
When LMs think longer, they can solve problems that are otherwise out of reach, but this rapidly grows the KV cache, creating a bottleneck to test-time scaling.
— Michael Y. Li (@michaelyli__) April 22, 2026
Our approach: let RL train the LM to both reason and manage its KV cache. Concretely, as the LM reasons, it…
LMs are already trained to reason with RL, where every token is a discrete action sampled from the model and shaped by task reward. So it’s natural to cast KV cache eviction as another discrete action sampled from the same LM.
— Michael Y. Li (@michaelyli__) April 22, 2026
A single outcome-based task reward then jointly… pic.twitter.com/sB4ZpVHCvc
The results: at 2-4× reduction in peak KV cache size, NGC significantly outperforms hand-designed baselines for cache management on AIME 2025 and Countdown and maintains strong reasoning performance relative to full-cache upper bound. 4/n pic.twitter.com/hp8LTKS5oz
— Michael Y. Li (@michaelyli__) April 22, 2026
But how do we make this work?
— Michael Y. Li (@michaelyli__) April 22, 2026
First — a new kind of action. LMs produce tokens. They were never trained, during pre-training or SFT, to act on their own cache. So how do we ask one to?
We score cache entries with the LM’s own attention mechanism. Pre-training already gives it a…
Second — make cache evictions efficiently sampleable. Cache eviction is typically deterministic. To leverage policy gradients, we make it a stochastic action with an exact log-probability by sampling cache evictions via the Gumbel top-k trick using the scores computed by the LM.…
— Michael Y. Li (@michaelyli__) April 22, 2026
Put it together, and the recipe is simple: in addition to generating tokens, the LM periodically makes another kind of discrete decision (which cache entries to evict).
— Michael Y. Li (@michaelyli__) April 22, 2026
A single outcome-based reward trains everything. No SFT. No teacher model. No auxiliary objectives. No…
Bonus: we can make the model aware of its own budget.
— Michael Y. Li (@michaelyli__) April 22, 2026
Before the LM reasons, we put the eviction rate in the prompt: <eviction_rate>50%</eviction_rate>. The model “senses” its memory constraint before it thinks.
We call this budget-aware interoception. It gives an 8–13% boost at… pic.twitter.com/4VKX3KGH0M
This paper is a first step towards a broader vision we’re really excited by: the same end-to-end optimization that drives capability can also drive efficiency, enabling an additional dimension of self-improvement in which language models become more capable and more efficient.…
— Michael Y. Li (@michaelyli__) April 22, 2026
