Can a language model learn, end to end, to manage its own KV cache, forgetting as it reasons?
The Core Idea of NGC
"Neural Garbage Collection" (NGC) lets a language model learn, end to end, to reason and to manage its KV cache at the same time, via reinforcement learning with a single task reward and with no supervised fine-tuning (SFT), proxy objectives, or natural-language summaries. Author Michael Y. Li emphasizes deep learning's central lesson: capability emerges from end-to-end optimization, not hand-designed heuristics or strong inductive biases. Existing efficiency methods lean heavily on manual approaches; NGC instead lets the model manage its own memory.
The KV-Cache Bottleneck of Long Reasoning
When a language model thinks with longer chains of thought, it can solve problems that are otherwise out of reach, but every reasoning step inflates the KV cache, creating a bottleneck for test-time scaling. NGC trains the model with reinforcement learning (RL) to manage its cache while it reasons: the model pauses periodically, (1) scores cache entries with its own attention mechanism, (2) samples entries to evict via the Gumbel top-k trick, and (3) continues reasoning from the pruned cache, as sketched below.
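To make that three-step loop concrete, here is a minimal, self-contained PyTorch sketch of the control flow. The decoding step and the attention scoring are toy stand-ins, and every name is our illustration, not the authors' implementation:

```python
import torch

def attention_score(cache: torch.Tensor) -> torch.Tensor:
    # Toy stand-in: one relevance score per cached entry (higher = keep).
    # NGC reuses the LM's own attention weights for this; here we fake it.
    return torch.randn(cache.shape[0])

def ngc_decode_loop(n_steps: int = 512, d: int = 16,
                    evict_every: int = 128, evict_frac: float = 0.5) -> torch.Tensor:
    """Control-flow sketch: reason (cache grows), pause periodically,
    score entries, sample evictions, continue from the pruned cache."""
    cache = torch.empty(0, d)
    for t in range(n_steps):
        new_entry = torch.randn(1, d)          # stand-in for one decoding step's KV pair
        cache = torch.cat([cache, new_entry])  # the cache grows with every token
        if (t + 1) % evict_every == 0:         # pause to garbage-collect
            scores = attention_score(cache)                       # (1) score entries
            k = int(evict_frac * cache.shape[0])
            noise = -torch.log(-torch.log(torch.rand_like(scores)))
            evict_idx = (scores + noise).topk(k).indices          # (2) Gumbel top-k sample
            keep = torch.ones(cache.shape[0], dtype=torch.bool)
            keep[evict_idx] = False
            cache = cache[keep]                                   # (3) resume on pruned cache
    return cache
```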
End-to-End Joint Optimization
Language models are already trained to reason with RL: every token is a discrete action sampled from the model and shaped by a task reward. NGC casts KV-cache eviction as another discrete action, likewise sampled from the model with an exact log-probability, which makes it directly amenable to policy-gradient optimization. A single outcome-based task reward then trains everything end to end: what the model evicts shapes its memory, its memory shapes its reasoning, and the correctness of that reasoning is the reward. No extra parameters, no separate scoring module, no staged training.
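A minimal REINFORCE-style surrogate illustrates how one outcome reward can flow through both kinds of actions. The function name and the plain REINFORCE estimator are our assumptions for illustration; the paper presumably uses a modern policy-gradient variant:

```python
import torch

def ngc_surrogate_loss(token_logps: torch.Tensor,
                       eviction_logps: torch.Tensor,
                       reward: float,
                       baseline: float = 0.0) -> torch.Tensor:
    """Plain REINFORCE surrogate (a sketch, not the paper's estimator):
    one outcome reward scales the summed log-probs of *all* discrete
    actions in the trajectory, token emissions and cache evictions alike."""
    trajectory_logp = token_logps.sum() + eviction_logps.sum()
    return -(reward - baseline) * trajectory_logp  # minimize => ascend the reward
```

Because evictions shape the memory that later tokens condition on, credit flows through both action types from the same scalar reward, with no auxiliary objective.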
Scoring and Sampling Innovations
Pre-training already equips the model with a good prior over relevance, so NGC scores cache entries with the model's own attention mechanism, requiring no training beyond the RL itself. Cache eviction is ordinarily deterministic; NGC makes it a stochastic action by sampling evictions with the Gumbel top-k trick over the model's scores. Both tokens and evictions are then discrete actions with computable log-probabilities, so policy gradients (the workhorse of reasoning-model training) can optimize the two jointly.
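For reference, a self-contained PyTorch sketch of Gumbel top-k sampling with an exact log-probability (here, the Plackett-Luce probability of the ordered draw); the interface is our own illustration:

```python
import torch

def gumbel_top_k(scores: torch.Tensor, k: int):
    """Sample k indices without replacement, proportional to exp(scores),
    via the Gumbel top-k trick; also return the exact log-probability of
    the ordered draw (a Plackett-Luce chain of softmax terms)."""
    gumbels = -torch.log(-torch.log(torch.rand_like(scores)))  # Gumbel(0, 1) noise
    idx = (scores + gumbels).topk(k).indices  # descending order = sequential draw order
    logp = scores.new_zeros(())
    chosen = torch.zeros_like(scores, dtype=torch.bool)
    for i in idx:
        avail = scores.masked_fill(chosen, float("-inf"))  # items still in the pool
        logp = logp + scores[i] - torch.logsumexp(avail, dim=0)
        chosen[i] = True                                   # drop item from the pool
    return idx, logp
```

Because `logp` is differentiable with respect to `scores`, it can be added to the token log-probs in a policy-gradient objective like the surrogate loss above.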
Empirical Results and Baseline Comparisons
On AIME 2025, Countdown, and AMC, NGC substantially outperforms hand-designed cache-management baselines at 2-4x reductions in peak KV-cache size, maintaining strong reasoning accuracy relative to the full-cache upper bound; Countdown and AMC retain high accuracy even at 2-3x compression. The author argues that existing methods are limited by managing the cache from outside the model, and that NGC shows end-to-end learning can replace such hand design.
Budget-Aware Interoception
NGC can also make the model aware of its own memory budget: before the model reasons, the eviction rate is injected into the prompt as <eviction_rate>50%</eviction_rate>, so the model "senses" its memory constraint before it thinks. The authors call this budget-aware interoception; it is reported to give an 8-13% boost.
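A trivial sketch of that prompt injection, using the exact tag format from the thread's example (the helper name is ours, not the authors' API):

```python
def inject_budget(prompt: str, eviction_rate: float) -> str:
    """Prepend the budget tag so the model 'senses' its memory
    constraint before it reasons."""
    return f"<eviction_rate>{round(eviction_rate * 100)}%</eviction_rate>\n{prompt}"

# inject_budget("Solve the following competition problem ...", 0.5)
# -> "<eviction_rate>50%</eviction_rate>\nSolve the following competition problem ..."
```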
Broader Vision and Paper Details
This paper is a first step toward a broader vision: the same end-to-end optimization that drives capability can also drive efficiency, opening an additional dimension of self-improvement in which language models become both more capable and more efficient. Michael Y. Li co-authored the paper with Jubayer Ibn Hamid, Emily B. Fox, and Noah D. Goodman; it was published on April 20, 2026 (https://arxiv.org/abs/2604.18002). Its guiding question, if a model can learn to reason, why can't it learn to forget, challenges the scalability limits of today's hand-managed caches.
Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason?
— Michael Y. Li (@michaelyli__) April 22, 2026
Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for… pic.twitter.com/nf4jpYE3g0
When LMs think longer, they can solve problems that are otherwise out of reach, but this rapidly grows the KV cache, creating a bottleneck to test-time scaling.
— Michael Y. Li (@michaelyli__) April 22, 2026
Our approach: let RL train the LM to both reason and manage its KV cache. Concretely, as the LM reasons, it…
LMs are already trained to reason with RL, where every token is a discrete action sampled from the model and shaped by task reward. So it’s natural to cast KV cache eviction as another discrete action sampled from the same LM.
— Michael Y. Li (@michaelyli__) April 22, 2026
A single outcome-based task reward then jointly… pic.twitter.com/sB4ZpVHCvc
The results: at 2-4× reduction in peak KV cache size, NGC significantly outperforms hand-designed baselines for cache management on AIME 2025 and Countdown and maintains strong reasoning performance relative to full-cache upper bound. 4/n pic.twitter.com/hp8LTKS5oz
— Michael Y. Li (@michaelyli__) April 22, 2026
But how do we make this work?
— Michael Y. Li (@michaelyli__) April 22, 2026
First — a new kind of action. LMs produce tokens. They were never trained, during pre-training or SFT, to act on their own cache. So how do we ask one to?
We score cache entries with the LM’s own attention mechanism. Pre-training already gives it a…
Second — make cache evictions efficiently sampleable. Cache eviction is typically deterministic. To leverage policy gradients, we make it a stochastic action with an exact log-probability by sampling cache evictions via the Gumbel top-k trick using the scores computed by the LM.…
— Michael Y. Li (@michaelyli__) April 22, 2026
Put it together, and the recipe is simple: in addition to generating tokens, the LM periodically makes another kind of discrete decision (which cache entries to evict).
— Michael Y. Li (@michaelyli__) April 22, 2026
A single outcome-based reward trains everything. No SFT. No teacher model. No auxiliary objectives. No…
Bonus: we can make the model aware of its own budget.
— Michael Y. Li (@michaelyli__) April 22, 2026
Before the LM reasons, we put the eviction rate in the prompt: <eviction_rate>50%</eviction_rate>. The model “senses” its memory constraint before it thinks.
We call this budget-aware interoception. It gives an 8–13% boost at… pic.twitter.com/4VKX3KGH0M
This paper is a first step towards a broader vision we’re really excited by: the same end-to-end optimization that drives capability can also drive efficiency, enabling an additional dimension of self-improvement in which language models become more capable and more efficient.…
— Michael Y. Li (@michaelyli__) April 22, 2026
