Looped Transformer blocks enable test-time compute scaling; the work proves they converge to fixed points, forming inference stages that mirror feedforward models.
Grigory Sapunov shared the paper "A Mechanistic Analysis of Looped Reasoning Language Models" by Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael Bronstein, and Xiaowen Dong, published April 13, 2026 (arxiv.org/abs/2604.11791), along with an April 19, 2026 analysis on arxiviq.substack.com. Through theoretical proofs and experiments, the work shows that looped language models, in which the same Transformer block is applied repeatedly, do not produce chaotic noise but instead self-organize into predictable "stages of inference," decoupling reasoning from the traditional feedforward depth scaling that depends on parameter count.
Core Theoretical Foundation
The central result is that state convergence implies attention pattern convergence: once a looped model's hidden state stabilizes, the mixing performed by attention heads locks into a constant pattern. Concretely, for an initial input sequence matrix X ∈ R^{T×D}, the norm difference of the attention output S^ℓ between recurrence steps t and t−1 is strictly bounded, guaranteeing structured logic rather than chaos.
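The convergence behavior the bound describes can be illustrated with a minimal numeric sketch. This is not the paper's architecture: a contractive `tanh` update stands in for one looped Transformer block, and we simply watch the norm difference between successive states shrink.

```python
import numpy as np

# Minimal sketch (not the paper's model): a contractive update standing
# in for one looped block. If the map is a contraction, ||Z_t - Z_{t-1}||
# shrinks geometrically, so anything computed from Z (e.g. attention
# patterns) also stabilizes.
rng = np.random.default_rng(0)
T, D = 8, 16                       # sequence length, hidden width
W = rng.normal(size=(D, D))
W *= 0.5 / np.linalg.norm(W, 2)    # spectral norm 0.5 -> contraction
b = rng.normal(size=D)

Z = rng.normal(size=(T, D))        # initial hidden state
diffs = []
for _ in range(30):
    Z_next = np.tanh(Z @ W + b)    # one "loop" of the block
    diffs.append(np.linalg.norm(Z_next - Z))
    Z = Z_next

# Successive differences decay toward zero: the state converges
# to a fixed point.
assert diffs[-1] < 1e-3 * diffs[0]
```

Because the spectral norm of W is below 1 and `tanh` is 1-Lipschitz, each iteration shrinks the gap between successive states by at least that factor, which is the toy analogue of the paper's bound.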
Why Input Injection Is Necessary
Naively looping the hidden state alone causes representation collapse. To avoid this, before each loop the original input X must be concatenated with the current hidden state Z_{i−1} and projected through a learned matrix W_I ∈ R^{2D×D}; this is input injection. The mechanism preserves distinct operational stages; without it, every layer of a pre-norm network collapses to the same point, destroying the model's ability to differentiate.
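The shape bookkeeping of input injection can be sketched as follows. The names (`input_inject`, `W_I`) are illustrative, not taken from the released code; only the concatenate-then-project structure matches the text.

```python
import numpy as np

# Hypothetical sketch of input injection: before each loop, the original
# input X is concatenated with the current latent Z along the feature
# axis and projected back to width D through a learned matrix W_I.
rng = np.random.default_rng(1)
T, D = 8, 16
X = rng.normal(size=(T, D))               # original embedded prompt
Z = rng.normal(size=(T, D))               # current latent state
W_I = rng.normal(size=(2 * D, D)) / np.sqrt(2 * D)

def input_inject(X, Z, W_I):
    """Concatenate (T, D) + (T, D) -> (T, 2D), then project to (T, D)."""
    return np.concatenate([X, Z], axis=-1) @ W_I

Z_in = input_inject(X, Z, W_I)
assert Z_in.shape == (T, D)               # block sees a D-wide state again
```

Because X re-enters at every iteration, the loop cannot forget the prompt, which is what keeps the fixed points distinct instead of collapsing to one.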
Fixed Points and Trajectory Behavior
With input injection, the hidden state does not collapse to a single point; instead, each layer inside the looped block converges to its own fixed point, and the block as a whole traces a consistent cyclic orbit in latent space. The pipeline is: the prompt passes through "prelude" layers to produce an initial state S'_k, then runs l loop iterations Z_i ← S'_k(X, Z_{i−1}), and finally "coda" layers predict the next token, as visualized in Figures 1 and 5.
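The prelude → looped block → coda control flow can be sketched in a few lines. The prelude, block, and coda below are toy linear stand-ins for the actual Transformer layers; only the control flow matches the description.

```python
import numpy as np

# Toy sketch of the prelude -> looped block -> coda pipeline. All three
# components are linear stand-ins; the point is the wiring, not the layers.
rng = np.random.default_rng(2)
T, D, V, n_loops = 8, 16, 50, 4    # tokens, width, vocab size, loop count

W_pre  = rng.normal(size=(D, D)) / np.sqrt(D)          # "prelude" layer
W_blk  = rng.normal(size=(D, D)) / np.sqrt(D)          # shared looped block
W_I    = rng.normal(size=(2 * D, D)) / np.sqrt(2 * D)  # input injection
W_coda = rng.normal(size=(D, V)) / np.sqrt(D)          # "coda" head

X = rng.normal(size=(T, D))        # embedded prompt
Z = np.tanh(X @ W_pre)             # prelude: initial state
for _ in range(n_loops):           # Z_i <- block(inject(X, Z_{i-1}))
    Z = np.tanh(np.concatenate([X, Z], axis=-1) @ W_I @ W_blk)
logits = Z @ W_coda                # coda: per-token next-token logits
assert logits.shape == (T, V)
```

Note that the same `W_blk` is reused at every iteration: compute scales with `n_loops` at test time while the parameter count stays fixed, which is the whole point of the looped design.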
Measurement Metric and Self-Organized Stages
The "ColSum Concentration" metric (the normalized entropy of the attention matrix's column sums) quantifies how information mixing evolves; high concentration means attention heads focus on specific tokens rather than mixing broadly. Experiments (e.g., Figure 8) show that looped networks naturally self-organize into "stages of inference" that almost perfectly mirror feedforward models, shifting from early context mixing to late prediction; the Frobenius-norm similarity in Figure 2 confirms that attention behavior locks into a predictable repeating rhythm.
Training Stability Validation
To show this is intrinsic to the architecture rather than an artifact of a particular pretraining run, the study examines both large-scale pretrained models and small networks trained from scratch, using the standard next-token cross-entropy loss Loss = −∑_i y_i log(p_i), evaluated only on the final hidden state, with a fixed 4 loops throughout training on 3.7B tokens. Ablation studies compare "pre-norm" (Retrofitted Llama) against "sandwich norm" (Huginn-0125), confirming that input injection is strictly necessary for stable fixed points.
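The loss itself is the standard one; a minimal sketch (shapes and names illustrative) shows it applied only to the logits from the final loop iteration, matching the training setup described above.

```python
import numpy as np

# Standard next-token cross-entropy, Loss = -sum_i y_i log(p_i), applied
# only to the logits produced after the final loop (the coda output).
def cross_entropy(logits, targets):
    # numerically stable log-softmax over the vocab axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_p[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(3)
T, V = 8, 50
logits = rng.normal(size=(T, V))        # coda output after the last loop
targets = rng.integers(0, V, size=T)    # next-token labels
loss = cross_entropy(logits, targets)
assert loss > 0
```

Gradients flow back through every loop iteration into the same shared weights, so training at a fixed 4 loops is what the extrapolation experiments later stress-test.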
Model Performance
- Ouro 1.4B: trained from scratch; replicates the algorithmic stages of feedforward models through looping, without a parameter explosion.
- Retrofitted Llama: mathematically stable, supporting aggressive extrapolation.
These models demonstrate that looping with reused weights preserves feedforward-style reasoning stages, providing a theoretical basis for test-time compute scaling.
Historical Context
Building on the recurrent-depth foundations of Universal Transformers and on recent probabilistic halting frameworks, this work focuses on internal dynamics, showing that reasoning stages emerge as an intrinsic property of Transformer blocks rather than as a pure artifact of depth. Compared with conventional feedforward stacking, this shift toward test-time compute dynamically adjusts each token's compute budget, avoiding the fixed-depth bottleneck.
Extrapolation Limits and Challenges
Despite this robustness, extrapolating to loop counts unseen at training is inconsistent: Retrofitted Llama stays stable, but Ouro shows structural drift and an "overthinking" problem that worsens if allowed to loop indefinitely. The study is limited to a single looped block and does not cover sequences of multiple looped blocks; detecting trajectories such as "orbits" or "sliders" relies on heuristic algorithms with hand-tuned hyperparameters, and the authors call for more rigorous automated methods.
Implications as a Parameter-Efficiency Blueprint
The analysis provides a structural blueprint, verifying that scaling test-time compute yields genuine algorithmic phases rather than noise, paving the way for efficient inference engines. The authors emphasize that this decouples reasoning from raw parameter count, opening optimizations such as aggressive sparsification during stable mixing phases and compressing intermediate representations in the looped MLP. Sapunov sees this as key to the industry's shift toward dynamic reasoning models: controlling the loop's fixed points ensures that longer thinking produces better answers rather than failures, outperforming pure parameter scaling.
Code and Further Resources
Code is open-sourced at github.com/TrelisResearch/nanochat/tree/recursive; the full mechanistic analysis, mathematical proofs, and training stability details are at arxiviq.substack.com/p/a-mechanistic-analysis-of-looped. Sapunov accompanies the post with a comic, noting that a picture is worth a thousand tokens, and invites discussion on test-time compute versus parameter scaling.
The trend shows looped models moving from fears of chaos toward controllable structure, heralding a new era of parameter-efficient reasoning, though extrapolation instability remains a hard boundary, echoing the paper's practical guidance on architecture design.
1/
— Grigory Sapunov (@che_shr_cat) April 20, 2026
We scale test-time compute by looping Transformer blocks. But do recurrent weights create chaotic noise or structured logic?
Turns out, they mathematically converge to fixed points that mimic physical network depth. 🧵 pic.twitter.com/ZEJvPfoaMk
2/
— Grigory Sapunov (@che_shr_cat) April 20, 2026
A Mechanistic Analysis of Looped Reasoning Language Models by Blayney, Arroyo, @AaronCourville, @mmbronstein, @pcastr et al.
They prove looped models decouple functional reasoning stages from raw parameter count. Here is how the mechanics actually work.
3/
— Grigory Sapunov (@che_shr_cat) April 20, 2026
The core theory: state convergence implies attention pattern convergence.
They bound the norm difference in attention outputs between recurrence steps. As the residual stream stabilizes, attention head mixing locks into a constant state.
4/
— Grigory Sapunov (@che_shr_cat) April 20, 2026
But there is a catch. If you just loop the latent state, representations collapse.
The fix? Input injection. Concatenating the original input with the current latent state before each loop is strictly necessary to maintain distinct operational stages.
5/
— Grigory Sapunov (@che_shr_cat) April 20, 2026
With input injection, the latent state does not collapse to a single point.
Instead, each layer in the recurrent block converges to a separate fixed point. The entire block traces a consistent cyclic orbit in latent space. pic.twitter.com/5TKVI1VPCO
6/
— Grigory Sapunov (@che_shr_cat) April 20, 2026
How do we measure this? The ColSum Concentration metric, or the entropy of attention matrix column sums.
It proves looped networks naturally self-organize into "stages of inference" that almost perfectly mirror massive feedforward networks. pic.twitter.com/2hZAu054sg
7/
— Grigory Sapunov (@che_shr_cat) April 20, 2026
It is not perfect yet. Extrapolating to unseen test-time recurrences is inconsistent.
While some models stay mathematically stable, others suffer from "overthinking" and structural drift if allowed to loop infinitely. Extrapolation remains a hard boundary.
8/
— Grigory Sapunov (@che_shr_cat) April 20, 2026
Why I think this matters: we now have a structural blueprint for parameter-efficient reasoning.
Scaling inference compute yields real algorithmic phases, not just noise. This opens the door to aggressive sparsification during stable mixing phases.
9/
— Grigory Sapunov (@che_shr_cat) April 20, 2026
Full deep dive into the mechanics, mathematical proofs, and training stability here: https://t.co/ZJ6eSHTcqN
Original paper: https://t.co/79bBd4KB9n
What are your thoughts on test-time compute vs parameter scaling? Let me know below.
10/
— Grigory Sapunov (@che_shr_cat) April 20, 2026
Comic as usual. Sometimes a picture is worth a thousand tokens. #MachineLearning #AI pic.twitter.com/igO43h5GqX
