Study Shows Parallel Sampling Outperforms Sequential Sampling in Large Reasoning Models
Research from Google DeepMind finds that when large reasoning models (LRMs) tackle math and coding tasks, parallel sampling typically outperforms sequential sampling, and that the key reason is sequential sampling's lack of exploration.
Background and Findings
The study tested Qwen3, DeepSeek-R1 distilled models, and Gemini 2.5. It found that on math and coding tasks, parallel sampling generally outperforms sequential sampling, even though the latter should in theory be more expressive.
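To make the two strategies being compared concrete, here is a minimal sketch. The function names and the `generate` placeholder are illustrative assumptions, not the paper's code or any real model API:

```python
# Minimal sketch of the two test-time scaling strategies under comparison.
# `generate` is a stand-in for a real LLM call, not an actual API.

def generate(prompt: str) -> str:
    """Placeholder LLM call; returns a dummy candidate solution."""
    return f"candidate({len(prompt)} chars of context)"

def parallel_sampling(problem: str, n: int) -> list[str]:
    # n independent attempts, each from a fresh context: attempts
    # cannot see (or copy) one another, which encourages diversity.
    return [generate(problem) for _ in range(n)]

def sequential_sampling(problem: str, n: int) -> list[str]:
    # Each attempt is conditioned on all previous attempts and asked
    # to revise; this is the setup the study finds under-explores.
    solutions: list[str] = []
    for _ in range(n):
        history = "".join(f"\nPrevious attempt: {s}" for s in solutions)
        solutions.append(generate(problem + history + "\nTry again."))
    return solutions
```

Under sequential sampling each attempt's prompt grows with the full history of earlier attempts, which is exactly the context the later analysis blames for copying behavior.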
Hypotheses Tested and Factors Ruled Out
The team proposed three hypotheses to explain the performance gap and tested each experimentally; the first two were ruled out:
- Aggregation: parallel sampling might win because it uses an aggregation step. Experiments showed that even after adding the same aggregation mechanism to sequential sampling, the performance gap persists.
- Context length: sequential sampling might be handicapped by the longer input context it has to process. Adding irrelevant context of comparable or even greater length to parallel sampling did not hurt its performance.
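The aggregation step in question is typically a majority vote over final answers (self-consistency style). A minimal sketch, assuming answers are comparable strings:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Self-consistency-style aggregation: return the most frequent
    # final answer among the sampled candidates.
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["42", "41", "42", "42", "7"]))  # prints "42"
```

The controlled experiment amounts to applying the same `majority_vote` to the candidates from both strategies, so aggregation alone cannot explain the gap.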
The Core Cause of the Performance Gap
The study confirms that the main problem with sequential sampling is insufficient exploration.
- Sequential sampling makes the model over-confident in its earlier solutions, so it tends to repeat the same paths.
- Analysis with embedding models shows that solutions produced by parallel sampling are less similar to one another, indicating broader exploration.
- Mechanistic interpretability analysis finds induction heads operating over the sequential-sampling context, which push the model to copy earlier solutions, suppressing exploration of new ones.
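The embedding-based diversity measurement can be sketched as mean pairwise cosine similarity over the sampled solutions. This toy version uses a bag-of-words vector purely for illustration; the study uses a trained embedding model:

```python
import math
from itertools import combinations

def toy_embed(text: str) -> dict[str, float]:
    # Toy bag-of-words vector; stands in for a real embedding model.
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_similarity(solutions: list[str]) -> float:
    # Lower mean similarity = the sample set covers more distinct approaches.
    pairs = list(combinations(solutions, 2))
    return sum(cosine(toy_embed(x), toy_embed(y)) for x, y in pairs) / len(pairs)
```

A set of near-duplicate solutions scores close to 1.0, while a set of genuinely different approaches scores lower, which is the signature the study reports for parallel sampling.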
Practical Recommendations
The team offers concrete suggestions for improving sequential sampling:
- Scale up environment feedback: with high-quality feedback (such as using hidden test cases to detect runtime errors), sequential sampling can catch up with parallel sampling.
- Use rejection sampling: actively explore fresh alternatives and discard failing candidates to improve results.
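The two suggestions combine naturally: draw fresh candidates and let the environment feedback reject failures. A minimal sketch, where `propose` and `feedback` are hypothetical callables standing in for the model and for the hidden-test-case check:

```python
import itertools

def rejection_sample(problem, propose, feedback, max_tries=8):
    # Draw fresh candidates and reject any that the environment
    # feedback flags (e.g. hidden test cases catching runtime errors);
    # return the first candidate that passes, or None after max_tries.
    for _ in range(max_tries):
        candidate = propose(problem)
        if feedback(candidate):
            return candidate
    return None

# Toy usage: a proposer cycling through guesses, and feedback that
# accepts only the correct answer.
guesses = itertools.cycle(["3", "5", "4"])
answer = rejection_sample("2+2?", lambda p: next(guesses), lambda c: c == "4")
```

Because each proposal starts fresh rather than conditioning on rejected attempts, this loop keeps the exploration benefit of parallel sampling while still using the feedback signal.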
[1/9] Glad to share another project I did during my time at @GoogleDeepMind, which was actually finished half a year ago.
— Xiangming Gu (@gu_xiangming) April 10, 2026
Shall we use sequential sampling or parallel sampling for test-time-scaling?
We find that parallel sampling typically tends to be better than… pic.twitter.com/3bm4CdfQCk
[2/9] For math and coding, we find that in Qwen3, DeepSeek-R1 distilled models, Gemini 2.5, parallel sampling tends to be better than sequential sampling. pic.twitter.com/cs67ueBzkt
[3/9] Our first hypothesis is that parallel sampling uses an aggregation while sequential sampling always takes the final one. So let's add the same aggregation to parallel sampling. And we found that the performance gap still exists. pic.twitter.com/S5fc8hZ7rX
[4/9] Our second hypothesis is that in sequential sampling, LLMs need to handle extended input context. Therefore, we add irrelevant context with similar or even longer length to parallel sampling. However, this doesn't affect the performance (including the aggregation… pic.twitter.com/X3RAt0WHlQ
[5/9] Our final hypothesis is that in sequential sampling, LLMs become "lazy" to explore alternatives. LLMs become over-confident about their previous solutions and repeat them. Using embedding models, our results indicate that solutions generated through parallel sampling have… pic.twitter.com/mNb6ewjX9g
[6/9] In sequential sampling, we can switch to parallel sampling starting from a certain round. With previous solutions in the context, the performance of parallel sampling even decreases. pic.twitter.com/pK8wyUQIkB
[7/9] Our mech interpretability analysis shows that with previous solutions in the context, there are induction heads inside sequential sampling. LLMs may copy previous solutions to the current generations, thus affecting the explorations. pic.twitter.com/7SW5x3bTd1
[8/9] Our actionable suggestion is to scale environment feedbacks. But we also find that with very strong/high-quality feedbacks, like using hidden test cases for running errors, sequential sampling can keep up with parallel sampling. Another actionable suggestion is to use… pic.twitter.com/uNTrKrk88O
[9/9] Thanks to @sohamde_, @re_rayne, @PetarV_93, and Razvan Pascanu for their guidance and collaboration.
BTW, the paper link is: https://t.co/N1CliqDFpe
