KellyBench Shows Frontier AI Models Perform Poorly on Long-Horizon Decision Tasks

General Reasoning

@GenReasoning


𝕏 (Twitter) · April 9, 2026


AI summary (generated by Claude)


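KellyBench is a benchmark designed for long-horizon, non-stationary environments. It tests AI models' sequential decision-making in a real sports betting market, and the results show that no current frontier model can reliably turn a profit.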

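Evaluation setup and challenges
Unlike conventional short-horizon, task-oriented evaluations, KellyBench requires models to operate in a simulated environment that spans a full season (the 2023-24 English Premier League). In each episode a model processes hundreds of millions of tokens and makes 500 to 1,000 tool calls, covering data analysis, machine learning model development, risk management, and dynamic strategy adjustment. The benchmark therefore tests not only technical skill but also long-term planning and adaptation under uncertainty.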
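The risk-management part of the task can be illustrated with a stake-sizing rule. The sketch below uses the classic Kelly criterion, which the benchmark's name presumably alludes to; the thread does not say which staking rules the models actually use, so the function name `kelly_fraction` and its parameters are hypothetical and purely illustrative.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake on one bet under the Kelly criterion.

    p_win: the bettor's own estimate of the probability that the bet wins.
    decimal_odds: total payout per unit staked, stake included (e.g. 2.5).
    Both inputs are assumptions supplied by the caller, not benchmark data.
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")
    b = decimal_odds - 1.0                # net profit per unit staked on a win
    q = 1.0 - p_win                       # probability of losing the stake
    fraction = (b * p_win - q) / b        # Kelly formula: f* = (b*p - q) / b
    return max(fraction, 0.0)             # stake nothing when the edge is negative
```

For instance, `kelly_fraction(0.5, 2.5)` stakes one sixth of the bankroll, while a bet with a negative expected value returns 0. In a non-stationary market the hard part is estimating `p_win`, not applying the formula.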

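Model performance and financial results
The frontier models tested perform poorly in this environment:

- Every frontier model tested finishes the season at a loss, and most go bankrupt.
- The best performers, Claude Opus 4.6 and GPT-5.4, are the only two models that avoid bankruptcy in every run, yet their average returns on investment (ROI) are still -11% and -14%, respectively.
- Models generally lack coherent long-term behavior: they often fail to turn analysis into action, or fail to adjust their strategy in time when the environment changes.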
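To make the reported metrics concrete, here is a minimal sketch of how average ROI and bankruptcy could be summarized over several runs. It assumes ROI is measured against the starting bankroll and that a run counts as bankrupt when the final bankroll is zero or below; the thread does not give the exact definitions, and `summarize_runs` is a hypothetical helper.

```python
from statistics import mean


def roi(initial_bankroll: float, final_bankroll: float) -> float:
    """Return on investment relative to the starting bankroll (assumed definition)."""
    if initial_bankroll <= 0:
        raise ValueError("initial_bankroll must be positive")
    return (final_bankroll - initial_bankroll) / initial_bankroll


def summarize_runs(initial_bankroll: float, final_bankrolls: list[float]) -> dict:
    """Aggregate season-end bankrolls from repeated runs of one model."""
    if not final_bankrolls:
        raise ValueError("need at least one run")
    rois = [roi(initial_bankroll, final) for final in final_bankrolls]
    ruined = [final <= 0 for final in final_bankrolls]  # assumed ruin threshold
    return {
        "mean_roi": mean(rois),
        "ruin_rate": sum(ruined) / len(ruined),
        "ever_ruined": any(ruined),
    }
```

Under these assumptions, a model "avoids ruin" when `ever_ruined` is false, which is compatible with a negative `mean_roi`, as reported for the two best models.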

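Strategy sophistication and room for improvement
To rule out backtest variance as an explanation, the research team consulted quantitative betting experts and built a 44-item "Sophistication Score" rubric to evaluate each model's strategy. They found:

- Every model's strategy sophistication is low; the top scorer, Claude Opus 4.6, reaches only 32.6%.
- Sophistication score correlates positively with ROI and negatively with bankruptcy rate, which indicates substantial room for improvement.
- Giving models relevant literature or more advanced coding tools did not significantly reduce their losses, which points to a systematic capability gap on long-horizon, open-ended objectives.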
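The rubric analysis can be sketched as two small steps: turn per-item rubric grades into a percentage, then correlate that percentage with ROI across models. The code assumes each rubric item is graded between 0 and 1 and that the correlation is a plain Pearson coefficient; neither detail is stated in the thread, and both function names are hypothetical.

```python
from math import sqrt


def sophistication_score(item_grades: list[float]) -> float:
    """Percentage score from per-item rubric grades, each assumed to lie in [0, 1]."""
    if not item_grades:
        raise ValueError("rubric must have at least one item")
    if any(not 0.0 <= grade <= 1.0 for grade in item_grades):
        raise ValueError("each grade must be in [0, 1]")
    return 100.0 * sum(item_grades) / len(item_grades)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)
```

With one score and one mean ROI per model, `pearson(scores, rois)` would be positive and `pearson(scores, ruin_rates)` negative if the pattern described above holds. Any numbers fed into this sketch would need to come from the paper itself.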

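Reflections on AI evaluation trends
The research team argues that the AI field is overly focused on procedural tasks and neglects the real-world need for long-horizon decision-making under uncertainty. They call for evaluation culture to shift from fixed task sets to "Complex Worlds": simulated environments that require continual learning and adaptation, used to test what agents can actually do. The study is more than a test on betting markets; it is also a stark warning about how ready agent systems are for real-world deployment.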

🎲 Introducing KellyBench, a new long-horizon evaluation for frontier models.

KellyBench evaluates models within a year-long sports betting market, a challenging and highly non-stationary environment.

Every frontier model we test loses money. They struggle to design ML… pic.twitter.com/g7SCvyy3Va
— General Reasoning (@GenReasoning) April 9, 2026

Links for the release post and paper:

- Release: https://t.co/Vu0qVlaSdg
- Paper: https://t.co/T09rjY1lSF

Analysis continues in the thread below.
— General Reasoning (@GenReasoning) April 9, 2026

The dominant evaluation paradigm is task centric, short-horizon, and sparse reward.

KellyBench is a different beast.

Models on KellyBench take 500-1000 tool calls to complete an episode, use hundreds of millions of tokens, and receive dense rewards each week.

It's also… pic.twitter.com/V6oPkUk3qA
— General Reasoning (@GenReasoning) April 9, 2026

📉 Frontier models struggle on KellyBench.

We find there is a large gap between the ability to develop machine learning strategies and the ability to execute and adapt them over long horizons.

Only Opus 4.6 and GPT-5.4 manage to avoid ruin. Every other model goes bankrupt at least once… pic.twitter.com/FnPz8sUZov
— General Reasoning (@GenReasoning) April 9, 2026

One potential critique of our setup is that betting markets are highly efficient, and models might be doing the best they can with the data they are given.

But we constructed expert rubrics and found models were unsophisticated and heavily underutilising available data. The best… pic.twitter.com/r4Sq7yuQbg
— General Reasoning (@GenReasoning) April 9, 2026

What is the implication of our results?

We think better performance is possible on KellyBench with more inference compute. For example evolutionary "autoresearch" approaches to feature development, or multi-agent scaffolds that simulate different roles in a quantitative fund.…
— General Reasoning (@GenReasoning) April 9, 2026