# Curated · X (Twitter) 🔥🔥🔥

> 📖 Full site content index (documentation index): [llms.txt](/llms.txt)

> Author: Vals AI (@ValsAI) · Platform: X (Twitter) · Date: 2026-06-17

> Original source: https://x.com/ValsAI/status/2066997518089801784

## Summary

Vals AI has released a code migration benchmark for evaluating language models.

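**Core evaluation goal**
Code migration is more than a technical conversion; it carries significant economic weight. Banks, payroll systems, and government services worldwide still depend heavily on legacy languages such as COBOL, so it is critical that an AI migrating code does not produce output that "looks right but quietly loses key behavior." Vals AI's benchmark uses a strict offline "sandbox" environment to verify whether a model can genuinely rebuild the program logic rather than perform only a surface-level language translation.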

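**Test methodology and anti-cheating measures**
To keep the evaluation fair, the benchmark uses several layers of anti-cheating and verification measures:
- **Behavior-based scoring**: Submissions are not compared for code similarity. Instead, a hidden behavioral test suite checks whether the migrated program passes functional verification.
- **Strict anti-cheating checks**: The system automatically assigns a score of zero for any of the following:
    - Merely wrapping the original executable.
    - Copying reference artifacts.
    - Submitting code in the wrong target language.
    - Hardcoding expected outputs.
- **Task types**: "Modern-to-modern migrations" (among languages such as Python, Java, Kotlin, Rust, and C++) and "COBOL to Java" legacy modernization tasks.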
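
The scoring rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Vals AI's actual harness: the `Submission` type, the violation labels, and the `score` function are hypothetical names, and a plain Python callable stands in for running the migrated build inside the sandbox.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

# Hypothetical labels for the four zero-score conditions listed above.
DISQUALIFYING = {
    "wraps_original_binary",
    "copies_reference_artifacts",
    "wrong_target_language",
    "hardcodes_expected_output",
}


@dataclass
class Submission:
    # Stand-in for executing the migrated build: takes input, returns output.
    run: Callable[[str], str]
    # Violations flagged by some (assumed) upstream anti-cheating check.
    violations: Set[str] = field(default_factory=set)


def score(submission: Submission, hidden_cases: Dict[str, str]) -> float:
    """Return the fraction of hidden behavioral cases passed, or 0.0 on a violation."""
    if submission.violations & DISQUALIFYING:
        return 0.0  # any flagged violation zeroes the whole submission
    if not hidden_cases:
        return 0.0
    passed = 0
    for program_input, expected in hidden_cases.items():
        try:
            ok = submission.run(program_input) == expected
        except Exception:
            ok = False  # a crash counts as a failed case
        passed += ok
    return passed / len(hidden_cases)
```

Note that the score depends only on observed behavior: two submissions with completely different source code receive the same score if they produce the same outputs on the hidden cases.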

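**Model performance and cost analysis**
According to the [Vals AI benchmark page](http://vals.ai/benchmarks/code-migration), models show a marked trade-off between accuracy and cost:
- **Performance leader**: `Claude Fable 5` ranks first overall with a 55% pass rate and stands out on CLI migration tasks (60.1%), but it costs as much as $115.43 per test.
- **Cost-effective choice**: `GPT 5.5` reaches a 45% pass rate at only $6.44 per test, making it the most cost-effective option at present.
- **Open-source performance**: `Kimi K2.6` is the best-performing open-source model, with a 28% pass rate at $5.12 per test, and it ranks above some frontier models.
- **Quality versus correctness**: The study found that "correctness" and "code quality" are not fully coupled. For example, `Claude Opus 4.8` has a 47% pass rate, yet its code quality score (4.6/5 on average) is the highest of all models.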
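
One rough way to compare these figures is cost per percentage point of pass rate. The sketch below uses only the numbers quoted above; the metric itself is an assumption made for illustration and is not one Vals AI reports. `Claude Opus 4.8` is left out because no cost is given for it here.

```python
# (pass rate in %, cost per test in USD), as quoted in the summary above.
results = {
    "Claude Fable 5": (55.0, 115.43),
    "GPT 5.5": (45.0, 6.44),
    "Kimi K2.6": (28.0, 5.12),
}


def cost_per_point(pass_rate: float, cost: float) -> float:
    """USD spent per percentage point of pass rate (illustrative metric)."""
    return cost / pass_rate


# Rank models from cheapest to most expensive per point of pass rate.
ranked = sorted(results, key=lambda model: cost_per_point(*results[model]))

for model in ranked:
    pass_rate, cost = results[model]
    print(f"{model}: ${cost_per_point(pass_rate, cost):.2f} per point")
```

On these figures `GPT 5.5` comes out cheapest (about $0.14 per point), followed by `Kimi K2.6` (about $0.18) and `Claude Fable 5` (about $2.10), which matches the summary's description of `GPT 5.5` as the most cost-effective option.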

**Industry context and takeaways**

<video src="https://pub-75d4fe1e4e80421b9ecb1245a7ae0d1a.r2.dev/curated/1781664891416-pkcyrjr3.mp4" poster="https://pub-75d4fe1e4e80421b9ecb1245a7ae0d1a.r2.dev/curated/27b724eb40672988.jpg" controls playsinline preload="metadata" style="max-width:100%;height:auto;display:block;margin:1rem 0"></video>
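> Three speakers discuss benchmarking AI model performance on code migration tasks.

As the related technical discussion emphasizes, code migration is a key challenge in maintaining modern software infrastructure. From historical mainframe systems to modern applications, AI's role has shifted from simply writing code to helping enterprises handle high-risk system modernization. Although `Claude Fable 5` currently shows strong migration ability on complex tasks, balancing system stability against migration cost remains a core question for enterprises adopting AI for automated migration.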

## Media

**Three speakers discuss benchmarking AI model performance on code migration tasks.**

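**Prompts and on-screen actions in the video**

Steps:

1. (00:10) The screen shows a 1965 IBM System/360 terminal interface
2. (00:16) The screen shows a 1971 IBM 3270 terminal interface
3. (00:29) The screen shows a 1987 Macintosh II interface
4. (00:45) The screen shows a 1995 Windows 95 interface
5. (01:41) The screen shows a 2004 Windows Media Player interface
6. (02:00) The screen shows a 2012 iPhone 5 interface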

**Transcript**

- `00:01` A few months ago, one blog post about LLMs migrating code cost IBM tens of billions of dollars.
- `00:07` Its worst stock drop since 2000.
- `00:11` Code is the backbone of modern infrastructure, and it needs to be constantly changing.
- `00:17` Code migrations are about preserving the functionality of code while changing the underlying language.
- `00:21` The risk is that a model might miss an edge case, ship code that looks right,
- `00:25` but then actually breaks the system in production when it matters most.
- `00:31` I'm Ryan.
- `00:32` I'm Dohan.
- `00:32` I'm Hung.
- `00:33` From Vals AI.
- `00:34` We measure models' ability to do real work.
- `00:37` Code migration is one of the clearest tests.
- `00:39` Can a model migrate a system without losing what made it work?
- `00:42` That's why we built this benchmark.
- `00:45` So we take a real codebase, give it to the model along with the build tools and executable binary,
- `00:52` and ask it to migrate the whole codebase into a new language.
- `00:56` It has to return a working build that can be run on a modern environment.
- `01:01` We test modern tasks like migrating C++ to Rust all the way to legacy migrations of COBOL to Java.
- `01:07` We score two things.
- `01:08` First, we verify that the code still works, checking against cases that the model never saw.
- `01:12` And second, is the code any good?
- `01:14` Is the final product a good baseline for new features, or is it slop that no engineer will ever touch?
- `01:20` We saw a clear pattern.
- `01:22` Legacy-to-modern migrations are where they tend to break because there's far less data.
- `01:27` So models trip up on things like database integration, protocol semantics, and business logic.
- `01:33` This is where the stakes are the highest.
- `01:36` From modern to modern migrations, models do better because there's more training data.
- `01:40` But there's still a step here to climb.
- `01:44` AI will write the code of the future.
- `01:46` The question here is whether they can maintain the systems that the world already relies on.

## Tags

Benchmark, New Product, LLM, AIGC, Vals AI