# Curated · X (Twitter) 🔥

> Author: OpenBMB (@OpenBMB) · Platform: X (Twitter) · Date: 2026-05-17

> Original source: https://x.com/OpenBMB/status/2055591416093684055

## Summary

OpenBMB has released MiniCPM-V 4.6 and MiniCPM-o 4.5, two highly efficient multimodal models aimed at edge devices.

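MiniCPM-V 4.6 has only 1.3B parameters, yet it scores 13 on the Artificial Analysis Intelligence Index while cutting token cost to 1/19 that of Qwen3.5-0.8B and 1/43 that of Qwen3.5-0.8B-Thinking. On those numbers, it is presented as the multimodal model best suited to mobile deployment at the moment.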

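**Core architecture and efficiency design**  
MiniCPM-V 4.6 pairs a SigLIP2-400M vision encoder with the Qwen3.5-0.8B LLM. It inherits the single-image, multi-image, and video understanding abilities of the MiniCPM-V series and adopts the LLaVA-UHD v4 technique, which cuts visual-encoding FLOPs by more than 50% and raises token throughput to 1.5 times that of Qwen3.5-0.8B. The model supports mixed 4x/16x visual token compression ratios, so developers can trade accuracy against speed as needed.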
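
To make the 4x/16x trade-off concrete, the sketch below counts how many visual tokens would reach the LLM under each compression ratio. The slice size (448x448) and patch size (14 pixels) are illustrative assumptions, not figures from the source, and `visual_tokens` is a hypothetical helper rather than part of the model's API.

```python
def visual_tokens(num_slices: int, patches_per_slice: int, compression: int) -> int:
    """Visual tokens passed to the LLM after compressing each slice's patch grid."""
    if compression not in (4, 16):
        raise ValueError("only the 4x and 16x ratios are mentioned in the source")
    return num_slices * (patches_per_slice // compression)


# Assumed for illustration only: a 448x448 slice split into 14-pixel patches.
PATCHES_PER_SLICE = (448 // 14) ** 2  # 32 * 32 patches

for ratio in (4, 16):
    tokens = visual_tokens(num_slices=9, patches_per_slice=PATCHES_PER_SLICE, compression=ratio)
    print(f"{ratio}x compression, 9 slices: {tokens} visual tokens")
```

Under these assumptions the 16x setting hands the LLM a quarter as many visual tokens as the 4x setting, which is where the speed gain comes from; the cost is less fine-grained visual detail.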

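**Benchmark results and cross-platform deployment**  
On vision-language benchmarks including OpenCompass, RefCOCO, HallusionBench, MUIRBench, and OCRBench, MiniCPM-V 4.6 performs at the level of Qwen3.5 2B. The edge adaptation code for iOS, Android, and HarmonyOS has been open-sourced, so developers can reproduce the on-device experience on mobile hardware in a few steps.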

**Installation and inference commands**  
```bash
pip install "transformers[torch]>=5.7.0" torchvision torchcodec
# For CUDA 12.8
pip install "transformers>=5.7.0" torchvision torchcodec --index-url https://download.pytorch.org/whl/cu128
```

**Model loading example**  
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
model = AutoModelForImageTextToText.from_pretrained(
    "openbmb/MiniCPM-V-4.6",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",  # needs the flash-attn package; "sdpa" is the usual fallback
)
processor = AutoProcessor.from_pretrained("openbmb/MiniCPM-V-4.6")
```

**Image inference parameters**  
```python
model.chat(
    image=image,
    question="Describe the image.",
    downsample_mode="16x",   # or "4x" to preserve fine detail
    max_slice_nums=36,
    max_new_tokens=512
)
```

**Video inference parameters**  
```python
model.chat(
    video=video_path,
    question="Describe the video.",
    downsample_mode="16x",
    max_num_frames=128,
    stack_frames=1,
    max_slice_nums=1,
    use_image_id=False,
    max_new_tokens=2048
)
```
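
The `max_num_frames=128` cap implies that longer videos are subsampled before encoding. The source does not say how frames are chosen, so the sketch below shows one common approach, uniform sampling at the midpoint of equal-length segments, purely as an assumption. `sample_frame_indices` is a hypothetical helper, not a function from the model's code.

```python
def sample_frame_indices(total_frames: int, max_num_frames: int = 128) -> list[int]:
    """Pick at most max_num_frames frame indices, spread evenly over the video."""
    if total_frames <= 0 or max_num_frames <= 0:
        raise ValueError("both arguments must be positive")
    if total_frames <= max_num_frames:
        # Short video: keep every frame.
        return list(range(total_frames))
    # Split the video into equal segments and take the frame at each segment's midpoint.
    step = total_frames / max_num_frames
    return [int(i * step + step / 2) for i in range(max_num_frames)]


indices = sample_frame_indices(total_frames=3000, max_num_frames=128)
print(len(indices), indices[:3], indices[-1])
```

Midpoint sampling avoids always picking the very first frame of each segment, which is often a scene cut; other strategies such as fixed-FPS sampling would work equally well here.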

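**MiniCPM-o 4.5: omni-modal real-time streaming**  
MiniCPM-o 4.5 has 9B parameters in total and combines SigLIP2, Whisper-medium, CosyVoice2, and Qwen3-8B. It averages 77.6 on OpenCompass, ahead of GPT-4o and Gemini 2.0 Pro. The model supports real-time bilingual (Chinese and English) voice conversation, voice cloning, and role play, and it offers full-duplex multimodal live streaming: it can process continuous video and audio input while producing text and speech output at the same time.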

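**Document parsing and speech generation results**  
On the OmniDocBench end-to-end document parsing task, MiniCPM-o 4.5-Instruct reaches an Overall Edit (EN) score of 0.109 and a Read Order Edit (EN) score of 0.037 (edit-distance metrics, lower is better), ahead of Gemini-3 Flash and GPT-5. For speech generation, it records a CER of 0.86% on seedtts test-zh, a WER of 2.38% on seedtts test-en, and a WER of 3.37% on LongTTS-en, all better than CosyVoice2.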
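
For readers unfamiliar with the speech metrics, the sketch below computes WER (word error rate) and CER (character error rate) as the Levenshtein edit distance divided by the reference length. It is a minimal illustration of the standard definitions; the benchmarks' own text normalization (casing, punctuation, tokenization) is not described in the source and is not reproduced here.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance, spaces ignored."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.3f}")
print(f"CER: {cer('你好世界', '你好世間'):.3f}")
```

CER is the usual choice for Chinese because the text has no whitespace word boundaries, which is why the zh test set reports CER and the en test sets report WER.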

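**Inference efficiency and quantized variants**  
In bf16, the model decodes at 154.3 tokens/s with a first-token latency of 0.6 s and 19.0 GB of GPU memory. In int4, it reaches 212.3 tokens/s with 11.0 GB of memory, which makes it suitable for consumer GPUs. Quantized variants are available in GGUF, BNB, AWQ, and GPTQ formats.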
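
The bf16 and int4 figures can be compared directly with a little arithmetic. The sketch below uses only the throughput and memory numbers quoted above; the 512-token reply length is a hypothetical value chosen for illustration, and first-token latency is left out because the source gives it for bf16 only.

```python
# Figures quoted in the summary.
BF16 = {"tokens_per_s": 154.3, "memory_gb": 19.0}
INT4 = {"tokens_per_s": 212.3, "memory_gb": 11.0}


def decode_seconds(num_tokens: int, tokens_per_s: float) -> float:
    """Pure decoding time in seconds; first-token latency is excluded."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return num_tokens / tokens_per_s


speedup = INT4["tokens_per_s"] / BF16["tokens_per_s"]
memory_saving = 1 - INT4["memory_gb"] / BF16["memory_gb"]

REPLY_TOKENS = 512  # hypothetical reply length
print(f"int4 decode speedup: {speedup:.2f}x")
print(f"int4 memory saving:  {memory_saving:.1%}")
print(f"bf16 decode time:    {decode_seconds(REPLY_TOKENS, BF16['tokens_per_s']):.2f} s")
print(f"int4 decode time:    {decode_seconds(REPLY_TOKENS, INT4['tokens_per_s']):.2f} s")
```

With the quoted figures, int4 decodes about 1.38 times as fast as bf16 and needs roughly 42% less GPU memory.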

**Installation and real-time streaming example**  
```bash
pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils[all]>=1.0.5"
```

**Model initialization**  
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-4_5",
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True,
)
model.eval().cuda()
model.init_tts()
duplex_model = model.as_duplex()
model = duplex_model.as_simplex(reset_session=True)
```

**Full-duplex streaming flow**  
```python
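# video_chunks is not defined in the source; it stands for an iterable of pre-chunked video/audio inputs.
results_log = []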
for chunk in video_chunks:
    result = model.streaming_prefill(session_id="demo", msgs=[chunk], omni_mode=True)
    results_log.append(result)
    if result["is_listen"]:
        print("listen...")
    else:
        print(f"speak> {result['text']}")
model.streaming_generate(session_id="demo", generate_audio=True)
```

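**Multi-chip deployment support**  
FlagOS has released multi-chip builds of MiniCPM-o 4.5 covering six architectures: Nvidia, Hygon, Metax, Iluvatar, Ascend, and Zhenwu. Accuracy differences are kept within 2%, and on Nvidia hardware execution time improves by 6% relative to the CUDA baseline. Users can download prebuilt images directly from FlagRelease without installing dependencies themselves.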

**Open-source resources and license**  
The weights and code are open-sourced under the Apache-2.0 license.  
- Hugging Face: https://huggingface.co/openbmb/MiniCPM-V-4.6  
- GitHub: https://github.com/OpenBMB/MiniCPM-V  
- ModelScope: https://modelscope.cn/models/OpenBMB/MiniCPM-V-4.6  
- Web Demo: https://huggingface.co/spaces/openbmb/MiniCPM-V-4.6-Demo  
- Technical report: https://huggingface.co/papers/2604.27393  
- Cookbook: https://github.com/OpenSQZ/MiniCPM-V-Cookbook  

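The maintainers stress that model-generated content does not represent the developers' views, and that they accept no responsibility for data security issues or risks arising from misuse.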

## Tags

New product, VLM, LLM, Open-source project, OpenBMB, Qwen