Aviro is introducing Ebla, a state-of-the-art grounded reasoning model.
— hud (@hud_evals) March 13, 2026
In collaboration with HUD, the Aviro team built C⁴ — a benchmark for long-horizon tasks in corporate document sets. We evaluate four dimensions: Correctness, Completeness, Composition, and Citations… pic.twitter.com/BobNoXQbm4
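For concreteness, here is a minimal sketch of how a per-task score over those four dimensions might be aggregated. The thread names the dimensions but not the rubric, so the field definitions, the 0-to-1 scale, and the equal weighting below are assumptions, not the published methodology.

```python
from dataclasses import dataclass

@dataclass
class C4Score:
    # One sub-score per C4 dimension, assumed here to lie in [0, 1].
    correctness: float   # does the final answer match the reference?
    completeness: float  # were all parts of a multi-part question addressed?
    composition: float   # was evidence combined correctly across documents?
    citations: float     # do the cited passages actually support the answer?

    def total(self) -> float:
        # Equal weighting is an assumption; the actual rubric is unpublished.
        return (self.correctness + self.completeness
                + self.composition + self.citations) / 4

print(C4Score(1.0, 0.5, 1.0, 0.75).total())  # 0.8125
```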
The best frontier model, Claude Opus 4.6, scored 20.1%.
Only 6.1% of frontier task-model pairs were full solves. These tasks are still hard for every model we tested.
The base GPT-OSS-120b scored 7.1% on C⁴. Post-training pushed it to 25.4%, a gain of +18.3 points. The biggest jump was on completeness, as the model learned to break multi-part questions into tractable searches.
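The completeness gain is described as the model learning to split multi-part questions into separate searches. As a toy illustration of that behavior (the splitting heuristic and the example question are invented here, not taken from the benchmark):

```python
def decompose(question: str) -> list[str]:
    # Naive heuristic: split a compound question on " and " into
    # sub-questions, each of which becomes one tractable corpus search.
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p + "?" for p in parts if p]

question = "What was Q3 revenue and who approved the restated figures?"
for sub in decompose(question):
    print(sub)
# What was Q3 revenue?
# who approved the restated figures?
```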
The frontier failure modes were also pretty consistent:
1) confident answers when the corpus does not contain the requested information (see the sketch after this list)
2) visual misreads on diagrams and org charts
3) invented intermediate values in cross-document arithmetic
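One way to guard against the first failure mode is to check whether retrieval surfaced any supporting evidence before committing to an answer. The retrieve() stub and relevance threshold below are hypothetical illustrations, not Ebla's actual mechanism.

```python
UNANSWERABLE = "The corpus does not contain this information."

def retrieve(query: str) -> list[tuple[str, float]]:
    # Stub retriever returning (passage, relevance) pairs; here it
    # simulates a query the corpus cannot answer.
    return []

def answer(query: str, min_relevance: float = 0.5) -> str:
    hits = [p for p, score in retrieve(query) if score >= min_relevance]
    if not hits:
        return UNANSWERABLE  # abstain instead of answering confidently
    return f"Answer grounded in {len(hits)} passage(s)."

print(answer("What is the CFO's middle name?"))  # abstains
```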
Ebla was not just better: it was cheaper and better calibrated.
It ran the full 40-task benchmark for $1.10 total inference cost vs. $24.74 for Opus 4.6, and it learned to answer under partial evidence instead of either fabricating or refusing.
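Taking the quoted figures at face value, the per-task costs work out as follows (simple arithmetic on the numbers above, not additional data):

```python
ebla_total, opus_total, tasks = 1.10, 24.74, 40

print(f"Ebla per task: ${ebla_total / tasks:.4f}")       # $0.0275
print(f"Opus per task: ${opus_total / tasks:.4f}")       # $0.6185
print(f"cost ratio:    {opus_total / ebla_total:.1f}x")  # 22.5x
```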
To learn more, check their writeup…
