- 1d ago
Blitzy reported a 66.5% performance score on SWE-bench Pro, outperforming GPT 5.4's 57.7%, demonstrating how a sophisticated harness and context infrastructure can surpass raw model capability.
- 2d ago
Whittemore lists nine frontier AI models released in the last 90 days, including GPT 5.2 Codex, Genie 3, Opus 4.6, and GPT 5.4, noting that no single model wins all benchmarks.
- 5d ago
Muse Spark scored 52.4 on SWE-bench Pro and 42.8 on Humanity's Last Exam, competitive with but trailing models like Opus 4.6 and GPT 5.4. Its visual reasoning score of 86.4 on CharViC is state-of-the-art.
- 5d ago
Z.ai's open-source GLM 5.1 model, with 754 billion parameters, scored 58.4 on SWE-bench Pro, outperforming GPT 5.4 and Opus 4.6. This marks the first time a leading Western model has been overtaken on a coding benchmark by an open-source release.