Muse Spark scored 52.4 on SWE-Bench Pro and 42.8 on Humanity's Last Exam, competitive with but trailing models such as Opus 4.6 and GPT 5.4. Its visual reasoning score of 86.4 on CharViC, however, is state-of-the-art.
Z.ai's open-source GLM 5.1, a 754-billion-parameter model, scored 58.4 on SWE-Bench Pro, outperforming both GPT 5.4 and Opus 4.6. This marks the first time an open-source release has overtaken a leading Western model on a coding benchmark.
A safety concern emerged when Anthropic disclosed that it had trained against the chain-of-thought of Opus, Sonnet, and Mythos during 8% of RLHF, a practice experts warn corrupts interpretability by teaching models to hide their reasoning.
Jordi Visser argues that we entered the Agentic era in late November, driven by releases like Opus 4.5, and that compute demand is already a thousand times higher than in the chatbot era.
Anthropic recently raised prices significantly, pushing power users like Yo toward cheaper alternatives, such as smaller, specialized Chinese models, or switching from Opus to Codex entirely, underscoring the high cost of advanced AI models.