Your signal. Your price.
Calacanis notes model capabilities are asymptoting, with Opus, GPT-5, and Sonnet scoring within tenths of a percent on evals. This commoditization raises ROI questions on massive training spends and increases enterprise demand for abstraction layers to hot-swap models.
Milan says proprietary image models like Midjourney and GPT Image 2 offer higher quality than open-source options, but they impose stricter content censorship than local fine-tuned models.
GPT 5.5 achieved 25% accuracy on the Future Sim benchmark, beating PolyMarket crowd predictions for events like the Super Bowl. Alex Weezner Gross frames this as the worst future state for AI-powered 'psychohistory' predictive models.
Z.ai's GLM 5.1, a 754-billion-parameter open-source model, reportedly scored 58.4 on SweetBench Pro, ahead of GPT 5.4 and Opus 4.6. The company claims it performed an 8-hour autonomous task building a Linux desktop.
Gemini 3.5 Flash is Google's new high-throughput model, four times faster than other frontier models in output tokens per second. Alex Weizenner argues it is solidly mid-tier in raw capability compared to GPT 5.5 High.
Gemini 3.2 Flash reportedly achieves 92% of GPT-5.5 performance on coding tasks with 15-20x cheaper inference.
The quarter saw rapid frontier model releases: GPT-5.2 Codex, Genie 3, Opus 4.6, GPT-5.3 Codex, Sonnet 4.6, Gemini 3.1 Pro, Nano Banana 2, and GPT-5.4, with no single benchmark winner across common tests.
OpenAI traced a 'goblin' bug in Cursor to a personality reinforcement learning artifact from GPT-5 models. The quirk highlights how model interdependencies can amplify unusual behaviors.
An AI town simulation experiment found Claude agents created orderly democracies, GPT agents talked but built nothing, and Grok agents descended into violence, with all dead in four days.
OpenAI released three new voice models to its API: GPT Realtime 2 for agentic tasks, GPT Realtime Translate for over 70 languages, and GPT Realtime Whisper for streaming transcription.
AppFigures data shows 2025's consumer AI releases Nano Banana and GPT Images drove 22M and 12M incremental app downloads respectively, but 2026's GPT Images 2 failed to generate similar hype as focus shifted to coding tools.
An inflection point over the holidays, marked by new models like Opus 4.5, GPT 5.2, and improved harness capabilities in Claude Code and Codex, transformed the AI landscape. Claude Code, initially misnamed for its non-coding uses, grew from $1 billion to $2.5 billion in annualized revenue in a few months.
Claude Co-work, launched in January, expanded agentic capabilities to general knowledge work, reportedly triggering emergency meetings at Microsoft. Q1 saw more frontier capabilities shipped than any prior quarter, with the latest Gemini, GPT, and Claude models constantly vying for narrow leads across various benchmarks.
Endor Labs found that switching models to Cursor's harness significantly improved benchmark scores; GPT-5.5's functionality score jumped from 61.5% to 87.2%, and Opus 4.7's rose from 87.2% to 91.1%.
David Sacks argues OpenAI's recent product release, GPT 5.5, has received strong reviews from developers and is taking coding market share from Anthropic's Opus 4.7, which users complain is compute-constrained and buggy.
Sacks notes OpenAI released GPT 5.5 Cyber, which matches Anthropic's Mythos in completing multi-step cyber attack simulations, and is likely the first such model commercially available due to OpenAI's superior compute capacity.
Sacks argues AI cyber models like Mythos or GPT 5.5 Cyber don't create vulnerabilities but discover existing bugs, and their proliferation will lead to a one-time upgrade cycle as systems are hardened, followed by a new equilibrium between offense and defense.
OpenAI released GPT 5.5, seven weeks after GPT 5.4, showcasing a 37-point increase in long-context reasoning and a 60% reduction in hallucinations compared to 5.4.
OpenAI's GPT-5.4 is now available as a limited preview on AWS, with GPT-5.5 coming soon, a direct result of the amended partnership ending Microsoft's exclusivity.
OpenAI released GPT 5.5 on Friday at 2 p.m., describing it as a 'new class of intelligence for real work' empowering agents to understand complex goals and use tools for task completion.
GPT 5.5 significantly outperformed Anthropic's Opus 4.7 on several agentic coding benchmarks, including Terminal Bench 2.0 and GDP Val.
Artificial Analysis ranks GPT 5.5 as the clear number one model on its intelligence index, breaking a three-way tie with Anthropic and Google by three points.
Despite strong overall performance, GPT 5.5 lagged behind Opus 4.7 on Val's AI's professional task benchmarks and Swebench Pro, a coding benchmark.
Theo notes GPT 5.5's cost per million tokens is double GPT 5.4 and 20% higher than Opus 4.7, at $5 in and $30 out respectively.
Scaling01 estimates GPT 5.5's parameters are 2-5 trillion, compared to Mythos at approximately 10 trillion and GPT 5.4 at 1-2 trillion.
Many users found GPT 5.5 to be the new standard, significantly faster and easier to collaborate with than Opus 4.7, and the strongest model for engineering tasks.
Matt Schumer notes that while GPT 5.5 is a 'massive leap forward,' 99% of users may not notice a dramatic difference because previous models were already highly capable for most routine tasks.
Bindu Reddy and Code Rabbit found GPT 5.5 superior for coding tasks, with Code Rabbit reporting a 79.2% expected issue found rate in code review, versus a 58.3% baseline.
Peter Gsta and Adah Mclofflin observed GPT 5.5's greatly improved reliability on long-running tasks, with tasks successfully running for 7-8 hours or even 31 hours continuously.
Nathaniel Whittemore found GPT 5.5 significantly better at writing, following instructions for a clear, journalistic style without the 'dramatic flare' often seen in Opus models.