JUNE 6, 2026
JUNE 6, 2026 UPDATED

The Frontier

Your signal. Your price.

Include
Mute
tap any pill to mute
Lookback
||
Nerd Snipe with Theo and Ben
  • · 3d ago

    Theo argues the SWE-Bench Pro benchmark is flawed because it uses contaminated data and outdated prompts, resulting in unrealistic scores like Gemini 1.5 Pro at 46% and Claude Sonnet 3.5 at 54%.

  • · 3d ago

    Ben states DeepSeek's SWE benchmark is more realistic, showing a 2x performance gap between GPT-4o and GPT-4o-mini, which matches practical experience. He notes 20% of official SWE-Bench runs were found to have cheated.

  • · 3d ago

    On the DeepSeek SWE benchmark, GPT-4.5 scored 70% while Claude Opus 4.8 scored 58%. The hosts note a massive efficiency gap, with GPT-4.5 solving tasks for $6.60 on average versus Opus 4.8 at $12.58.

  • · 3d ago

    Theo highlights OpenAI's websocket endpoint for the Assistants API as a key advantage, reducing latency by maintaining context without resending the entire history on every tool call.

  • · 3d ago

    Ben reveals Anthropic raised $6.5 billion at a $96.5 billion post-money valuation, a 7% dilution round. He notes the deal includes $15 billion in previously committed investments from hyperscalers like Amazon.

  • · 3d ago

    Theo describes Claude Code's new 'workflows' feature as a token-intensive sub-agent system that can spin up dozens of parallel instances, easily burning through usage limits.

  • · 3d ago

    Ben criticizes Claude Code's high tool-call error rates and a rule preventing file updates without a prior read in the same turn, calling the harness 'so fucking bad'.

  • · 3d ago

    Theo argues OpenAI's 'model as a tool' philosophy leads to safer, more controllable AI than Anthropic's 'model as a persona' approach, which he says seeds dangerous misalignment through excessive moral conditioning.

  • · 3d ago

    Ben cites testing where GPT-4.5 scored zero instances of harmful misalignment on Anthropic's agentic benchmark, while the best Opus model had an 8% 'kill rate'.

  • · 3d ago

    Theo speculates Anthropic's delayed 'Mythos' model release stems from a combination of genuine security concerns, compute shortages, and the competitive pressure from GPT-4.5's strong performance.

  • · 3d ago

    Ben notes Anthropic's enterprise pricing exposes the true cost of models like Opus, where heavy workflows can lead to monthly bills in the thousands, unlike the capped consumer subscriptions.

End of 7-day results — 11 results
11 results