UPDATED JUNE 30, 2026

The Frontier

No Priors: Artificial Intelligence | Technology | Startups
  • · 3d ago

    Noam Brown argues that model evaluation benchmarks are broken because they ignore test-time compute. Performance now depends on inference budget, but published grids use single-number scores without a token or cost axis.

  • · 3d ago

    Brown says GPT-3 could not effectively scale test-time compute. With a $10 million budget, GPT-3 performed nearly the same as with $1, so inference spend had almost no effect on what it could do.

  • · 3d ago

    Modern models like GPT-4.5 can improve for weeks before performance plateaus on some benchmarks. Brown says you cannot reasonably test until plateau, so evaluations must adopt a fixed budget or performance curve.

  • · 3d ago

    Brown cites OpenAI's internal disproof of the Erdős unit distance conjecture as an example of latent capability. He claims GPT-4.5 could have solved it with $100,000 in compute via scaffolding, but nobody explored that budget.

  • · 3d ago

    Safety frameworks such as responsible scaling policies do not account for test-time compute, Brown states. A model's dangerous capabilities are now a function of inference budget rather than a fixed property, which leaves an unaddressed evaluation gap.

  • · 3d ago

    Brown says current models are not at the point where an infinite inference budget yields superintelligence across all tasks. Factual retrieval plateaus quickly, while Sudoku improves indefinitely, so tasks fall along a spectrum in how much they benefit from extra compute.

  • · 3d ago

    Brown uses building a poker bot as a personal evaluation. GPT-4.2 required steering but improved his code 10x; GPT-4.5 can now build a full solver with minimal guidance.

  • · 3d ago

    Brown says models lack research taste and cannot replace researchers yet. They accelerate some tasks 100x but bottleneck others, leading to gradual transformation rather than overnight replacement.

  • · 3d ago

    Large-scale test-time compute makes fast takeoff unlikely, Brown argues. Time becomes a bottleneck because models need long runtimes to unlock full capability, preventing an instantaneous intelligence explosion.

  • · 3d ago

    Brown states frontier labs understand the stakes and risks of their models. Competitive pressure exists, but researchers share a goal of steering toward positive outcomes.

  • · 3d ago

    Brown says he trusts GPT-4.5 outputs for high-stakes decisions like tax advice and condo paperwork more than expert human advice, indicating a shift in practical reliability.

  • · 3d ago

    Multi-agent systems are only scratching the surface of their capability, Brown believes. Frontier models are needed to unlock their potential, much as human civilization builds on its accumulated scaffolding of knowledge.

  • · 3d ago

    Benchmarks are easy to game by scaffolding multiple model runs or using judges, Brown warns. Doing so inflates scores without controlling for test-time compute, which makes comparisons misleading.

  • · 3d ago

    Brown argues the research community agrees benchmarks should include a cost axis but is stuck in a bad equilibrium. Everyone publishes the grid because everyone expects it, despite knowing it's wrong.

End of 7-day results — 14 results