Noam Brown argues that model evaluation benchmarks are broken because they ignore test-time compute. Performance now depends on inference budget, but published results grids report single-number scores with no token or cost axis.
Brown says GPT-3 could not effectively scale test-time compute. With a $10 million budget, GPT-3 performed nearly the same as with $1, so inference cost was irrelevant when measuring its capabilities.
Modern models like GPT-4.5 can keep improving for weeks before performance plateaus on some benchmarks. Brown says you cannot reasonably run every test until it plateaus, so evaluations must either fix a budget or report a performance curve.
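The two options named above, a fixed budget or a performance curve, can be sketched as follows. This is an illustration under stated assumptions, not Brown's method: the function names `score_at_budget` and `plateau_cost`, the 0.01 tolerance, and every number in `TOY_CURVE` are made up for the example.

```python
from bisect import bisect_right

# Hypothetical (cost in USD, benchmark score) points, sorted by cost.
# These numbers are invented for illustration; they are not measurements.
TOY_CURVE = [(1, 0.20), (10, 0.35), (100, 0.50), (1000, 0.58), (10000, 0.585)]

def score_at_budget(curve, budget):
    """Best score reachable while spending at most `budget`.

    `curve` must be a list of (cost, score) pairs sorted by cost.
    Returns None when even the cheapest run exceeds the budget.
    """
    costs = [cost for cost, _ in curve]
    n_affordable = bisect_right(costs, budget)
    if n_affordable == 0:
        return None
    return max(score for _, score in curve[:n_affordable])

def plateau_cost(curve, tol=0.01):
    """Smallest cost after which no later point improves by more than `tol`.

    Returns None for an empty curve. If the curve never flattens,
    this returns the cost of the last point.
    """
    for i, (cost, score) in enumerate(curve):
        if all(later - score <= tol for _, later in curve[i:]):
            return cost
    return None

if __name__ == "__main__":
    print(score_at_budget(TOY_CURVE, 500))  # fixed-budget report
    print(plateau_cost(TOY_CURVE))          # where this toy curve flattens
```

Reporting `score_at_budget` at one agreed budget gives a single comparable number, while publishing the whole curve shows how much capability sits beyond that budget.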
Brown cites OpenAI's internal disproof of the Erdős unit distance conjecture as an example of latent capability. He claims GPT-4.5 could have solved it with $100,000 in compute via scaffolding, but nobody explored that budget.
Safety frameworks like responsible scaling policies do not account for test-time compute, Brown states. A model's dangerous capabilities are now a function of budget rather than a fixed property, creating an unaddressed evaluation gap.
Brown says current models are not at the point where infinite inference budget yields superintelligence across all tasks. Factual retrieval plateaus quickly, while Sudoku improves indefinitely, so tasks fall along a spectrum of how well they scale with compute.
Brown uses poker bot creation as a personal evaluation. GPT-4.2 required steering but improved his code 10x; GPT-4.5 can now build a full solver with minimal guidance.
Brown says models lack research taste and cannot replace researchers yet. They accelerate some tasks 100x while other tasks remain bottlenecks, leading to gradual transformation rather than overnight replacement.
Large-scale test-time compute makes fast takeoff unlikely, Brown argues. Time becomes a bottleneck because models need long runtimes to unlock full capability, preventing an instantaneous intelligence explosion.
Brown states frontier labs understand the stakes and risks of their models. Competitive pressure exists, but researchers share a goal of steering toward positive outcomes.
Brown says he trusts GPT-4.5 outputs for high-stakes decisions like tax advice and condo paperwork more than expert human advice, indicating a shift in practical reliability.
Multi-agent systems have only scratched the surface of their capability, Brown believes. Frontier models are needed to unlock their potential, and he compares this to the scaffolding that human civilization's accumulated knowledge provides.
It is easy to game benchmarks by scaffolding multiple model runs or using judges, Brown warns. This inflates scores without controlling for test-time compute, making comparisons misleading.
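A little arithmetic shows how large this inflation can be. The sketch below assumes every sample is independent with the same single-sample accuracy `p`, and that the judge is perfect (for `pass_at_k`) or that the task has a binary answer decided by strict majority with an odd number of samples (for `majority_vote_accuracy`). Real scaffolds will not meet these assumptions exactly, and the function names are hypothetical.

```python
from math import comb

def pass_at_k(p, k):
    """Chance that at least one of k independent samples is correct.

    This is the score a perfect judge would report when picking
    among k samples, so it is an upper bound on judge-based scaffolds.
    """
    return 1.0 - (1.0 - p) ** k

def majority_vote_accuracy(p, n):
    """Chance that a strict majority of n independent samples is correct.

    Assumes a binary-answer task and an odd n, so ties cannot occur.
    """
    return sum(
        comb(n, j) * p**j * (1 - p) ** (n - j)
        for j in range(n // 2 + 1, n + 1)
    )

def cost_report(p, ks):
    """Pair each sample count k (a k-fold cost multiplier) with its score."""
    return [(k, round(pass_at_k(p, k), 3)) for k in ks]

if __name__ == "__main__":
    for k, score in cost_report(0.3, [1, 2, 4, 8]):
        print(f"{k}x cost -> reported score {score}")
    print(round(majority_vote_accuracy(0.6, 5), 5))
```

Under these assumptions, a model with 0.3 single-sample accuracy reports about 0.94 with eight samples and a perfect judge, at eight times the cost. A grid that lists only the 0.94 hides that multiplier.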
Brown argues the research community agrees benchmarks should include a cost axis but is stuck in a bad equilibrium. Everyone publishes the grid because everyone expects it, despite knowing it's wrong.