from No Priors: Artificial Intelligence | Technology

3d ago

Noam Brown argues that model evaluation benchmarks are broken because they ignore test-time compute. Performance now depends on inference budget, but published benchmark grids report a single score per model, with no token or cost axis.

Brown says GPT-3 could not effectively scale test-time compute. Given a $10 million inference budget, GPT-3 performed nearly the same as it did with $1, so cost was irrelevant when measuring its capabilities.

Modern models like GPT-4.5 can keep improving for weeks of runtime before performance plateaus on some benchmarks. Brown says it is not practical to run every test until the plateau, so evaluations must either fix a budget or report a performance curve.
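
As a rough illustration of what reporting a performance curve instead of a single score could look like, here is a minimal Python sketch. The `RunResult` type, the `accuracy_at_budget` helper, and the dollar figures are all hypothetical and invented for this example; they are not taken from the episode or from any real evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    correct: bool    # whether the final answer was right
    cost_usd: float  # inference spend for this attempt (hypothetical units)


def accuracy_at_budget(results_by_budget):
    """Map each per-task budget cap to (accuracy, mean cost actually spent)."""
    curve = {}
    for budget, runs in sorted(results_by_budget.items()):
        accuracy = sum(r.correct for r in runs) / len(runs)
        mean_cost = sum(r.cost_usd for r in runs) / len(runs)
        curve[budget] = (accuracy, mean_cost)
    return curve


# Made-up results for illustration only, not measurements of any real model.
example = {
    1.0: [RunResult(False, 0.5), RunResult(True, 1.0)],
    10.0: [RunResult(True, 8.0), RunResult(True, 10.0)],
}

curve = accuracy_at_budget(example)
for budget, (accuracy, mean_cost) in curve.items():
    print(f"budget cap ${budget:.2f}: accuracy={accuracy:.2f}, mean spend=${mean_cost:.2f}")
```

Each row pairs a score with the budget it was obtained under, so two models can be compared at the same spend rather than by one headline number.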

Brown cites OpenAI's internal disproof of the Erdős unit distance conjecture as an example of latent capability. He claims GPT-4.5 could have solved it with $100,000 of compute and scaffolding, but nobody tried spending at that scale.

Safety frameworks such as responsible scaling policies do not account for test-time compute, Brown states. A model's dangerous capabilities are now a function of inference budget rather than a fixed property, which leaves an unaddressed evaluation gap.

Brown says current models are not at the point where an infinite inference budget yields superintelligence across all tasks. Factual retrieval plateaus quickly, while Sudoku improves indefinitely, so tasks fall on a spectrum of how much they benefit from extra compute.

Brown uses building a poker bot as a personal evaluation. GPT-4.2 required steering but improved his code 10x; GPT-4.5 can now build a full solver with minimal guidance.

Brown says models lack research taste and cannot replace researchers yet. They accelerate some tasks 100x but leave others as bottlenecks, which leads to gradual transformation rather than overnight replacement.

Large-scale test-time compute makes fast takeoff unlikely, Brown argues. Time becomes a bottleneck because models need long runtimes to unlock full capability, preventing an instantaneous intelligence explosion.

Brown states frontier labs understand the stakes and risks of their models. Competitive pressure exists, but researchers share a goal of steering toward positive outcomes.

Brown says he trusts GPT-4.5 outputs for high-stakes decisions like tax advice and condo paperwork more than expert human advice, indicating a shift in practical reliability.

Multi-agent systems have only scratched the surface of their capability, Brown believes. Frontier models are needed to unlock their potential, much as accumulated knowledge acts as scaffolding for human civilization.

Benchmarks are easy to game by scaffolding multiple model runs or adding judge models, Brown warns. Doing so inflates scores without controlling for test-time compute, which makes comparisons misleading.
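
To make the inflation concrete, the following Python sketch computes how majority voting over several samples raises a reported score while the per-task cost grows in proportion. It is a toy model under strong assumptions that are not from the episode: answers are binary, samples are independent, and every call costs the same. The function names, the 0.6 per-sample accuracy, and the $0.02 call price are hypothetical.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a strict majority of n independent samples is correct.

    Assumes binary answers, per-sample accuracy p, and odd n (so no ties).
    """
    if n % 2 == 0:
        raise ValueError("use an odd n so there are no ties")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def report(p, cost_per_call, ns=(1, 5, 25)):
    """Return (samples, accuracy, cost per task) for each sample count."""
    return [(n, majority_vote_accuracy(p, n), n * cost_per_call) for n in ns]


# Hypothetical numbers chosen only to show the trend.
rows = report(p=0.6, cost_per_call=0.02)
for n, acc, cost in rows:
    print(f"samples={n:>2}  accuracy={acc:.3f}  cost per task=${cost:.2f}")
```

Under these assumptions the same underlying model reports a higher score at each larger sample count, so a grid that lists only the score hides the difference in spend.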

Brown argues the research community agrees benchmarks should include a cost axis but is stuck in a bad equilibrium. Everyone publishes the grid because everyone expects it, despite knowing it's wrong.