AI & TECH

Equity-for-compute deals reveal AI's physical wall

Friday, May 1, 2026 · from 3 podcasts
  • Labs trade ownership for power and cables, as seen in Anthropic's $73B deals with Google and Amazon.
  • Hardware shortages force reliance on three-year-old chips while criminal AI scales.
  • Rack cable density now dictates model architecture, not algorithmic breakthroughs.

Anthropic traded future equity for $73 billion in cloud compute because the true currency is no longer cash, but physical infrastructure. On Moonshots, the analysts argued these deals secure survival, not just capital, as labs face a recursive bottleneck: you need chips to build models, and models to afford chips.

This supply crunch has become a throttle. The Intelligence reports that NVIDIA chips are sold out, forcing firms to use 2-3 year old hardware, while lead times for electrical transformers stretch to five years. Money is chasing a wall of land, water, and power.

"The supply crunch acts as a natural brake on the industry. Tech giants like Amazon and Microsoft are spending $700 billion on data centers this year, but money cannot buy land, water, or electricity where local opposition is mounting."

- The Intelligence from The Economist

The physical constraints are reshaping technical design. Reiner Pope explains on the Dwarkesh Podcast that high-speed Mixture of Experts models are limited by rack size. The all-to-all communication between experts works only within a single rack's dense NVLink network; crossing to another rack is eight times slower.

Cable density and bend radius, not compute theory, now cap the number of experts a model can use. Pope notes the leap from NVIDIA's Hopper to Blackwell was less about the chip and more about how many GPUs could fit within the same high-speed cable domain.

"The primary constraint on increasing rack size is physical: cable density, bend radius, weight, and cooling, not a fundamental technical barrier."

- Reiner Pope, Dwarkesh Podcast
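
To put the in-rack versus cross-rack penalty in concrete terms, here is a rough Python sketch of per-layer expert dispatch time. Only the roughly 8x gap between the scale-up and scale-out networks comes from the episode; the hidden size, top-k, activation precision, and absolute NVLink bandwidth are illustrative assumptions.

```python
# Back-of-envelope: time to dispatch tokens to experts for one MoE layer,
# inside a rack versus across racks. Only the ~8x bandwidth gap is from the
# episode; every other number below is an illustrative assumption.

HIDDEN_DIM = 7168          # activation width per token (assumed)
TOP_K = 8                  # experts activated per token (assumed)
BYTES_PER_VALUE = 2        # bf16 activations (assumed)
SCALE_UP_BW = 900e9        # bytes/s per GPU over in-rack NVLink (assumed)
SCALE_OUT_BW = SCALE_UP_BW / 8   # leaving the rack is ~8x slower (episode)

def dispatch_time_ms(batch_tokens: int, bandwidth: float) -> float:
    """All-to-all time for one MoE layer: send activations to experts and back."""
    bytes_moved = 2 * batch_tokens * TOP_K * HIDDEN_DIM * BYTES_PER_VALUE
    return bytes_moved / bandwidth * 1e3

for batch in (512, 2048):
    in_rack = dispatch_time_ms(batch, SCALE_UP_BW)
    cross_rack = dispatch_time_ms(batch, SCALE_OUT_BW)
    print(f"batch {batch}: in-rack {in_rack:.2f} ms, cross-rack {cross_rack:.2f} ms")
```

At a few milliseconds per layer on the slow path, cross-rack dispatch multiplied over dozens of MoE layers would blow well past a roughly 20 ms decode step, which is why expert parallelism is kept inside the rack's scale-up domain.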

The race is becoming vertically integrated. Google, which already controls an estimated 25% of global AI compute, is using its own AI to design its eighth-generation TPU chips. The goal is total silicon-to-software sovereignty, maximizing economic value per token from an owned stack.

While labs fight over physical real estate, automated crime scales alongside them. Criminal groups use AI-powered 'malware as a service' to steal biometrics and drain accounts, a $500 billion industry evolving faster than defenses. The frontier is splitting between those who control the physical layer and those left to rent it.

Source Intelligence

- Deep dive into what was said in the episodes

Google Invests $40B Into Anthropic, GPT 5.5 Drops, and Google Cloud Dominates | EP #252 · Apr 30

  • Anthropic trades massive equity for infrastructure access as the training bottleneck shifts to power and fabs.
  • Frontier models are self-improving at a rate that renders human-led benchmarking nearly obsolete.
  • Google’s eighth-gen TPUs, designed by AI, signal a shift toward total silicon-to-software integration.

Reiner Pope – The math behind how LLMs are trained and served · Apr 29

  • Reiner Pope explains that batch size is the key variable driving the trade-off between inference latency and cost. Batching amortizes the fixed cost of fetching model weights across many user requests.
  • Without batching, serving a large model is uneconomical. Pope states the cost can be a thousand times worse than when batching just two users together.
  • A roofline model for inference time combines compute time and memory fetch time. Compute time scales linearly with batch size, while memory time includes a constant for weights and a term linear in batch size for the KV cache.
  • There is a hard lower bound on inference latency set by the time needed to read all the model's total parameters from memory into the chips, which is independent of batch size.
  • Pope solves for the batch size where compute and memory times are balanced. The formula is batch size >= (Flops / Memory Bandwidth) * (Total Params / Active Params), where the hardware ratio Flops/Bandwidth is ~300 (see the sketch after this list).
  • This balance point implies the optimal batch size is approximately 300 times the model's ratio of total to active parameters. For DeepSeek's 32-of-256 expert sparsity, a ratio of 8, this yields a batch size around 2000-3000 tokens.
  • In a scheduled system, a new inference 'train' departs every 20 milliseconds. Worst-case latency for a user is 40ms if they just miss a departure and must wait for the next train to complete.
  • The 20ms schedule is derived from the time to read the entire HBM capacity. For a Rubin-generation system with 288GB HBM and 20 TB/s bandwidth, this is about 15ms.
  • Pope argues increasing sparsity is a pure win for inference cost, as it reduces the active parameters and thus compute time. However, it demands larger batch sizes to amortize weight fetches and consumes more memory capacity.
  • Mixture-of-experts layers use expert parallelism, where different experts are placed on different GPUs. This creates an all-to-all communication pattern that is optimal within a single rack's high-bandwidth scale-up network.
  • Leaving the rack uses a scale-out network about eight times slower than the internal NVLink. This makes crossing rack boundaries for expert parallelism a severe bottleneck.
  • Pope states the primary constraint on increasing rack size is physical: cable density, bend radius, weight, and cooling, not a fundamental technical barrier.
  • Pipeline parallelism, which places different model layers on different racks, is viable for inference because the communication pattern is point-to-point rather than all-to-all, making scale-out latency manageable.
  • Pope argues the value of large scale-up domains like Google's or NVIDIA's Rubin is not primarily memory capacity, but memory bandwidth, which directly lowers inference latency and enables longer context lengths.
  • He presents a heuristic cost model for model development: total cost = pre-training cost + RL cost + inference cost. He conjectures labs roughly equalize these three costs.
  • Applying this model, Pope estimates frontier models are overtrained by a factor of about 100 relative to the compute-optimal Chinchilla scaling law, due to the need to amortize training compute over vast inference usage.
  • Pope reverse-engineers API pricing to deduce system bottlenecks. Gemini charging more for contexts over 200K tokens suggests a memory-to-compute crossover point near that length.
  • Output tokens being ~5x more expensive than input tokens indicates decode is memory-bandwidth bound, while pre-fill is compute-bound, as pre-fill amortizes memory costs over many tokens.
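
Much of this arithmetic can be reproduced in a few lines. The sketch below is a minimal Python rendering of the decode-step roofline described above: compute time grows linearly with batch size, memory time is a fixed weight-read term plus a KV-cache term, and their crossover gives the critical batch size. The ~300 flops-per-byte ratio and the 288 GB / 20 TB/s HBM figures are the ones quoted in this episode; the model shape (roughly 640B total parameters with 32 of 256 experts active), bf16 weights, per-token KV-cache bytes, and 16-way sharding are illustrative assumptions.

```python
# Schematic decode-step roofline, following the balance argument above.
# Hardware numbers echo the episode (flops/byte ~300, 288 GB HBM, 20 TB/s per
# chip); the model shape and sharding factor are illustrative assumptions.

MEM_BW = 20e12                       # bytes/s HBM bandwidth per chip (episode)
FLOPS = 300 * MEM_BW                 # per-chip flops, so flops-per-byte is ~300
TOTAL_PARAMS = 640e9                 # total parameters, DeepSeek-style MoE (assumed)
ACTIVE_PARAMS = TOTAL_PARAMS * 32 / 256   # 32 of 256 experts active per token
BYTES_PER_PARAM = 2                  # bf16 weights (assumed)
KV_BYTES_PER_TOKEN = 8e3             # KV-cache bytes read per token per chip (assumed)
NUM_CHIPS = 16                       # chips the model is sharded across (assumed)

def step_time_ms(batch: int) -> float:
    """One decode step at a given global batch size, per-chip roofline."""
    compute = 2 * batch * (ACTIVE_PARAMS / NUM_CHIPS) / FLOPS            # linear in batch
    weight_read = TOTAL_PARAMS * BYTES_PER_PARAM / NUM_CHIPS / MEM_BW    # fixed floor
    kv_read = batch * KV_BYTES_PER_TOKEN / MEM_BW                        # linear in batch
    return max(compute, weight_read + kv_read) * 1e3                     # slower side wins

# Critical batch where compute catches up with the weight read:
# ~ (flops / bandwidth) * (total / active) = ~300 * 8 = ~2400 here.
critical = (FLOPS / MEM_BW) * (TOTAL_PARAMS * BYTES_PER_PARAM) / (2 * ACTIVE_PARAMS)
print(f"critical batch ~ {critical:.0f} tokens")

for b in (256, 2400, 8000):
    print(f"batch {b:>5}: step ~ {step_time_ms(b):.1f} ms")

# The 'train schedule' floor quoted above: sweeping a full 288 GB HBM stack.
print(f"HBM sweep ~ {288e9 / MEM_BW * 1e3:.1f} ms")
```

With these numbers the crossover lands near 2,400 tokens, matching the 2,000-3,000 figure above, and the full-HBM sweep lands near 14 ms, the basis of the roughly 20 ms 'train' schedule.
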
Also from this episode: (2)

Models (1)

  • Empirical research on mixture-of-experts shows model quality can increase with sparsity. An older paper found a 64-expert model with 270M active parameters matched the quality of a dense 1.3B parameter model.

AI & Tech (1)

  • Pipelining reduces the memory capacity needed per rack for model weights but does not reduce the memory needed for the KV cache, which becomes the dominant memory consumer.
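
One way to make this concrete, as a rough sketch rather than anything stated in the episode: splitting layers across pipeline stages divides the weight bytes each rack must hold, but keeping every stage busy means carrying one microbatch per stage, and each stage stores the KV cache for its own layers of every in-flight sequence, so the per-rack KV footprint stays put. All numbers below are illustrative assumptions.

```python
# Per-rack memory under pipeline parallelism: weights shrink with the number
# of stages, KV cache does not. All model and workload numbers are assumptions.

TOTAL_WEIGHT_BYTES = 1.3e12          # ~650B params in bf16 (assumed)
NUM_LAYERS = 64                      # transformer layers (assumed)
KV_BYTES_PER_TOKEN_PER_LAYER = 2e3   # KV cache per token per layer (assumed)
SEQS_PER_MICROBATCH = 500            # sequences per pipeline microbatch (assumed)
CONTEXT_LEN = 16_000                 # average KV tokens kept per sequence (assumed)

def per_rack_memory_gb(num_stages: int) -> tuple[float, float]:
    """Weight and KV-cache bytes held by a single pipeline stage (one rack)."""
    weights = TOTAL_WEIGHT_BYTES / num_stages      # each rack holds only its layers
    layers_here = NUM_LAYERS / num_stages
    # Keeping the pipeline full needs one microbatch per stage in flight, and a
    # stage keeps KV for its layers of every in-flight sequence.
    active_seqs = num_stages * SEQS_PER_MICROBATCH
    kv = active_seqs * CONTEXT_LEN * layers_here * KV_BYTES_PER_TOKEN_PER_LAYER
    return weights / 1e9, kv / 1e9

for stages in (1, 4, 16):
    w, kv = per_rack_memory_gb(stages)
    print(f"{stages:>2} stages: weights {w:6.0f} GB/rack, KV cache {kv:6.0f} GB/rack")
```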

Power ranges: AI faces supply crunch · Apr 29

  • OpenAI shut down its Sora video generation tool to allocate scarce computing resources toward more lucrative ventures, reflecting an industry-wide AI compute shortage.
  • Weekly AI token processing on OpenRouter quadrupled from January to March 2024, illustrating surging AI demand that hardware cannot match.
  • Five major U.S. cloud providers, including Amazon, Meta, and Microsoft, will spend close to $700 billion on AI data center buildouts this year.
  • Data center construction faces local opposition over electricity, land, and water usage, causing project delays amid the urgent AI capacity push.
  • NVIDIA supplies over two-thirds of the world's AI processing power, but its chips are sold out, forcing companies to fall back on hardware that is 2-3 years old.
  • TSMC is the sole manufacturer for most advanced AI chips. Its capital expenditures are increasing by $60 billion this year, but capacity remains constrained.
  • Elon Musk's proposed 'TerraFab' aims to exceed all current chip fabrication capacity by 2030, a project analysts estimate would cost $5 to $13 trillion.
  • A prolonged AI supply crunch could reverse the trend of falling inference prices, leading to higher costs for users and potentially slowing AI adoption.
Also from this episode: (6)

AI & Tech (5)

  • A sophisticated spyware attack in Indonesia used a fake tax app to steal biometric data and drain over $26,000 from a charity accountant's bank accounts.
  • Criminal groups now operate a 'malware as a service' model, buying and selling stolen data and malicious software on platforms like Telegram to execute rapid, personalized attacks.
  • The global cybercrime industry is estimated to generate $500 billion annually, a scale comparable to the global illicit drug trade.
  • Security firm Infoblox identified a software cluster targeting victims in over 20 countries, with criminals integrating AI chatbots and deepfake tools to enhance attacks.
  • Allbirds is abandoning its footwear business, selling all shoe assets and rebranding as Newbird AI to pivot towards AI compute infrastructure.

Business (1)

  • Millennial-focused direct-to-consumer brands like Allbirds face pressure from rising interest rates, expensive online ad markets, and competition from larger, established companies.