Reiner Pope explains that the fundamental building blocks of chip design are logic gates (AND, OR, NOT) connected by metal traces, and that AI chips optimize for matrix multiplication by building on multiply-accumulate (MAC) primitives.
Pope details a MAC circuit that pairs a 4-bit multiplication with an 8-bit addition; the accumulator is kept at higher precision to counter the rounding error that builds up when many low-precision products are summed.
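A minimal Python sketch of the MAC step described above, assuming unsigned operands and an accumulator that wraps on overflow. The names `mac` and `dot` and the wraparound behavior are illustrative assumptions, not details from the talk, and integer overflow stands in here for the precision loss a narrow accumulator would suffer.

```python
def mac(acc, a, b, mul_bits=4, acc_bits=8):
    """Return (acc + a * b) wrapped to acc_bits; a and b are unsigned mul_bits values."""
    limit = 1 << mul_bits
    assert 0 <= a < limit and 0 <= b < limit, "operand exceeds multiplier width"
    product = a * b  # two 4-bit operands give at most 15 * 15 = 225, which fits in 8 bits
    return (acc + product) & ((1 << acc_bits) - 1)  # assumed: wrap around on overflow


def dot(xs, ws, acc_bits=8):
    """Dot product built from repeated MAC steps."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = mac(acc, x, w, acc_bits=acc_bits)
    return acc
```

With these definitions, `dot([15, 15], [15, 15], acc_bits=8)` wraps to 194, while the same call with `acc_bits=16` keeps the exact sum 450. A wider accumulator is what lets many small products be summed safely.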
Partial products in a multiplier are generated by AND gates; for a PxQ-bit multiply, this requires P*Q AND gates. The core summing work uses full adders (3-to-2 compressors), with P*Q full adders needed in the general case.
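The gate counts above can be checked with a small bit-level model. This is a sketch of a simple row-by-row array multiplier, chosen for readability; it is an assumption, not necessarily the structure Pope draws. Every row is charged P full adders, including the first row, which a real design could simplify.

```python
def full_adder(a, b, c):
    """3-to-2 compressor: three input bits in, (sum bit, carry bit) out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def array_multiply(x, y, p, q):
    """Multiply a p-bit x by a q-bit y; return (product, and_gates, full_adders)."""
    xb = [(x >> i) & 1 for i in range(p)]
    yb = [(y >> j) & 1 for j in range(q)]
    acc = [0] * (p + q)  # running sum bits, least significant first
    and_gates = 0
    full_adders = 0
    for j in range(q):
        # Partial-product row j: every bit of x ANDed with bit j of y.
        row = [xb[i] & yb[j] for i in range(p)]
        and_gates += p
        # Ripple-add the row into the running sum, shifted left by j.
        carry = 0
        for i in range(p):
            acc[i + j], carry = full_adder(acc[i + j], row[i], carry)
            full_adders += 1
        acc[p + j] = carry  # this position is still zero before row j is added
    product = sum(bit << k for k, bit in enumerate(acc))
    return product, and_gates, full_adders
```

For a 4x4-bit multiply, `array_multiply(13, 11, 4, 4)` returns `(143, 16, 16)`: 16 AND gates and 16 full adders. An 8x8-bit multiply uses 64 of each, so doubling the width quadruples the count.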
Pope highlights that circuit area scales quadratically with bitwidth, a key reason low-precision arithmetic pays off for neural nets. He notes that Nvidia's B100/B200 specs now reflect this, listing FP4 as three times faster than FP8.
In pre-tensor-core GPUs and in CPUs, most circuit area went to data movement (multiplexers selecting operands from register files) rather than to the logic unit itself, which made them inefficient.
Systolic arrays solve this by baking larger matrix-vector multiplication loops into hardware, storing weight matrices locally to reuse over many vectors, minimizing register file communication.
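The benefit of keeping weights resident can be illustrated with a simple operand-fetch count. This is a counting model only, not a cycle-accurate systolic array. The class name `WeightStationaryArray` and the cost model (two register-file fetches per MAC in the naive case) are assumptions made for illustration.

```python
def naive_matvec(weights, x):
    """Matrix-vector product where every MAC fetches a weight and an input."""
    fetches = 0
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            fetches += 2  # one weight + one activation from the register file
            acc += w * xi
        out.append(acc)
    return out, fetches


class WeightStationaryArray:
    """Weights are loaded once and stay local; only input vectors stream in."""

    def __init__(self, weights):
        self.weights = [list(row) for row in weights]
        self.fetches = sum(len(row) for row in self.weights)  # one-time weight load

    def matvec(self, x):
        self.fetches += len(x)  # each input element is fetched once per vector
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]
```

For a 2x2 weight matrix applied to two vectors, the naive loop performs 16 fetches, while the weight-stationary version performs 8 (4 for the weights, 2 per vector). The gap widens as more vectors reuse the same weights.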
Pope describes chip clock cycles as global synchronization points; clock speed is limited by logic delay, and inserting pipeline registers splits logic to increase frequency at the cost of area.
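A rough model of the pipelining trade-off: the clock period must cover the slowest stage of logic between registers. The delay figures and the helper names below are hypothetical, picked only to make the arithmetic easy to follow.

```python
def max_clock_ghz(stage_delays_ns, register_overhead_ns=0.0):
    """Highest clock rate at which the slowest stage still settles within one cycle."""
    return 1.0 / (max(stage_delays_ns) + register_overhead_ns)


def pipeline_register_banks(stage_delays_ns):
    """Register banks inserted between stages; this is the added area cost."""
    return len(stage_delays_ns) - 1


unpipelined = [2.0]      # one block of logic with 2 ns of delay
pipelined = [1.0, 1.0]   # the same logic split evenly by one register bank
```

Under these assumptions the unpipelined block tops out at 0.5 GHz, and the two-stage version reaches 1.0 GHz at the cost of one extra bank of registers. An uneven split helps less, because the slowest stage sets the clock.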
FPGAs emulate ASIC logic using programmable lookup tables (LUTs) and multiplexers, but incur ~10x overhead because a LUT implementing a simple gate requires many more gates than a direct ASIC implementation.
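To see where the overhead comes from, a LUT can be modeled as a truth table stored in configuration bits plus a tree of 2:1 multiplexers that selects one entry. The sketch below is a simplified model under that assumption; `make_lut` and `lut_cost` are illustrative names, and the cost figures count only the mux tree and configuration bits, not routing.

```python
def make_lut(truth_table):
    """Build a k-input LUT from 2**k configuration bits (index = sum(in[i] << i))."""
    table = list(truth_table)

    def lut(*inputs):
        assert len(table) == 2 ** len(inputs), "truth table size must be 2**k"
        level = table
        for sel in inputs:
            # One layer of 2:1 muxes: each picks between a pair of adjacent entries.
            level = [level[i + 1] if sel else level[i] for i in range(0, len(level), 2)]
        return level[0]

    return lut


def lut_cost(k):
    """(configuration bits, 2:1 muxes) needed for a k-input LUT."""
    return 2 ** k, 2 ** k - 1


and2 = make_lut([0, 0, 0, 1])  # a 2-input AND gate expressed as a truth table
```

In this model a 4-input LUT needs 16 configuration bits and 15 multiplexers even when it is configured as a single AND gate, which an ASIC would implement directly.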
Non-deterministic latency on CPUs stems from design choices like caches, where a hit or miss depends on ambient state. Scratchpad architectures (e.g., TPUs) give software explicit control over memory access, which makes timing deterministic.
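The contrast can be sketched with a toy direct-mapped cache and a fixed-latency scratchpad. The cycle counts (1 for a hit, 100 for a miss) and the class names are illustrative assumptions, not measurements of any real CPU or TPU.

```python
class DirectMappedCache:
    """Latency of an access depends on which addresses were touched before it."""

    def __init__(self, num_lines, hit_cycles=1, miss_cycles=100):
        self.tags = [None] * num_lines
        self.hit_cycles = hit_cycles
        self.miss_cycles = miss_cycles

    def access(self, addr):
        index = addr % len(self.tags)
        tag = addr // len(self.tags)
        if self.tags[index] == tag:
            return self.hit_cycles
        self.tags[index] = tag  # fill the line, evicting whatever was there
        return self.miss_cycles


class Scratchpad:
    """Software places data explicitly, so every access costs the same."""

    def __init__(self, size, access_cycles=1):
        self.data = [0] * size
        self.access_cycles = access_cycles

    def access(self, offset):
        assert 0 <= offset < len(self.data), "software must stay within the scratchpad"
        return self.access_cycles
```

With a 4-line cache, accessing address 0 costs 100 cycles the first time and 1 cycle the second, then 100 again after an access to address 4 evicts it. The scratchpad returns the same latency for every access, so a compiler can schedule around it ahead of time.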
Pope contrasts GPU and TPU high-level organization: GPUs tile many small SM units (with tensor cores) across the die, while TPUs use fewer, coarser-grained matrix and vector units, enabling larger systolic arrays.
Most chip energy consumption comes from dynamic power: charging and discharging capacitors when bits toggle. Running slower reduces the number of transitions per second but doesn't yield disproportionate efficiency gains.
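One way to see this is the standard first-order CMOS model, in which dynamic power is activity factor times capacitance times voltage squared times frequency. That model is brought in here as an assumption (it is not spelled out in the summary), and the sketch holds supply voltage fixed and ignores leakage. Parameter values are hypothetical.

```python
def dynamic_power_watts(alpha, cap_farads, vdd_volts, freq_hz):
    """First-order dynamic power: alpha * C * V^2 * f."""
    return alpha * cap_farads * vdd_volts ** 2 * freq_hz


def energy_per_cycle_joules(alpha, cap_farads, vdd_volts):
    """Energy spent per clock cycle; note that frequency does not appear."""
    return alpha * cap_farads * vdd_volts ** 2


# Hypothetical operating point: 10% of 1 nF switching each cycle at 0.8 V.
fast = dynamic_power_watts(0.1, 1e-9, 0.8, 2e9)
slow = dynamic_power_watts(0.1, 1e-9, 0.8, 1e9)
```

Halving the clock halves the power in this model, but the work also takes twice as long, so the energy per cycle (and therefore per operation) is unchanged at a fixed voltage.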