Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization
Why neural networks waste time memorizing before learning the underlying rules
Neural networks often memorize training examples long before they learn to generalize to new cases, a frustrating phenomenon called delayed generalization. This paper argues that the problem stems from hidden representations inflating outward in representation space during normal training, and shows that a simple geometric constraint that keeps them compact can speed up learning by up to 6 times and cut training steps in half.
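To make the idea of "keeping representations compact" concrete, here is a minimal sketch of one way a radial constraint could be written as a penalty term. This is an illustration under stated assumptions, not the paper's formulation: the function name `radial_penalty`, the `max_radius` threshold, and the choice of a squared-excess penalty on the L2 norm are all hypothetical.

```python
import math

def radial_penalty(hidden, max_radius=1.0):
    """Mean squared amount by which each hidden vector's L2 norm exceeds max_radius.

    Hypothetical illustration only: the penalty form and the max_radius
    parameter are assumptions, not taken from the paper.
    """
    total = 0.0
    for vec in hidden:
        norm = math.sqrt(sum(x * x for x in vec))
        excess = max(0.0, norm - max_radius)  # vectors inside the ball cost nothing
        total += excess * excess
    return total / len(hidden)

# Two toy batches of 2-D hidden vectors.
compact = [[0.6, 0.0], [0.0, 0.8]]    # all norms are below 1.0
inflated = [[3.0, 4.0], [0.0, 1.0]]   # the first vector has norm 5.0

print(radial_penalty(compact))
print(radial_penalty(inflated))
```

In a training loop, a term like this would typically be added to the task loss with a small weight, so that the optimizer is discouraged from letting representation norms grow. Whether the paper uses a penalty, a projection, or a normalization step is not stated in this summary.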
Neural networks are notoriously slow and expensive to train, especially at scale. A technique that halves the number of training steps, like the one tested here on a 10-million-parameter language model, directly reduces computational cost and energy use. More fundamentally, understanding why networks memorize before generalizing brings us closer to designing more efficient learning algorithms and to knowing when a model's performance can be trusted.