This research investigates how features are learned in two-layer neural networks under linear-width constraints, comparing single-step versus multi-step gradient descent. It demonstrates that reusing batches allows the model to capture multiple learned directions, extending the b
A new study explores how features are learned in two-layer neural networks in the linear-width regime, where the input dimension, the number of hidden neurons, and the sample size all grow proportionally. The paper contrasts a single gradient descent step, which can capture only one direction, with a full two-step process.
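To make the single-direction limitation concrete, here is a minimal NumPy sketch, not the paper's setup. It takes one full-batch gradient of the squared loss with respect to the first-layer weights and inspects its singular values. The tanh activation, the tanh single-index target, the sizes `d`, `p`, `n`, and the small second-layer scale `1/p` are all illustrative assumptions.

```python
import numpy as np

def one_step_gradient(d=100, p=150, n=2000, seed=0):
    """Gradient of the squared loss w.r.t. first-layer weights at initialization.

    All modeling choices here (tanh activation, tanh single-index target,
    second-layer scale 1/p, sizes) are assumptions made for illustration.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical target direction beta on the unit sphere.
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)

    # Gaussian inputs and a single-index target y = tanh(<x, beta>).
    X = rng.standard_normal((n, d))
    y = np.tanh(X @ beta)

    # Random first layer (rows of roughly unit norm) and a fixed second layer.
    W0 = rng.standard_normal((p, d)) / np.sqrt(d)
    a = rng.choice([-1.0, 1.0], size=p) / p

    # Forward pass and residual of f(x) = a^T tanh(W0 x).
    Z = X @ W0.T                       # pre-activations, shape (n, p)
    residual = np.tanh(Z) @ a - y      # shape (n,)

    # dL/dW for L = (1 / 2n) * sum_i (f(x_i) - y_i)^2, shape (p, d).
    G = ((residual[:, None] * (1.0 - np.tanh(Z) ** 2)) * a[None, :]).T @ X / n
    return G, beta

G, beta = one_step_gradient()
_, s, Vt = np.linalg.svd(G, full_matrices=False)

ratio = s[0] / s[1]            # how strongly the top singular value dominates
overlap = abs(Vt[0] @ beta)    # alignment of the top right singular vector with beta

print(f"top/second singular value ratio: {ratio:.1f}")
print(f"overlap with target direction:   {overlap:.2f}")
```

In this toy setting the gradient is dominated by a single singular direction that aligns with the target. That is the sense in which one step picks up only one direction.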
The authors derive a spectral characterization showing that the updated weights behave as a spiked random matrix with multiple outliers, each corresponding to a distinct learned feature direction. Crucially, they find that reusing training batches across steps lets the second update capture features even when the information exponent exceeds one. These benefits of batch reuse are shown to persist in the linear-width limit. The work thereby offers a tractable framework for analyzing optimization and the evolution of feature learning in modern overparameterized networks.
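As a rough picture of a spiked random matrix with multiple outliers, the sketch below builds a generic example: a Gaussian bulk plus a rank-two perturbation. It then counts the singular values that separate from the bulk. This is a textbook spiked model, not the weight matrix derived in the paper. The sizes, the spike strengths, and the 10% margin above the bulk edge are hypothetical choices.

```python
import numpy as np

def spiked_matrix(p=300, d=300, strengths=(3.0, 5.0), seed=0):
    """Gaussian bulk plus a low-rank spike (a generic model, assumed for illustration)."""
    rng = np.random.default_rng(seed)
    k = len(strengths)

    # Bulk: i.i.d. Gaussian entries with variance 1/d.
    bulk = rng.standard_normal((p, d)) / np.sqrt(d)

    # Orthonormal left/right spike directions from reduced QR factorizations.
    U, _ = np.linalg.qr(rng.standard_normal((p, k)))
    V, _ = np.linalg.qr(rng.standard_normal((d, k)))

    # Rank-k spike: sum_j strengths[j] * u_j v_j^T.
    spike = (U * np.array(strengths)) @ V.T
    return bulk + spike, V

p, d = 300, 300
W, V = spiked_matrix(p=p, d=d)
s = np.linalg.svd(W, compute_uv=False)   # singular values, descending

# Asymptotic edge of the bulk singular values for this normalization.
bulk_edge = 1.0 + np.sqrt(p / d)

# Count singular values clearly above the bulk (10% margin for finite-size effects).
n_outliers = int(np.sum(s > 1.1 * bulk_edge))

print("bulk edge:", bulk_edge)
print("three largest singular values:", np.round(s[:3], 2))
print("outliers detected:", n_outliers)
```

Each sufficiently strong spike produces one singular value outside the bulk. In the summary's terms, each such outlier would correspond to one learned feature direction.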