Characterizing Feature Learning in Neural Networks: The Impact of Multiple Gradient Descent Steps

A new study examines how two-layer neural networks learn features in the linear-width regime, where the input dimension, the number of hidden neurons, and the sample size all grow proportionally. The paper contrasts a single gradient descent step, which is fundamentally limited to capturing a single direction, with a full two-step update.

The authors derive a spectral characterization showing that the updated weights behave as a spiked random matrix with multiple outliers, each corresponding to a distinct learned feature direction. Crucially, they find that reusing training batches lets the second update capture features even when the information exponent exceeds one. Because this benefit persists in the linear-width limit, the work offers a tractable framework for analyzing how optimization and feature learning evolve in overparameterized networks.
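
To make the "spiked random matrix with multiple outliers" picture concrete, here is a minimal NumPy sketch of a generic spiked matrix. It is not the paper's weight matrix or its derivation: the sizes, the spike strengths, the random spike directions, and the function names are all assumptions chosen for illustration. A Gaussian bulk is perturbed by two rank-one terms, and the singular values that detach from the bulk are counted.

```python
import numpy as np


def spiked_matrix_singular_values(p=300, d=300, strengths=(3.0, 4.0), seed=0):
    """Singular values of a Gaussian bulk plus rank-one spikes (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Bulk: i.i.d. Gaussian entries with variance 1/d, so the largest
    # singular value of the bulk alone sits near 1 + sqrt(p / d).
    W = rng.standard_normal((p, d)) / np.sqrt(d)
    for theta in strengths:
        # Each spike is theta * u v^T with random unit vectors u and v.
        # These directions are hypothetical stand-ins for learned features.
        u = rng.standard_normal(p)
        u /= np.linalg.norm(u)
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        W += theta * np.outer(u, v)
    # Singular values are returned in descending order.
    return np.linalg.svd(W, compute_uv=False)


def count_outliers(singular_values, p, d, margin=0.3):
    """Count singular values that clear the approximate bulk edge by a margin."""
    bulk_edge = 1.0 + np.sqrt(p / d)
    return int(np.sum(singular_values > bulk_edge + margin))


svals = spiked_matrix_singular_values()
n_outliers = count_outliers(svals, p=300, d=300)
print("top singular values:", np.round(svals[:4], 2))
print("outliers beyond the bulk:", n_outliers)
```

With spike strengths this far above the bulk edge, each rank-one term should produce one isolated singular value, mirroring the summary's statement that every outlier corresponds to one feature direction. The margin is an arbitrary cushion against finite-size fluctuations, not a quantity taken from the paper.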
