NVIDIA introduces Gated DeltaNet-2, a linear attention layer that decouples erasing from writing in its memory and reports state-of-the-art results on language modeling and retrieval benchmarks.
Linear attention methods struggle to edit their memory (the fixed-size state that plays the role of a KV cache) without scrambling existing associations. NVIDIA’s Gated DeltaNet-2 addresses this by splitting the memory update into a channel-wise erase gate (b_t, on the key axis) and a channel-wise write gate (w_t, on the value axis). At 1.3B parameters, trained on 100B FineWeb-Edu tokens, the model outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 on language modeling, commonsense reasoning, and long-context retrieval. The largest gains appear on RULER S-NIAH and multi-key needle retrieval.
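
To make the erase/write split concrete, here is a minimal sketch of one memory update with separate channel-wise gates. The summary above does not give the layer's exact equations, so the update form used here is an assumption: the standard delta rule, with its single scalar gate replaced by a per-key-channel erase gate `b` and a per-value-channel write gate `w`. The function names `matvec` and `decoupled_delta_step` are hypothetical, and any decay term or normalization the real layer may apply is left out.

```python
def matvec(S, k):
    """Read from memory: multiply the d_v x d_k state S by the key k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def decoupled_delta_step(S, k, v, b, w):
    """One illustrative memory update with separate erase and write gates.

    Assumed form (not taken from the paper):
        S_new = S - (S k)(b * k)^T + (w * v) k^T
    where * is element-wise. b gates erasure per key channel,
    w gates writing per value channel.
    """
    old_value = matvec(S, k)                             # what the memory currently returns for k
    erase_dir = [b_j * k_j for b_j, k_j in zip(b, k)]    # gated key direction to erase along
    write_val = [w_i * v_i for w_i, v_i in zip(w, v)]    # gated value to write
    return [
        [S[i][j] - old_value[i] * erase_dir[j] + write_val[i] * k[j]
         for j in range(len(k))]
        for i in range(len(S))
    ]


# Toy usage: a 2x2 memory and a unit-norm key.
S0 = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
ones, zeros = [1.0, 1.0], [0.0, 0.0]

S1 = decoupled_delta_step(S0, key, [2.0, 3.0], ones, ones)          # store (2, 3) under key
S_overwrite = decoupled_delta_step(S1, key, [5.0, 7.0], ones, ones)  # full erase, full write
S_accumulate = decoupled_delta_step(S1, key, [5.0, 7.0], zeros, ones)  # no erase, full write
S_partial = decoupled_delta_step(S1, key, [5.0, 7.0], ones, [1.0, 0.0])  # write one value channel
```

In this toy setup, setting both gates to the same scalar recovers the ordinary delta rule, while setting them independently lets the memory overwrite an association, accumulate on top of it, or update only some value channels. That independence is the idea the paragraph above describes, under the stated assumptions.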