NVIDIA Releases Gated DeltaNet-2: Decoupling Erase and Write in the Delta Rule

Linear attention models struggle to edit their fixed-size recurrent memory (the state that stands in for softmax attention's growing KV cache) without scrambling existing associations. NVIDIA’s Gated DeltaNet-2 addresses this by decoupling the delta-rule update into a channel-wise erase gate (b_t, on the key axis) and a channel-wise write gate (w_t, on the value axis). At 1.3B parameters, trained on 100B FineWeb-Edu tokens, the model outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across language modeling, commonsense reasoning, and long-context retrieval. The largest gains are on RULER S-NIAH and multi-key needle retrieval.
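
To make the erase/write split concrete, here is a minimal NumPy sketch of one recurrent step. It is not the published formulation: the summary above does not give the exact parameterization (gate activations, any decay term, key normalization), so the update below is an assumed form, chosen only because it reduces to the classic delta rule when both gates equal a single scalar. The function name `decoupled_delta_step` and the `(d_v, d_k)` memory layout are hypothetical.

```python
import numpy as np

def decoupled_delta_step(S, k, v, b, w):
    """One assumed update of a (d_v, d_k) associative memory S.

    k: key vector, shape (d_k,)
    v: value vector, shape (d_v,)
    b: erase gate in [0, 1], one entry per key channel, shape (d_k,)
    w: write gate in [0, 1], one entry per value channel, shape (d_v,)
    """
    # Erase: subtract what the memory currently returns for k,
    # scaled independently per key channel by b.
    S_erased = S - np.outer(S @ k, b * k)
    # Write: add the new value along k,
    # scaled independently per value channel by w.
    return S_erased + np.outer(w * v, k)

# Usage: with a unit-norm key and both gates fully open, the old
# association for k is replaced and the other key channels are untouched.
S = np.arange(6, dtype=float).reshape(2, 3)
k = np.array([1.0, 0.0, 0.0])
v = np.array([10.0, 20.0])
S_new = decoupled_delta_step(S, k, v, b=np.ones(3), w=np.ones(2))
```

Under this assumed form, setting `b` and `w` to the same scalar `beta` in every channel gives `S (I - beta k k^T) + beta v k^T`, the standard delta rule, so the two gates can be read as separate per-channel controls over how much is forgotten and how much is stored.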
