Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

1 August 2022
T. Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang
arXiv:2208.00579

Papers citing "Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization"

6 papers shown
Transformer Meets Twicing: Harnessing Unattended Residual Information
Laziz U. Abdullaev, Tan M. Nguyen
02 Mar 2025

MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts
R. Teo, Tan M. Nguyen
18 Oct 2024 · MoE

Breaking the Attention Bottleneck
Kalle Hilsenbek
16 Jun 2024

How Does Momentum Benefit Deep Neural Networks Architecture Design? A Few Case Studies
Bao Wang, Hedi Xia, T. Nguyen, Stanley Osher
13 Oct 2021 · AI4CE

Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, ..., Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
28 Jul 2020 · VLM

Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy, M. Saffar, Ashish Vaswani, David Grangier
12 Mar 2020 · MoE