On the Role of Attention Masks and LayerNorm in Transformers

29 May 2024

Papers citing "On the Role of Attention Masks and LayerNorm in Transformers"

5 / 5 papers shown

Title
Lambda-Skip Connections: the architectural component that prevents Rank Collapse Federico Arangath Joseph Jerome Sieber M. Zeilinger Carmen Amo Alonso 33 0 0 14 Oct 2024
ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models N. Jha Brandon Reagen OffRL AI4CE 28 0 0 12 Oct 2024
On the Expressivity Role of LayerNorm in Transformers' Attention Shaked Brody Shiyu Jin Xinghao Zhu MoE 59 30 0 04 May 2023
Big Bird: Transformers for Longer Sequences Manzil Zaheer Guru Guruganesh Kumar Avinava Dubey Joshua Ainslie Chris Alberti ... Philip Pham Anirudh Ravula Qifan Wang Li Yang Amr Ahmed VLM 249 1,982 0 28 Jul 2020
Efficient Content-Based Sparse Attention with Routing Transformers Aurko Roy M. Saffar Ashish Vaswani David Grangier MoE 228 578 0 12 Mar 2020