ResearchTrend.AI

Improving Transformers with Dynamically Composable Multi-Head Attention
arXiv:2405.08553 · 14 May 2024
Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan

Papers citing "Improving Transformers with Dynamically Composable Multi-Head Attention"

4 / 4 papers shown
Mixture of Attention Heads: Selecting Attention Heads Per Token
Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong
MoE · 93 · 42 · 0 · 11 Oct 2022

Transformer Quality in Linear Time
Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le
71 · 220 · 0 · 21 Feb 2022

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
AIMat · 245 · 1,977 · 0 · 31 Dec 2020

Talking-Heads Attention
Noam M. Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, L. Hou
89 · 79 · 0 · 05 Mar 2020