Mixture of Attention Heads: Selecting Attention Heads Per Token

11 October 2022
Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong
Topics: MoE

Papers citing "Mixture of Attention Heads: Selecting Attention Heads Per Token"

6 / 6 papers shown

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber
Topics: MoE, VLM
01 May 2025

MambaMoE: Mixture-of-Spectral-Spatial-Experts State Space Model for Hyperspectral Image Classification
Yichu Xu, Di Wang, Hongzan Jiao, L. Zhang, L. Zhang
Topics: Mamba
29 Apr 2025

RouterKT: Mixture-of-Experts for Knowledge Tracing
Han Liao, Shuaishuai Zu
11 Apr 2025

Layerwise Recurrent Router for Mixture-of-Experts
Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, Jie Fu
Topics: MoE
13 Aug 2024

Tricks for Training Sparse Translation Models
Dheeru Dua, Shruti Bhosale, Vedanuj Goswami, James Cross, M. Lewis, Angela Fan
Topics: MoE
15 Oct 2021

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
Topics: MoE
17 Sep 2019