
On the Representation Collapse of Sparse Mixture of Experts

Neural Information Processing Systems (NeurIPS), 2022
20 April 2022
Zewen Chi
Li Dong
Shaohan Huang
Damai Dai
Shuming Ma
Barun Patra
Saksham Singhal
Payal Bajaj
Xia Song
Xian-Ling Mao
Heyan Huang
Furu Wei
MoMe, MoE
ArXiv (abs) · PDF · HTML · HuggingFace (1 upvote)
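
For readers new to the setting, the short Python sketch below illustrates generic top-k sparse Mixture-of-Experts routing (a softmax router scores the experts and each token is dispatched to its k highest-scoring experts), which is the mechanism whose representation collapse the paper analyzes. It is a minimal illustration under standard assumptions, not the paper's proposed routing method; the names SparseMoE, num_experts, and top_k are chosen here only for clarity.

# Minimal sketch of top-k sparse MoE routing (generic, illustrative names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an independent feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The router produces one score per expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.size(-1))                # (n_tokens, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)    # routing distribution
        top_p, top_idx = probs.topk(self.top_k, dim=-1)   # keep k experts per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = top_idx[:, slot] == e              # tokens sent to expert e
                if mask.any():
                    out[mask] += top_p[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)

# Example: y = SparseMoE(d_model=64, d_ff=256)(torch.randn(2, 8, 64))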

Papers citing "On the Representation Collapse of Sparse Mixture of Experts"

38 / 88 papers shown
Statistical Advantages of Perturbing Cosine Router in Mixture of Experts
International Conference on Learning Representations (ICLR), 2024
Huy Le Nguyen
Pedram Akbarian
Trang Pham
Trang Nguyen
Shujian Zhang
Nhat Ho
MoE
301
2
0
23 May 2024
Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts
Huy Nguyen
Nhat Ho
Alessandro Rinaldo
309
14
0
22 May 2024
A Foundation Model for Brain Lesion Segmentation with Mixture of Modality Experts
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2024
Xinru Zhang
N. Ou
Berke Doga Basaran
Marco Visentin
Mengyun Qiao
...
Ouyang Cheng
Yaou Liu
Paul M. Matthews
Chuyang Ye
Wenjia Bai
MedIm
136
16
0
16 May 2024
Multi-Head Mixture-of-Experts
Xun Wu
Shaohan Huang
Wenhui Wang
Furu Wei
MoE
199
28
0
23 Apr 2024
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
Bowen Pan
Songlin Yang
Haokun Liu
Mayank Mishra
Gaoyuan Zhang
Aude Oliva
Colin Raffel
Yikang Shen
MoE
246
29
0
08 Apr 2024
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
Xudong Lu
Zijun Chen
Yuhui Xu
Aojun Zhou
Siyuan Huang
Bo Zhang
Junchi Yan
Jiaming Song
MoE
247
65
0
22 Feb 2024
Model Compression and Efficient Inference for Large Language Models: A Survey
Wenxiao Wang
Wei Chen
Yicong Luo
Yongliu Long
Zhengkai Lin
Liye Zhang
Binbin Lin
Deng Cai
Xiaofei He
MQ
264
85
0
15 Feb 2024
A Survey on Transformer Compression
Yehui Tang
Yunhe Wang
Jianyuan Guo
Zhijun Tu
Kai Han
Hailin Hu
Dacheng Tao
411
63
0
05 Feb 2024
FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion
Xing Han
Huy Nguyen
Carl Harris
Nhat Ho
Suchi Saria
MoE
338
45
0
05 Feb 2024
CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition
Quang Pham
Giang Do
Huy Nguyen
TrungTin Nguyen
Chenghao Liu
...
Binh T. Nguyen
Savitha Ramasamy
Xiaoli Li
Steven C. H. Hoi
Nhat Ho
151
22
0
04 Feb 2024
LocMoE: A Low-Overhead MoE for Large Language Model Training
International Joint Conference on Artificial Intelligence (IJCAI), 2024
Jing Li
Zhijie Sun
Xuan He
Rongqian Zhao
Binfan Zheng
Entong Li
Yi Lin
Xin Chen
MoE
319
21
0
25 Jan 2024
PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation
Yunhe Wang
Hanting Chen
Yehui Tang
Tianyu Guo
Kai Han
...
Qinghua Xu
Qun Liu
Jun Yao
Chao Xu
Dacheng Tao
257
23
0
27 Dec 2023
From Google Gemini to OpenAI Q* (Q-Star): A Survey of Reshaping the Generative Artificial Intelligence (AI) Research Landscape
Timothy R. McIntosh
Teo Susnjak
Tong Liu
Paul Watters
Malka N. Halgamuge
355
73
0
18 Dec 2023
Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference
AAAI Conference on Artificial Intelligence (AAAI), 2023
Bartosz Wójcik
Alessio Devoto
Karol Pustelnik
Pasquale Minervini
Simone Scardapane
256
7
0
15 Dec 2023
LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin
Jiajun Sun
Enyu Zhou
Yan Liu
Songyang Gao
Jun Zhao
...
Jiang Zhu
Rui Zheng
Tao Gui
Xuanjing Huang
CLL, MoE, KELM
195
52
0
15 Dec 2023
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Neural Information Processing Systems (NeurIPS), 2023
Róbert Csordás
Piotr Piekos
Kazuki Irie
Jürgen Schmidhuber
MoE
179
27
0
13 Dec 2023
DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets
Yash Jain
Harkirat Singh Behl
Z. Kira
Vibhav Vineet
143
25
0
08 Nov 2023
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
Neural Information Processing Systems (NeurIPS), 2023
Szymon Antoniak
Sebastian Jaszczur
Michal Krutul
Maciej Pióro
Jakub Krajewski
Jan Ludziejewski
Tomasz Odrzygóźdź
Marek Cygan
MoE
115
2
0
24 Oct 2023
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Róbert Csordás
Kazuki Irie
Jürgen Schmidhuber
MoE
329
21
0
16 Oct 2023
On the Embedding Collapse when Scaling up Recommendation Models
International Conference on Machine Learning (ICML), 2023
Xingzhuo Guo
Kai Yan
Ximei Wang
Baixu Chen
Haohan Wang
Mingsheng Long
235
44
0
06 Oct 2023
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
International Conference on Learning Representations (ICLR), 2023
Pingzhi Li
Zhenyu Zhang
Prateek Yadav
Yi-Lin Sung
Yu Cheng
Mohit Bansal
Tianlong Chen
MoMe
218
72
0
02 Oct 2023
Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun
Li Dong
Shaohan Huang
Shuming Ma
Yuqing Xia
Jilong Xue
Jianyong Wang
Furu Wei
LRM
653
495
0
17 Jul 2023
Soft Merging of Experts with Adaptive Routing
Mohammed Muqeeth
Haokun Liu
Colin Raffel
MoMe, MoE
246
76
0
06 Jun 2023
One-stop Training of Multiple Capacity Models
Lan Jiang
Haoyang Huang
Dongdong Zhang
R. Jiang
Furu Wei
309
0
0
23 May 2023
Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Leo Liu
Tim Dettmers
Xi Lin
Ves Stoyanov
Xian Li
MoE
126
12
0
23 May 2023
Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers
International Conference on Learning Representations (ICLR), 2023
Tianlong Chen
Zhenyu Zhang
Ajay Jaiswal
Shiwei Liu
Zinan Lin
MoE
235
66
0
02 Mar 2023
Can representation learning for multimodal image registration be improved by supervision of intermediate layers?
Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), 2023
Elisabeth Wetzer
Patrick Micke
Nataša Sladoje
SSL
187
2
0
01 Mar 2023
Language Is Not All You Need: Aligning Perception with Language Models
Neural Information Processing Systems (NeurIPS), 2023
Shaohan Huang
Li Dong
Wenhui Wang
Y. Hao
Saksham Singhal
...
Johan Bjorck
Vishrav Chaudhary
Subhojit Som
Xia Song
Furu Wei
VLM, LRM, MLLM
292
659
0
27 Feb 2023
AdaEnsemble: Learning Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction
Yachen Yan
Liubo Li
158
4
0
06 Jan 2023
TorchScale: Transformers at Scale
Shuming Ma
Hongyu Wang
Shaohan Huang
Wenhui Wang
Zewen Chi
...
Alon Benhaim
Barun Patra
Vishrav Chaudhary
Xia Song
Furu Wei
AI4CE
106
12
0
23 Nov 2022
Accelerating Distributed MoE Training and Inference with Lina
USENIX Annual Technical Conference (USENIX ATC), 2022
Jiamin Li
Yimin Jiang
Yibo Zhu
Cong Wang
Hong-Yu Xu
MoE
175
101
0
31 Oct 2022
Mixture of Attention Heads: Selecting Attention Heads Per Token
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Xiaofeng Zhang
Songlin Yang
Zeyu Huang
Jie Zhou
Wenge Rong
Zhang Xiong
MoE
594
65
0
11 Oct 2022
MoEC: Mixture of Expert Clusters
AAAI Conference on Artificial Intelligence (AAAI), 2022
Yuan Xie
Shaohan Huang
Tianyu Chen
Furu Wei
MoE
202
19
0
19 Jul 2022
Language Models are General-Purpose Interfaces
Y. Hao
Haoyu Song
Li Dong
Shaohan Huang
Zewen Chi
Wenhui Wang
Shuming Ma
Furu Wei
MLLM
179
108
0
13 Jun 2022
Tutel: Adaptive Mixture-of-Experts at Scale
Conference on Machine Learning and Systems (MLSys), 2022
Changho Hwang
Wei Cui
Yifan Xiong
Ziyue Yang
Ze Liu
...
Joe Chau
Peng Cheng
Fan Yang
Mao Yang
Y. Xiong
MoE
326
182
0
07 Jun 2022
VL-BEiT: Generative Vision-Language Pretraining
Hangbo Bao
Wenhui Wang
Li Dong
Furu Wei
VLM
158
48
0
02 Jun 2022
Task-Specific Expert Pruning for Sparse Mixture-of-Experts
Tianyu Chen
Shaohan Huang
Yuan Xie
Binxing Jiao
Daxin Jiang
Haoyi Zhou
Jianxin Li
Furu Wei
MoE
208
52
0
01 Jun 2022
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Transactions of the Association for Computational Linguistics (TACL), 2020
J. Clark
Eunsol Choi
Michael Collins
Dan Garrette
Tom Kwiatkowski
Vitaly Nikolaev
J. Palomaki
536
683
0
10 Mar 2020