Efficient Large Scale Language Modeling with Mixtures of Experts (arXiv:2112.10684)

20 December 2021
Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srini Iyer, Ramakanth Pasunuru, Giridhar Anantharaman, Xian Li, Shuohui Chen, H. Akın, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, Ves Stoyanov
    MoE

Papers citing "Efficient Large Scale Language Modeling with Mixtures of Experts"

50 / 145 papers shown
• DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, ..., Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, W. Liang
  MoE · 11 Jan 2024
• MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
  Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Michał Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Miłoś, Marek Cygan, Sebastian Jaszczur
  MoE, Mamba · 08 Jan 2024
• DEAP: Design Space Exploration for DNN Accelerator Parallelism
  Ekansh Agrawal, Xiangyu Sam Xu
  24 Dec 2023
• Mixture-of-Linear-Experts for Long-term Time Series Forecasting
  Ronghao Ni, Zinan Lin, Shuaiqi Wang, Giulia Fanti
  AI4TS · 11 Dec 2023
• Building Trustworthy NeuroSymbolic AI Systems: Consistency, Reliability, Explainability, and Safety
  Manas Gaur, Amit P. Sheth
  05 Dec 2023
• Learning to Skip for Language Modeling
  Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui
  26 Nov 2023
• Memory Augmented Language Models through Mixture of Word Experts
  Cicero Nogueira dos Santos, James Lee-Thorp, Isaac Noble, Chung-Ching Chang, David C. Uthus
  MoE · 15 Nov 2023
• Intentional Biases in LLM Responses
  Nicklaus Badyal, Derek Jacoby, Yvonne Coady
  11 Nov 2023
• QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
  Elias Frantar, Dan Alistarh
  MQ, MoE · 25 Oct 2023
• Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
  Jirui Qi, Raquel Fernández, Arianna Bisazza
  KELM, HILM · 16 Oct 2023
• Merging Experts into One: Improving Computational Efficiency of Mixture of Experts
  Shwai He, Run-Ze Fan, Liang Ding, Li Shen, Tianyi Zhou, Dacheng Tao
  MoE, MoMe · 15 Oct 2023
• Not All Demonstration Examples are Equally Beneficial: Reweighting Demonstration Examples for In-Context Learning
  Zhe Yang, Damai Dai, Peiyi Wang, Zhifang Sui
  12 Oct 2023
• MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
  Samuel Hsia, Alicia Golden, Bilge Acun, Newsha Ardalani, Zach DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
  MoE · 04 Oct 2023
• Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness
  Young Jin Kim, Raffy Fahim, Hany Awadalla
  MQ, MoE · 03 Oct 2023
• Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
  Pingzhi Li, Zhenyu (Allen) Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, Tianlong Chen
  MoMe · 02 Oct 2023
• LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language Models
  Ahmad Faiz, S. Kaneda, Ruhan Wang, Rita Osi, Parteek Sharma, Fan Chen, Lei Jiang
  25 Sep 2023
• Scaling Laws for Sparsely-Connected Foundation Models
  Elias Frantar, C. Riquelme, N. Houlsby, Dan Alistarh, Utku Evci
  15 Sep 2023
• SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
  Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, R. Ramjee
  AI4TS, LRM · 31 Aug 2023
• SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
  Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, L. Kong, Yunxin Liu
  MoE · 29 Aug 2023
• D4: Improving LLM Pretraining via Document De-Duplication and Diversification
  Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos
  SyDa · 23 Aug 2023
• Experts Weights Averaging: A New General Training Scheme for Vision Transformers
  Yongqian Huang, Peng Ye, Xiaoshui Huang, Sheng R. Li, Tao Chen, Tong He, Wanli Ouyang
  MoMe · 11 Aug 2023
• Bringing order into the realm of Transformer-based language models for artificial intelligence and law
  C. M. Greco, Andrea Tagarelli
  AILaw · 10 Aug 2023
• A Survey of Techniques for Optimizing Transformer Inference
  Krishna Teja Chitty-Venkata, Sparsh Mittal, M. Emani, V. Vishwanath, Arun Somani
  16 Jul 2023
• COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search
  Shibal Ibrahim, Wenyu Chen, Hussein Hazimeh, Natalia Ponomareva, Zhe Zhao, Rahul Mazumder
  MoE · 05 Jun 2023
• The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
  Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra-Aimée Cojocaru, Alessandro Cappelli, Hamza Alobeidli, B. Pannier, Ebtesam Almazrouei, Julien Launay
  01 Jun 2023
• Intriguing Properties of Quantization at Scale
  Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Stephen Gou, Phil Blunsom, A. Ustun, Sara Hooker
  MQ · 30 May 2023
• Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
  Sheng Shen, Le Hou, Yan-Quan Zhou, Nan Du, Shayne Longpre, ..., Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou
  ALM, MoE · 24 May 2023
• Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model
  Leo Liu, Tim Dettmers, Xi Victoria Lin, Ves Stoyanov, Xian Li
  MoE · 23 May 2023
• The "code" of Ethics: A Holistic Audit of AI Code Generators
  Wanlun Ma, Yiliao Song, Minhui Xue, Sheng Wen, Yang Xiang
  22 May 2023
• Evaluation of medium-large Language Models at zero-shot closed book generative question answering
  René Peinl, Johannes Wirth
  ELM · 19 May 2023
• Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness
  Bo Li, Gexiang Fang, Yang Yang, Quansen Wang, Wei Ye, Wen Zhao, Shikun Zhang
  ELM, AI4MH · 23 Apr 2023
• Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
  Xin Chen, Hengheng Zhang, Xiaotao Gu, Kaifeng Bi, Lingxi Xie, Qi Tian
  MoE · 22 Apr 2023
• Scaling Expert Language Models with Unsupervised Domain Discovery
  Suchin Gururangan, Margaret Li, M. Lewis, Weijia Shi, Tim Althoff, Noah A. Smith, Luke Zettlemoyer
  MoE · 24 Mar 2023
• PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing
  Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, ..., Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, Jun Yao
  ALM, MoE · 20 Mar 2023
• A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
  Siddharth Singh, Olatunji Ruwase, A. A. Awan, Samyam Rajbhandari, Yuxiong He, A. Bhatele
  MoE · 11 Mar 2023
• Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
  Haiyang Huang, Newsha Ardalani, Anna Y. Sun, Liu Ke, Hsien-Hsin S. Lee, Anjali Sridhar, Shruti Bhosale, Carole-Jean Wu, Benjamin C. Lee
  MoE · 10 Mar 2023
• Modular Deep Learning
  Jonas Pfeiffer, Sebastian Ruder, Ivan Vulić, E. Ponti
  MoMe, OOD · 22 Feb 2023
• AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models
  Alexandra Chronopoulou, Matthew E. Peters, Alexander M. Fraser, Jesse Dodge
  MoMe · 14 Feb 2023
• Distillation of encoder-decoder transformers for sequence labelling
  M. Farina, D. Pappadopulo, Anant Gupta, Leslie Huang, Ozan Irsoy, Thamar Solorio
  VLM · 10 Feb 2023
• Multipath agents for modular multitask ML systems
  Andrea Gesmundo
  06 Feb 2023
• OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
  Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, ..., Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, Veselin Stoyanov
  ALM · 22 Dec 2022
• Training Trajectories of Language Models Across Scales
  Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Ves Stoyanov
  AIFin, LRM · 19 Dec 2022
• A Natural Bias for Language Generation Models
  Clara Meister, Wojciech Stokowiec, Tiago Pimentel, Lei Yu, Laura Rimell, A. Kuncoro
  MILM · 19 Dec 2022
• BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
  Zheng-Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, ..., Genta Indra Winata, Stella Biderman, Edward Raff, Dragomir R. Radev, Vassilina Nikoulina
  CLL, VLM, AI4CE, LRM · 19 Dec 2022
• ALERT: Adapting Language Models to Reasoning Tasks
  Ping Yu, Tianlu Wang, O. Yu. Golovneva, Badr AlKhamissi, Siddharth Verma, Zhijing Jin, Gargi Ghosh, Mona T. Diab, Asli Celikyilmaz
  ReLM, LRM · 16 Dec 2022
• Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
  Aran Komatsuzaki, J. Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, N. Houlsby
  MoMe, MoE · 09 Dec 2022
• DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing
  Conglong Li, Z. Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He
  07 Dec 2022
• MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
  Trevor Gale, Deepak Narayanan, C. Young, Matei A. Zaharia
  MoE · 29 Nov 2022
• Spatial Mixture-of-Experts
  Nikoli Dryden, Torsten Hoefler
  MoE · 24 Nov 2022
• A Universal Discriminator for Zero-Shot Generalization
  Haike Xu, Zongyu Lin, Jing Zhou, Yanan Zheng, Zhilin Yang
  AI4CE · 15 Nov 2022