Unified Scaling Laws for Routed Language Models
arXiv 2202.01169 · 2 February 2022
Aidan Clark
Diego de Las Casas
Aurelia Guy
A. Mensch
Michela Paganini
Jordan Hoffmann
Bogdan Damoc
Blake A. Hechtman
Trevor Cai
Sebastian Borgeaud
George van den Driessche
Eliza Rutherford
Tom Hennigan
Matthew J. Johnson
Katie Millican
Albin Cassirer
Chris Jones
Elena Buchatskaya
David Budden
Laurent Sifre
Simon Osindero
Oriol Vinyals
Jack W. Rae
Erich Elsen
Koray Kavukcuoglu
Karen Simonyan
MoE
Papers citing "Unified Scaling Laws for Routed Language Models" (46 of 146 shown)
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
Sheng Shen
Le Hou
Yan-Quan Zhou
Nan Du
Shayne Longpre
...
Vincent Zhao
Hongkun Yu
Kurt Keutzer
Trevor Darrell
Denny Zhou
ALM
MoE
25
54
0
24 May 2023
Emergent inabilities? Inverse scaling over the course of pretraining
J. Michaelov
Benjamin Bergen
LRM
ReLM
20
3
0
24 May 2023
Think Before You Act: Decision Transformers with Working Memory
Jikun Kang
Romain Laroche
Xingdi Yuan
Adam Trischler
Xuefei Liu
Jie Fu
OffRL
17
0
0
24 May 2023
Multilingual Large Language Models Are Not (Yet) Code-Switchers
Ruochen Zhang
Samuel Cahyawijaya
Jan Christian Blaise Cruz
Genta Indra Winata
Alham Fikri Aji
LRM
33
50
0
23 May 2023
Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model
Leo Liu
Tim Dettmers
Xi Victoria Lin
Ves Stoyanov
Xian Li
MoE
18
9
0
23 May 2023
Are Emergent Abilities of Large Language Models a Mirage?
Rylan Schaeffer
Brando Miranda
Oluwasanmi Koyejo
LRM
39
395
0
28 Apr 2023
Revisiting Single-gated Mixtures of Experts
Amelie Royer
I. Karmanov
Andrii Skliar
B. Bejnordi
Tijmen Blankevoort
MoE
MoMe
14
6
0
11 Apr 2023
Graph Mixture of Experts: Learning on Large-Scale Graphs with Explicit Diversity Modeling
Haotao Wang
Ziyu Jiang
Yuning You
Yan Han
Gaowen Liu
Jayanth Srinivasa
Ramana Rao Kompella
Zhangyang Wang
19
27
0
06 Apr 2023
Scaling Expert Language Models with Unsupervised Domain Discovery
Suchin Gururangan
Margaret Li
M. Lewis
Weijia Shi
Tim Althoff
Noah A. Smith
Luke Zettlemoyer
MoE
15
46
0
24 Mar 2023
PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing
Xiaozhe Ren
Pingyi Zhou
Xinfan Meng
Xinjing Huang
Yadao Wang
...
Jiansheng Wei
Xin Jiang
Teng Su
Qun Liu
Jun Yao
ALM
MoE
67
60
0
20 Mar 2023
Scaling Vision-Language Models with Sparse Mixture of Experts
Sheng Shen
Z. Yao
Chunyuan Li
Trevor Darrell
Kurt Keutzer
Yuxiong He
VLM
MoE
11
62
0
13 Mar 2023
Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers
Tianlong Chen
Zhenyu (Allen) Zhang
Ajay Jaiswal
Shiwei Liu
Zhangyang Wang
MoE
25
46
0
02 Mar 2023
Modular Deep Learning
Jonas Pfeiffer
Sebastian Ruder
Ivan Vulić
E. Ponti
MoMe
OOD
19
73
0
22 Feb 2023
Optical Transformers
Maxwell G. Anderson
Shifan Ma
Tianyu Wang
Logan G. Wright
Peter L. McMahon
12
20
0
20 Feb 2023
The Power of External Memory in Increasing Predictive Model Capacity
Cenk Baykal
D. Cutler
Nishanth Dikkala
Nikhil Ghosh
Rina Panigrahy
Xin Wang
KELM
13
0
0
31 Jan 2023
Alternating Updates for Efficient Transformers
Cenk Baykal
D. Cutler
Nishanth Dikkala
Nikhil Ghosh
Rina Panigrahy
Xin Wang
MoE
40
5
0
30 Jan 2023
Scaling Laws for Generative Mixed-Modal Language Models
Armen Aghajanyan
L. Yu
Alexis Conneau
Wei-Ning Hsu
Karen Hambardzumyan
Susan Zhang
Stephen Roller
Naman Goyal
Omer Levy
Luke Zettlemoyer
MoE
VLM
19
103
0
10 Jan 2023
Cramming: Training a Language Model on a Single GPU in One Day
Jonas Geiping
Tom Goldstein
MoE
28
84
0
28 Dec 2022
RepMode: Learning to Re-parameterize Diverse Experts for Subcellular Structure Prediction
Donghao Zhou
Chunbin Gu
Junde Xu
Furui Liu
Qiong Wang
Guangyong Chen
Pheng-Ann Heng
MoE
13
4
0
20 Dec 2022
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Aran Komatsuzaki
J. Puigcerver
James Lee-Thorp
Carlos Riquelme Ruiz
Basil Mustafa
Joshua Ainslie
Yi Tay
Mostafa Dehghani
N. Houlsby
MoMe
MoE
19
109
0
09 Dec 2022
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Trevor Gale
Deepak Narayanan
C. Young
Matei A. Zaharia
MoE
8
102
0
29 Nov 2022
Spatial Mixture-of-Experts
Nikoli Dryden
Torsten Hoefler
MoE
24
9
0
24 Nov 2022
Broken Neural Scaling Laws
Ethan Caballero
Kshitij Gupta
Irina Rish
David M. Krueger
19
74
0
26 Oct 2022
M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design
Hanxue Liang
Zhiwen Fan
Rishov Sarkar
Ziyu Jiang
Tianlong Chen
Kai Zou
Yu Cheng
Cong Hao
Zhangyang Wang
MoE
24
80
0
26 Oct 2022
Scaling Laws Beyond Backpropagation
Matthew J. Filipovich
Alessandro Cappelli
Daniel Hesslow
Julien Launay
16
3
0
26 Oct 2022
Precision Machine Learning
Eric J. Michaud
Ziming Liu
Max Tegmark
16
34
0
24 Oct 2022
Sparsity-Constrained Optimal Transport
Tianlin Liu
J. Puigcerver
Mathieu Blondel
OT
21
21
0
30 Sep 2022
Scaling Laws for a Multi-Agent Reinforcement Learning Model
Oren Neumann
C. Gros
24
26
0
29 Sep 2022
A Review of Sparse Expert Models in Deep Learning
W. Fedus
J. Dean
Barret Zoph
MoE
13
144
0
04 Sep 2022
New Auction Algorithms for Path Planning, Network Transport, and Reinforcement Learning
Dimitri Bertsekas
4
2
0
19 Jul 2022
Neural Implicit Dictionary via Mixture-of-Expert Training
Peihao Wang
Zhiwen Fan
Tianlong Chen
Zhangyang Wang
11
12
0
08 Jul 2022
Tutel: Adaptive Mixture-of-Experts at Scale
Changho Hwang
Wei Cui
Yifan Xiong
Ziyue Yang
Ze Liu
...
Joe Chau
Peng Cheng
Fan Yang
Mao Yang
Y. Xiong
MoE
92
108
0
07 Jun 2022
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
Kushal Tirumala
Aram H. Markosyan
Luke Zettlemoyer
Armen Aghajanyan
TDI
13
185
0
22 May 2022
Sparsely-gated Mixture-of-Expert Layers for CNN Interpretability
Svetlana Pavlitska
Christian Hubschneider
Lukas Struppek
J. Marius Zöllner
MoE
16
11
0
22 Apr 2022
On the Representation Collapse of Sparse Mixture of Experts
Zewen Chi
Li Dong
Shaohan Huang
Damai Dai
Shuming Ma
...
Payal Bajaj
Xia Song
Xian-Ling Mao
Heyan Huang
Furu Wei
MoMe
MoE
37
96
0
20 Apr 2022
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
37
1,830
0
29 Mar 2022
Efficient Language Modeling with Sparse all-MLP
Ping Yu
Mikel Artetxe
Myle Ott
Sam Shleifer
Hongyu Gong
Ves Stoyanov
Xian Li
MoE
15
11
0
14 Mar 2022
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Barret Zoph
Irwan Bello
Sameer Kumar
Nan Du
Yanping Huang
J. Dean
Noam M. Shazeer
W. Fedus
MoE
24
181
0
17 Feb 2022
A Survey on Dynamic Neural Networks for Natural Language Processing
Canwen Xu
Julian McAuley
AI4CE
24
28
0
15 Feb 2022
Efficient Large Scale Language Modeling with Mixtures of Experts
Mikel Artetxe
Shruti Bhosale
Naman Goyal
Todor Mihaylov
Myle Ott
...
Jeff Wang
Luke Zettlemoyer
Mona T. Diab
Zornitsa Kozareva
Ves Stoyanov
MoE
50
188
0
20 Dec 2021
Unbiased Gradient Estimation with Balanced Assignments for Mixtures of Experts
W. Kool
Chris J. Maddison
A. Mnih
26
10
0
24 Sep 2021
Scalable and Efficient MoE Training for Multitask Multilingual Models
Young Jin Kim
A. A. Awan
Alexandre Muzio
Andres Felipe Cruz Salinas
Liyang Lu
Amr Hendy
Samyam Rajbhandari
Yuxiong He
Hany Awadalla
MoE
94
84
0
22 Sep 2021
Learning Curve Theory
Marcus Hutter
128
58
0
08 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
248
1,986
0
31 Dec 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
226
4,453
0
23 Jan 2020
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
243
1,817
0
17 Sep 2019