Unified Scaling Laws for Routed Language Models (arXiv:2202.01169)

2 February 2022
Aidan Clark
Diego de Las Casas
Aurelia Guy
A. Mensch
Michela Paganini
Jordan Hoffmann
Bogdan Damoc
Blake A. Hechtman
Trevor Cai
Sebastian Borgeaud
George van den Driessche
Eliza Rutherford
Tom Hennigan
Matthew J. Johnson
Katie Millican
Albin Cassirer
Chris Jones
Elena Buchatskaya
David Budden
Laurent Sifre
Simon Osindero
Oriol Vinyals
Jack W. Rae
Erich Elsen
Koray Kavukcuoglu
Karen Simonyan
    MoE

Papers citing "Unified Scaling Laws for Routed Language Models"

50 / 146 papers shown
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
Tianwen Wei
Bo Zhu
Liang Zhao
Cheng Cheng
Biye Li
...
Yutuan Ma
Rui Hu
Shuicheng Yan
Han Fang
Yahui Zhou
MoE
41
24
0
03 Jun 2024
MoEUT: Mixture-of-Experts Universal Transformers
Róbert Csordás
Kazuki Irie
Jürgen Schmidhuber
Christopher Potts
Christopher D. Manning
MoE
42
5
0
25 May 2024
Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs
Chenxi Sun
Hongzhi Zhang
Zijia Lin
Jingyuan Zhang
Fuzheng Zhang
...
Bin Chen
Chengru Song
Di Zhang
Kun Gai
Deyi Xiong
38
1
0
24 May 2024
Graph Sparsification via Mixture of Graphs
Guibin Zhang
Xiangguo Sun
Yanwei Yue
Kun Wang
Tianlong Chen
Shirui Pan
28
8
0
23 May 2024
Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models
Yongxin Guo
Zhenglin Cheng
Xiaoying Tang
Tao Lin
MoE
53
7
0
23 May 2024
SUTRA: Scalable Multilingual Language Model Architecture
Abhijit Bendale
Michael Sapienza
Steven Ripplinger
Simon Gibbs
Jaewon Lee
Pranav Mistry
LRM
ELM
34
4
0
07 May 2024
Knowledge Distillation vs. Pretraining from Scratch under a Fixed (Computation) Budget
Minh Duc Bui
Fabian David Schmidt
Goran Glavaš
K. Wense
14
0
0
30 Apr 2024
Multi-Head Mixture-of-Experts
Xun Wu
Shaohan Huang
Wenhui Wang
Furu Wei
MoE
28
12
0
23 Apr 2024
From Matching to Generation: A Survey on Generative Information Retrieval
Xiaoxi Li
Jiajie Jin
Yujia Zhou
Yuyao Zhang
Peitian Zhang
Yutao Zhu
Zhicheng Dou
3DV
67
46
0
23 Apr 2024
Toward Inference-optimal Mixture-of-Expert Large Language Models
Longfei Yun
Yonghao Zhuang
Yao Fu
Eric P. Xing
Hao Zhang
MoE
65
6
0
03 Apr 2024
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
Michael Hassid
Tal Remez
Jonas Gehring
Roy Schwartz
Yossi Adi
34
20
0
31 Mar 2024
Jamba: A Hybrid Transformer-Mamba Language Model
Opher Lieber
Barak Lenz
Hofit Bata
Gal Cohen
Jhonathan Osin
...
Nir Ratner
N. Rozen
Erez Shwartz
Mor Zusman
Y. Shoham
26
207
0
28 Mar 2024
Scaling Laws For Dense Retrieval
Yan Fang
Jingtao Zhan
Qingyao Ai
Jiaxin Mao
Weihang Su
Jia Chen
Yiqun Liu
123
8
0
27 Mar 2024
Multi-Task Dense Prediction via Mixture of Low-Rank Experts
Yuqi Yang
Peng-Tao Jiang
Qibin Hou
Hao Zhang
Jinwei Chen
Bo-wen Li
MoE
33
18
0
26 Mar 2024
Understanding Emergent Abilities of Language Models from the Loss Perspective
Zhengxiao Du
Aohan Zeng
Yuxiao Dong
Jie Tang
UQCV
LRM
62
46
0
23 Mar 2024
DiPaCo: Distributed Path Composition
Arthur Douillard
Qixuang Feng
Andrei A. Rusu
A. Kuncoro
Yani Donchev
Rachita Chhaparia
Ionel Gog
Marc'Aurelio Ranzato
Jiajun Shen
Arthur Szlam
MoE
40
2
0
15 Mar 2024
Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts
Byeongjun Park
Hyojun Go
Jin-Young Kim
Sangmin Woo
Seokil Ham
Changick Kim
DiffM
MoE
51
13
0
14 Mar 2024
Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models
Ning Ding
Yulin Chen
Ganqu Cui
Xingtai Lv
Weilin Zhao
Ruobing Xie
Bowen Zhou
Zhiyuan Liu
Maosong Sun
ALM
MoMe
AI4CE
38
7
0
13 Mar 2024
Conditional computation in neural networks: principles and research trends
Simone Scardapane
Alessandro Baiocchi
Alessio Devoto
V. Marsocci
Pasquale Minervini
Jary Pomponi
34
1
0
12 Mar 2024
Unraveling the Mystery of Scaling Laws: Part I
Hui Su
Zhi Tian
Xiaoyu Shen
Xunliang Cai
28
19
0
11 Mar 2024
DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling
Shanghaoran Quan
MoE
OffRL
43
9
0
02 Mar 2024
Towards an empirical understanding of MoE design choices
Dongyang Fan
Bettina Messmer
Martin Jaggi
26
10
0
20 Feb 2024
HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts
Hao Zhao
Zihan Qiu
Huijia Wu
Zili Wang
Zhaofeng He
Jie Fu
MoE
30
9
0
20 Feb 2024
Model Compression and Efficient Inference for Large Language Models: A Survey
Wenxiao Wang
Wei Chen
Yicong Luo
Yongliu Long
Zhengkai Lin
Liye Zhang
Binbin Lin
Deng Cai
Xiaofei He
MQ
36
46
0
15 Feb 2024
Can Large Language Models Learn Independent Causal Mechanisms?
Gael Gendron
Bao Trung Nguyen
A. Peng
Michael Witbrock
Gillian Dobbie
LRM
12
3
0
04 Feb 2024
CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition
Quang-Cuong Pham
Giang Do
Huy Nguyen
TrungTin Nguyen
Chenghao Liu
...
Binh T. Nguyen
Savitha Ramasamy
Xiaoli Li
Steven C. H. Hoi
Nhat Ho
22
17
0
04 Feb 2024
BlackMamba: Mixture of Experts for State-Space Models
Quentin G. Anthony
Yury Tokpanov
Paolo Glorioso
Beren Millidge
20
21
0
01 Feb 2024
Routers in Vision Mixture of Experts: An Empirical Study
Tianlin Liu
Mathieu Blondel
C. Riquelme
J. Puigcerver
MoE
36
3
0
29 Jan 2024
LocMoE: A Low-Overhead MoE for Large Language Model Training
Jing Li
Zhijie Sun
Xuan He
Li Zeng
Yi Lin
Entong Li
Binfan Zheng
Rongqian Zhao
Xin Chen
MoE
30
11
0
25 Jan 2024
Scaling Laws for Forgetting When Fine-Tuning Large Language Models
Damjan Kalajdzievski
CLL
32
8
0
11 Jan 2024
Mixtral of Experts
Albert Q. Jiang
Alexandre Sablayrolles
Antoine Roux
A. Mensch
Blanche Savary
...
Théophile Gervet
Thibaut Lavril
Thomas Wang
Timothée Lacroix
William El Sayed
MoE
LLMAG
16
976
0
08 Jan 2024
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Róbert Csordás
Piotr Piekos
Kazuki Irie
Jürgen Schmidhuber
MoE
14
14
0
13 Dec 2023
LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices
Junchen Zhao
Yurun Song
Simeng Liu
Ian G. Harris
S. Jyothi
16
5
0
01 Dec 2023
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Elias Frantar
Dan Alistarh
MQ
MoE
21
24
0
25 Oct 2023
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Róbert Csordás
Kazuki Irie
Jürgen Schmidhuber
MoE
16
18
0
16 Oct 2023
Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion
Filip Szatkowski
Eric Elmoznino
Younesse Kaddar
Simone Scardapane
MoE
30
5
0
06 Oct 2023
OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch
Juntao Li
Zecheng Tang
Yuyang Ding
Pinzheng Wang
Pei Guo
...
Wenliang Chen
Guohong Fu
Qiaoming Zhu
Guodong Zhou
M. Zhang
40
5
0
19 Sep 2023
Scaling Laws for Sparsely-Connected Foundation Models
Elias Frantar
C. Riquelme
N. Houlsby
Dan Alistarh
Utku Evci
16
34
0
15 Sep 2023
Pretraining on the Test Set Is All You Need
Rylan Schaeffer
13
29
0
13 Sep 2023
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning
Ted Zadouri
A. Ustun
Arash Ahmadian
Beyza Ermiş
Acyr F. Locatelli
Sara Hooker
MoE
30
88
0
11 Sep 2023
LLMCad: Fast and Scalable On-device Large Language Model Inference
Daliang Xu
Wangsong Yin
Xin Jin
Y. Zhang
Shiyun Wei
Mengwei Xu
Xuanzhe Liu
17
43
0
08 Sep 2023
Task-Based MoE for Multitask Multilingual Machine Translation
Hai Pham
Young Jin Kim
Subhabrata Mukherjee
David P. Woodruff
Barnabás Póczós
Hany Awadalla
MoE
28
4
0
30 Aug 2023
Large Language Models for Information Retrieval: A Survey
Yutao Zhu
Huaying Yuan
Shuting Wang
Jiongnan Liu
Wenhan Liu
Chenlong Deng
Haonan Chen
Zhicheng Dou
Ji-Rong Wen
KELM
44
283
0
14 Aug 2023
Experts Weights Averaging: A New General Training Scheme for Vision Transformers
Yongqian Huang
Peng Ye
Xiaoshui Huang
Sheng R. Li
Tao Chen
Tong He
Wanli Ouyang
MoMe
15
8
0
11 Aug 2023
From Sparse to Soft Mixtures of Experts
J. Puigcerver
C. Riquelme
Basil Mustafa
N. Houlsby
MoE
121
114
0
02 Aug 2023
Scaling Laws for Imitation Learning in Single-Agent Games
Jens Tuyls
Dhruv Madeka
Kari Torkkola
Dean Phillips Foster
Karthik Narasimhan
Sham Kakade
24
4
0
18 Jul 2023
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
Pratyush Maini
Sachin Goyal
Zachary Chase Lipton
J. Zico Kolter
Aditi Raghunathan
VLM
29
33
0
06 Jul 2023
Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data
Alycia Lee
Brando Miranda
Sudharsan Sundar
Sanmi Koyejo
32
6
0
24 Jun 2023
Soft Merging of Experts with Adaptive Routing
Mohammed Muqeeth
Haokun Liu
Colin Raffel
MoMe
MoE
24
45
0
06 Jun 2023
COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search
Shibal Ibrahim
Wenyu Chen
Hussein Hazimeh
Natalia Ponomareva
Zhe Zhao
Rahul Mazumder
MoE
19
3
0
05 Jun 2023