GLU Variants Improve Transformer

12 February 2020

Noam M. Shazeer


Papers citing "GLU Variants Improve Transformer"

50 / 905 papers shown

Maximum Score Routing For Mixture-of-Experts

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

120

18 Aug 2025

CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems

128

15 Aug 2025

Efficient Patent Searching Using Graph Transformers

145

14 Aug 2025

FuXi-β: Towards a Lightweight and Fast Large-Scale Generative Recommendation Model

149

14 Aug 2025

MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks

116

11 Aug 2025

AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning

148

09 Aug 2025

gpt-oss-120b & gpt-oss-20b Model Card

...

137

282

08 Aug 2025

MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

260

08 Aug 2025

MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

128

07 Aug 2025

Channel-Wise MLPs Improve the Generalization of Recurrent Convolutional Networks

Nathan Breslow

AI4CE

06 Aug 2025

Markov Chain Estimation with In-Context Learning

Simon Lepage

Jérémie Mary

David Picard

110

05 Aug 2025

H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction

190

05 Aug 2025

Parameter-Efficient Routed Fine-Tuning: Mixture-of-Experts Demands Mixture of Adaptation Modules

124

04 Aug 2025

LOST: Low-rank and Sparse Pre-training for Large Language Models

155

04 Aug 2025

Learning Dynamics of Meta-Learning in Small Model Pretraining

David Demitri Africa

Yuval Weiss

P. Buttery

Richard Diehl Martinez

AI4CE

214

04 Aug 2025

MHARFedLLM: Multimodal Human Activity Recognition Using Federated Large Language Model

119

03 Aug 2025

ChEmbed: Enhancing Chemical Literature Search Through Domain-Specific Text Embeddings

Ali Shiraee Kasmaee

Mohammad Khodadad

Mehdi Astaraki

Mohammad Arshi Saloot

Nicholas Sherck

H. Mahyar

Soheila Samiee

158

03 Aug 2025

Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler

208

02 Aug 2025

On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective

Gabriel Mongaras

Eric C. Larson

113

31 Jul 2025

GovRelBench: A Benchmark for Government Domain Relevance

188

29 Jul 2025

Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs

201

25 Jul 2025

DIFFA: Large Language Diffusion Models Can Listen and Understand

...

226

24 Jul 2025

Adaptive Neural Quantum States: A Recurrent Neural Network Perspective

Jake McNaughton

Mohamed Hibat-Allah

24 Jul 2025

GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures

Jake R. Patock

Nicole Catherine Lewis

144

24 Jul 2025

Technical Report of TeleChat2, TeleChat2.5 and T1

...

428

24 Jul 2025

The Early Bird Identifies the Worm: You Can't Beat a Head Start in Long-Term Body Re-ID (ECHO-BID)

Thomas M. Metz

Matthew Q. Hill

A. O’Toole

241

23 Jul 2025

Scaling Linear Attention with Sparse State Expansion

298

22 Jul 2025

Supernova: Achieving More with Less in Transformer Architectures

Andrei-Valentin Tanase

Elena Pelican

164

21 Jul 2025

Diffusion Beats Autoregressive in Data-Constrained Settings

343

21 Jul 2025

Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts

...

219

21 Jul 2025

Latent Denoising Makes Good Visual Tokenizers

196

21 Jul 2025

Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

293

14 Jul 2025

Scaling Laws for Optimal Data Mixtures

210

12 Jul 2025

BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

263

11 Jul 2025

Memory Mosaics at scale

Jianyu Zhang

Léon Bottou

CLL

344

04 Jul 2025

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

Lauren Hyoseo Yoon

Yisong Yue

Been Kim

380

01 Jul 2025

FlatFusion: Delving into Details of Sparse Transformer-based Camera-LiDAR Fusion for Autonomous Driving

276

01 Jul 2025

Hierarchical Reasoning Model

519

26 Jun 2025

SKOLR: Structured Koopman Operator Linear RNN for Time-Series Forecasting

263

17 Jun 2025

Self-supervised Representation Learning with Local Aggregation for Image-based Profiling

306

17 Jun 2025

Load Balancing Mixture of Experts with Similarity Preserving Routers

287

16 Jun 2025

Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

Tuan Nguyen

Long-Vu Hoang

Huy-Dat Tran

224

16 Jun 2025

GTA: Grouped-head latenT Attention

177

15 Jun 2025

Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling

203

14 Jun 2025

BSA: Ball Sparse Attention for Large-scale Geometries

Catalin E. Brita

Hieu Nguyen

Lohithsai Yadala Chanchu

Domonkos Nagy

Maksim Zhdanov

215

14 Jun 2025

One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

Diana Abagyan

Alejandro Salamanca

Andres Felipe Cruz-Salinas

376

12 Jun 2025

DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

210

11 Jun 2025

ABC-FHE : A Resource-Efficient Accelerator Enabling Bootstrappable Parameters for Client-Side Fully Homomorphic Encryption

Design Automation Conference (DAC), 2025

325

10 Jun 2025

CausalPFN: Amortized Causal Effect Estimation via In-Context Learning

201

09 Jun 2025

Learning Distribution-Wise Control in Representation Space for Language Models

Chunyuan Deng

Ruidi Chang

Hanjie Chen

274

07 Jun 2025