
GLU Variants Improve Transformer

Noam M. Shazeer · 12 February 2020 · arXiv:2002.05202

Papers citing "GLU Variants Improve Transformer"

Showing 50 of 904 citing papers.
  • dots.llm1 Technical Report [MoE] · 06 Jun 2025
    Bi Huo, Bin Tu, Cheng Qin, Da Zheng, Debing Zhang, ..., Yuqiu Ji, Ze Wen, Zhenhai Liu, Zichao Li, Zilong Liao
  • Scaling Transformers for Discriminative Recommendation via Generative Pretraining · 04 Jun 2025
    Chunqi Wang, Bingchao Wu, Z. Chen, Lei Shen, Bing Wang, Xiaoyi Zeng
  • Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights [MoE, ALM] · 03 Jun 2025
    Jakub Krajewski, Marcin Chochowski, Daniel Korzekwa
  • How Programming Concepts and Neurons Are Shared in Code Language Models (ACL 2025) · 01 Jun 2025
    Amir Hossein Kargaran, Yihong Liu, François Yvon, Hinrich Schütze
  • Equivalent Linear Mappings of Large Language Models · 30 May 2025
    James R. Golden
  • Differential Gated Self-Attention · 29 May 2025
    Elpiniki Maria Lygizou, Mónika Farsang, Radu Grosu
  • Continuous Chain of Thought Enables Parallel Exploration and Reasoning [LRM] · 29 May 2025
    Halil Alperen Gozeten, M. E. Ildiz, Xuechen Zhang, Hrayr Harutyunyan, A. S. Rawat, Samet Oymak
  • Exploring Scaling Laws for EHR Foundation Models · 29 May 2025
    Sheng Zhang, Qin Liu, Naoto Usuyama, Cliff Wong, Tristan Naumann, Hoifung Poon
  • RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination · 28 May 2025
    Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong
  • Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems · 28 May 2025
    Christopher Ormerod
  • FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference [VLM] · 28 May 2025
    Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, Yoon Kim
  • HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer [VLM] · 28 May 2025
    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, ..., Rui Tian, Siyu Wang, Bo Zhao, Ting Yao, Tao Mei
  • Learning in Compact Spaces with Approximately Normalized Transformer · 28 May 2025
    Jörg Franke, Urs Spiegelhalter, Marianna Nezhurina, J. Jitsev, Katharina Eggensperger, Michael Hefenbrock
  • In Search of Adam's Secret Sauce · 27 May 2025
    Antonio Orvieto, Robert Gower
  • How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective · 27 May 2025
    Shimao Zhang, Z. Lai, Xiang Liu, Shuaijie She, Xiao Liu, Yeyun Gong, Shujian Huang, Jiajun Chen
  • Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders [MoE] · 27 May 2025
    James Oldfield, Shawn Im, Yixuan Li, M. Nicolaou, Ioannis Patras, Grigorios G. Chrysos
  • LlamaSeg: Image Segmentation via Autoregressive Mask Generation [VLM] · 26 May 2025
    Jiru Deng, Tengjin Weng, Tianyu Yang, Tong Lu, Zhiheng Li, Wenhao Jiang
  • Understanding Transformer from the Perspective of Associative Memory · 26 May 2025
    Shu Zhong, Mingyu Xu, Tenglong Ao, Guang Shi
  • Burst Image Super-Resolution via Multi-Cross Attention Encoding and Multi-Scan State-Space Decoding (Image and Vision Computing, 2025) [SupR] · 26 May 2025
    Tengda Huang, Yu Zhang, Tianren Li, Yufu Qu, Fulin Liu, Zhenzhong Wei
  • Towards Fully FP8 GEMM LLM Training at Scale [MQ] · 26 May 2025
    Alejandro Hernández Cano, Dhia Garbaya, Imanol Schlag, Martin Jaggi
  • LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models · 25 May 2025
    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, ..., Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li
  • Why Do Some Inputs Break Low-Bit LLM Quantization? [MQ] · 24 May 2025
    Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
  • How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation · 24 May 2025
    Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
  • COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection · 23 May 2025
    Jaewon Cheon, Pilsung Kang
  • PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training · 23 May 2025
    Matan Haroush, Daniel Soudry
  • QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design [VLM] · 22 May 2025
    Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen
  • Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs [MoMe, KELM, CLL] · 22 May 2025
    Zeping Yu, Sophia Ananiadou
  • LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning [MLLM, VLM] · 22 May 2025
    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, J. Wen, Chongxuan Li
  • Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN [LRM] · 22 May 2025
    Yao Xu, Mingyu Xu, Fangyu Lei, Wangtao Sun, Xiangrong Zeng, Bingning Wang, Guang Liu, Shizhu He, Jun Zhao, Kang Liu
  • MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation [SSeg] · 21 May 2025
    Maike Behrendt, Stefan Sylvius Wagner, Stefan Harmeling
  • BanglaByT5: Byte-Level Modelling for Bangla · 21 May 2025
    Pramit Bhattacharyya, Arnab Bhattacharya
  • Guarded Query Routing for Large Language Models [RALM] · 20 May 2025
    Richard Šléher, William Brach, Tibor Sloboda, Kristián Košťál, Lukas Galke
  • This Time is Different: An Observability Perspective on Time Series Foundation Models [AI4TS, AI4CE] · 20 May 2025
    Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, ..., Zongzhe Xu, Viktoriya Zhukova, David Asker, Ameet Talwalkar, Othmane Abou-Amal
  • Scaling Law for Quantization-Aware Training [MQ] · 20 May 2025
    Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, ..., Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo
  • Output Scaling: YingLong-Delayed Chain of Thought in a Large Pretrained Time Series Forecasting Model [AI4TS, AI4CE, LRM] · 20 May 2025
    Qingsong Wen, Tian Zhou, Jinyang Gao, Bolin Ding, Jingren Zhou
  • Systematic Generalization in Language Models Scales with Information Entropy (ACL 2025) · 19 May 2025
    Sondre Wold, Lucas Georges Gabriel Charpentier, Étienne Simon
  • A3: An Analytical Low-Rank Approximation Framework for Attention [OffRL, MQ] · 19 May 2025
    Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao
  • Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training · 19 May 2025
    Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
  • Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space · 19 May 2025
    Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang
  • PiT: Progressive Diffusion Transformer · 19 May 2025
    Jiafu Wu, Yabiao Wang, Jian Li, Jinlong Peng, Yun Cao, Chengjie Wang, Jiangning Zhang
  • Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation (SIGIR 2025) · 17 May 2025
    Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
  • Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model [ALM] · 17 May 2025
    Shen Li, Renfen Hu, Lijun Wang
  • Chain-of-Model Learning for Language Model [LRM, AI4CE] · 17 May 2025
    Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, ..., Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu
  • MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production [MoE] · 16 May 2025
    Cheng Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Jing Liu, ..., Yanghua Peng, Xuanzhe Liu, Xin Jin, Xin Liu
  • Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput [VLM] · 14 May 2025
    Bo Zhang, Shuo Li, Runhe Tian, Yang Yang, Jixin Tang, Jinhao Zhou, Lin Ma
  • Large Language Models for Computer-Aided Design: A Survey [3DV, AI4CE] · 13 May 2025
    Licheng Zhang, Bach Le, Naveed Akhtar, Siew-Kei Lam, Tuan Ngo
  • DELPHYNE: A Pre-Trained Model for General and Financial Time Series [AI4TS] · 12 May 2025
    Xueying Ding, Aakriti Mittal, Achintya Gopal
  • Circuit Partitioning Using Large Language Models for Quantum Compilation and Simulations · 12 May 2025
    Pranav Sinha, Sumit Kumar Jha, Sunny Raj
  • Comet: Accelerating Private Inference for Large Language Model by Predicting Activation Sparsity (IEEE S&P 2025) · 12 May 2025
    Guang Yan, Yuhui Zhang, Zimu Guo, Lutan Zhao, Xiaojun Chen, Chen Wang, Wenhao Wang, Dan Meng, Rui Hou
  • Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [MoE] · 10 May 2025
    Zihan Qiu, Zhaoxiang Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, ..., Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin