GLU Variants Improve Transformer

12 February 2020

Noam M. Shazeer


Papers citing "GLU Variants Improve Transformer"

50 / 904 papers shown

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

Kai Li

Kejun Gao

Xiaolin Hu

28 Sep 2025

QuadEnhancer: Leveraging Quadratic Transformations to Enhance Deep Neural Networks

124

28 Sep 2025

Disentangling Score Content and Performance Style for Joint Piano Rendering and Transcription

Wei Zeng

Junchuan Zhao

Ye Wang

124

28 Sep 2025

Impute-MACFM: Imputation based on Mask-Aware Flow Matching

Dengyi Liu

Honggang Wang

Hua Fang

142

27 Sep 2025

Stochastic activations

Pierre-Emmanuel Mazaré

Hervé Jégou

LLMSV

264

26 Sep 2025

IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method

178

26 Sep 2025

Compute-Optimal Quantization-Aware Training

Aleksandr Dremov

David Grangier

Angelos Katharopoulos

Awni Y. Hannun

128

26 Sep 2025

Real-Time Object Detection Meets DINOv3

364

25 Sep 2025

GZSL-MoE: Apprentissage Généralisé Zéro-Shot basé sur le Mélange d'Experts pour la Segmentation Sémantique de Nuages de Points 3D Appliqué à un Jeu de Données d'Environnement de Collaboration Humain-Robot

Ahed Alboody

23 Sep 2025

SimpleFold: Folding Proteins is Simpler than You Think

Miguel Angel Bautista

270

23 Sep 2025

Understanding Post-Training Structural Changes in Large Language Models

Xinyu He

Xianghui Cao

158

22 Sep 2025

Training-free Truthfulness Detection via Value Vectors in LLMs

22 Sep 2025

Rethinking the Role of Text Complexity in Language Model Pretraining

Dan John Velasco

M. R

211

20 Sep 2025

Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research

Richard Diehl Martinez

140

19 Sep 2025

Neural Speech Separation with Parallel Amplitude and Phase Spectrum Estimation

Fei Liu

Yang Ai

Zhen-Hua Ling

113

17 Sep 2025

NIRVANA: Structured pruning reimagined for large language models compression

1.6K

17 Sep 2025

Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models

Yuval Weiss

David Demitri Africa

P. Buttery

Richard Diehl Martinez

262

16 Sep 2025

MFAF: An EVA02-Based Multi-scale Frequency Attention Fusion Method for Cross-View Geo-Localization

YiTong Liu

Tianzhu Liu

Yanfeng Gu

130

16 Sep 2025

AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions

...

153

16 Sep 2025

Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation

Mohanad Albughdadi

MoE

13 Sep 2025

ENSI: Efficient Non-Interactive Secure Inference for Large Language Models

120

11 Sep 2025

ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms

192

11 Sep 2025

Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison

165

10 Sep 2025

Practice on Long Behavior Sequence Modeling in Tencent Advertising

...

10 Sep 2025

When FinTech Meets Privacy: Securing Financial LLMs with Differential Private Fine-Tuning

135

10 Sep 2025

Causal Attention with Lookahead Keys

189

09 Sep 2025

ALICE: An Interpretable Neural Architecture for Generalization in Substitution Ciphers

Jeff Shen

Lindsay Smith

AI4CE

156

08 Sep 2025

RL Fine-Tuning Heals OOD Forgetting in SFT

179

08 Sep 2025

CURE: Controlled Unlearning for Robust Embeddings - Mitigating Conceptual Shortcuts in Pre-Trained Language Models

102

05 Sep 2025

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Deniz Bayazit

Aaron Mueller

Antoine Bosselut

140

05 Sep 2025

FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

162

05 Sep 2025

Elucidating the Design Space of Decay in Linear Attention

Zhen Qin

Xuyang Shen

Yiran Zhong

100

05 Sep 2025

Multi-level SSL Feature Gating for Audio Deepfake Detection

Pierre-François Marteau

David Guennec

132

03 Sep 2025

Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens

151

03 Sep 2025

Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages

Richard Diehl Martinez

164

02 Sep 2025

Preserving Bilinear Weight Spectra with a Signed and Shrunk Quadratic Activation Function

Jason Abohwo

Thomas Mosen

02 Sep 2025

LLM Encoder vs. Decoder: Robust Detection of Chinese AI-Generated Text with LoRA

31 Aug 2025

Universal Properties of Activation Sparsity in Modern Large Language Models

153

30 Aug 2025

Mechanistic interpretability for steering vision-language-action models

156

30 Aug 2025

QZhou-Embedding Technical Report

29 Aug 2025

Provable Benefits of In-Tool Learning for Large Language Models

152

28 Aug 2025

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

175

26 Aug 2025

UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

...

173

26 Aug 2025

Training Transformers for Mesh-Based Simulations

25 Aug 2025

Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models

25 Aug 2025

DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction

116

25 Aug 2025

Exploring Scaling Laws of CTR Model for Online Performance Improvement

ACM Conference on Recommender Systems (RecSys), 2025

178

21 Aug 2025

Generative AI models capture realistic sea-ice evolution from days to decades

133

20 Aug 2025

Maximum Score Routing For Mixture-of-Experts

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

114

18 Aug 2025

CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems

119

15 Aug 2025