GLU Variants Improve Transformer

12 February 2020

Noam M. Shazeer

Papers citing "GLU Variants Improve Transformer"

50 / 904 papers shown

The Hidden Attention of Mamba Models

Ameen Ali

Itamar Zimerman

Lior Wolf

Mamba

514

03 Mar 2024

VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks

Chunhua Shen

383

01 Mar 2024

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

George-Christian Muraru

...

David Budden

271

190

29 Feb 2024

RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks

167

29 Feb 2024

On the Challenges and Opportunities in Generative AI

...

758

28 Feb 2024

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

278

324

27 Feb 2024

Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models

347

26 Feb 2024

MambaIR: A Simple Baseline for Image Restoration with State-Space Model

366

518

23 Feb 2024

Position: Categorical Deep Learning is an Algebraic Theory of All Architectures

297

23 Feb 2024

Understanding and Patching Compositional Reasoning in LLMs

Defu Lian

232

22 Feb 2024

Improving Language Understanding from Screenshots

201

21 Feb 2024

ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

Chenyang Song

Xu Han

Zhengyan Zhang

...

Zhiyuan Liu

Maosong Sun

370

21 Feb 2024

Transformer tricks: Precomputing the first layer

Nils Graef

MoE

136

20 Feb 2024

Empirical Study on Updating Key-Value Memories in Transformer Feed-forward Layers

178

19 Feb 2024

Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Qingxiu Dong

Zhifang Sui

181

17 Feb 2024

PointMamba: A Simple State Space Model for Point Cloud Analysis

Dingkang Liang

Xiaoqing Ye

437

199

16 Feb 2024

Towards Privacy-Aware Sign Language Translation at Scale

249

14 Feb 2024

Transformers Can Achieve Length Generalization But Not Robustly

286

14 Feb 2024

Spectral Filters, Dark Signals, and Attention Sinks

Nicola Cancedda

215

14 Feb 2024

PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs

197

13 Feb 2024

Learn To be Efficient: Build Structured Sparsity in Large Language Models

Beidi Chen

280

09 Feb 2024

Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation

329

228

07 Feb 2024

ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs

Zhengyan Zhang

Yixin Song

Guanghui Yu

Xu Han

Yankai Lin

Chaojun Xiao

Chenyang Song

Zhiyuan Liu

Zeyu Mi

Maosong Sun

248

06 Feb 2024

CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning

International Conference on Learning Representations (ICLR), 2024

Ji Qi

...

Bin Xu

Lei Hou

Juanzi Li

Yuxiao Dong

Jie Tang

VLM LRM

243

06 Feb 2024

A Survey on Transformer Compression

463

05 Feb 2024

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

Matteo Pagliardini

Amirkeivan Mohtashami

François Fleuret

Martin Jaggi

251

04 Feb 2024

Unified Training of Universal Time Series Forecasting Transformers

Silvio Savarese

370

389

04 Feb 2024

Leveraging Continuously Differentiable Activation Functions for Learning in Quantized Noisy Environments

Vivswan Shah

Nathan Youngblood

357

04 Feb 2024

Learning Structure-Aware Representations of Dependent Types

Konstantinos Kogkalidis

Orestis Melkonian

Jean-Philippe Bernardy

NAI

185

03 Feb 2024

From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers

410

02 Feb 2024

Nomic Embed: Training a Reproducible Long Context Text Embedder

348

216

02 Feb 2024

Investigating Recurrent Transformers with Dynamic Halt

Jishnu Ray Chowdhury

Cornelia Caragea

538

01 Feb 2024

OLMo: Accelerating the Science of Language Models

Dirk Groeneveld

Iz Beltagy

Pete Walsh

Akshita Bhagia

Rodney Michael Kinney

...

Jesse Dodge

Kyle Lo

Luca Soldaini

Noah A. Smith

Hanna Hajishirzi

OSLM

649

544

01 Feb 2024

BlackMamba: Mixture of Experts for State-Space Models

164

01 Feb 2024

LOCOST: State-Space Models for Long Document Abstractive Summarization

404

31 Jan 2024

Weaver: Foundation Models for Creative Writing

...

Ningyu Zhang

Huajun Chen

Yuchen Eleanor Jiang

Wangchunshu Zhou

259

30 Jan 2024

TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese

266

30 Jan 2024

OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models

International Conference on Machine Learning (ICML), 2024

Yang You

289

155

29 Jan 2024

Baichuan2-Sum: Instruction Finetune Baichuan2-7B Model for Dialogue Summarization

IEEE International Joint Conference on Neural Networks (IJCNN), 2024

289

27 Jan 2024

The Case for Co-Designing Model Architectures with Hardware

International Conference on Parallel Processing (ICPP), 2024

Deepak Narayanan

Dhabaleswar Panda

134

25 Jan 2024

TURNA: A Turkish Encoder-Decoder Language Model for Enhanced Understanding and Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Gökçe Uludoğan

Zeynep Yirmibeşoğlu Balal

211

25 Jan 2024

A Survey of Deep Learning and Foundation Models for Time Series Forecasting

270

25 Jan 2024

SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection

Sanjiv Kumar

194

24 Jan 2024

In-Context Language Learning: Architectures and Algorithms

International Conference on Machine Learning (ICML), 2024

Bailin Wang

388

23 Jan 2024

MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo

International Conference on Learning Representations (ICLR), 2024

Chenjie Cao

Xinlin Ren

Yanwei Fu

239

22 Jan 2024

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

International Conference on Machine Learning (ICML), 2024

Katherine Crowson

Stefan Andreas Baumann

Alex Birch

Tanishq Mathew Abraham

Daniel Z. Kaplan

Enrico Shippole

337

21 Jan 2024

A Study on Training and Developing Large Language Models for Behavior Tree Generation

255

16 Jan 2024

Extreme Compression of Large Language Models via Additive Quantization

International Conference on Machine Learning (ICML), 2024

Denis Kuznedelev

Dan Alistarh

417

149

11 Jan 2024

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

bioRxiv, 2024

...

Yuxiao Dong

246

134

11 Jan 2024

FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference

Zirui Liu

Qingquan Song

Q. Xiao

Sathiya Keerthi Selvaraj

Rahul Mazumder

Aman Gupta

Helen Zhou

166

08 Jan 2024