GLU Variants Improve Transformer

12 February 2020

Noam M. Shazeer

Papers citing "GLU Variants Improve Transformer"

50 / 904 papers shown

MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections

493

13 Feb 2025

Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM

1.1K

10 Feb 2025

When, Where and Why to Average Weights?

International Conference on Machine Learning (ICML), 2025

559

10 Feb 2025

The Curse of Depth in Large Language Models

406

09 Feb 2025

High-Fidelity Simultaneous Speech-To-Speech Translation

998

05 Feb 2025

FuXi-$\alpha$: Scaling Recommendation Model with Feature Interaction Enhanced Transformer

The Web Conference (WWW), 2025

...

330

05 Feb 2025

Transformers trained on proteins can learn to attend to Euclidean distance

Isaac Ellmen

Constantin Schneider

Matthew I.J. Raybould

Charlotte M. Deane

241

03 Feb 2025

CoddLLM: Empowering Large Language Models for Data Analytics

Asterios Katsifodimos

899

01 Feb 2025

iFormer: Integrating ConvNet and Transformer for Mobile Application

International Conference on Learning Representations (ICLR), 2025

Chuanyang Zheng

ViT

396

26 Jan 2025

A Comprehensive Survey of Foundation Models in Medicine

IEEE Reviews in Biomedical Engineering (RBME), 2024

772

17 Jan 2025

SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

International Conference on Learning Representations (ICLR), 2025

394

12 Jan 2025

Tensor Product Attention Is All You Need

787

11 Jan 2025

EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models

International Symposium on High-Performance Computer Architecture (HPCA), 2025

273

10 Jan 2025

Merging Feed-Forward Sublayers for Compressed Transformers

377

10 Jan 2025

CURing Large Models: Compression via CUR Decomposition

Sanghyeon Park

Soo-Mook Moon

349

08 Jan 2025

SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment

International Conference on Computational Linguistics (COLING), 2025

259

08 Jan 2025

VMamba: Visual State Space Model

Neural Information Processing Systems (NeurIPS), 2024

1.1K

1,554

31 Dec 2024

ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis

James P. Beno

VLM

290

29 Dec 2024

PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System

...

296

28 Dec 2024

Segment-Based Attention Masking for GPTs

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

157

24 Dec 2024

Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

Neural Information Processing Systems (NeurIPS), 2024

318

18 Dec 2024

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

...

457

389

18 Dec 2024

Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

International Conference on Learning Representations (ICLR), 2024

Pengxiang Li

Lu Yin

Shiwei Liu

295

18 Dec 2024

Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture

Jingze Shi

Yiran Peng

282

16 Dec 2024

PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

391

16 Dec 2024

LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Computer Vision and Pattern Recognition (CVPR), 2024

...

426

13 Dec 2024

Code LLMs: A Taxonomy-based Survey

BigData Congress [Services Society] (BSS), 2024

Nishat Raihan

Christian D. Newman

Marcos Zampieri

377

11 Dec 2024

VariFace: Fair and Diverse Synthetic Dataset Generation for Face Recognition

503

09 Dec 2024

GuARD: Effective Anomaly Detection through a Text-Rich and Graph-Informed Language Model

297

05 Dec 2024

AntLM: Bridging Causal and Masked Language Models

331

04 Dec 2024

TruncFormer: Private LLM Inference Using Only Truncations

263

02 Dec 2024

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Computer Vision and Pattern Recognition (CVPR), 2024

392

02 Dec 2024

Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

665

02 Dec 2024

ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain

Ali Shiraee Kasmaee

Mohammad Khodadad

Mohammad Arshi Saloot

1.3K

30 Nov 2024

H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs

363

26 Nov 2024

MH-MoE: Multi-Head Mixture-of-Experts

378

25 Nov 2024

LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions

Computer Vision and Pattern Recognition (CVPR), 2024

Faridoun Mehri

Mahdieh Soleymani Baghshah

Mohammad Taher Pilehvar

296

24 Nov 2024

MolMetaLM: a Physicochemical Knowledge-Guided Molecular Meta Language Model

362

23 Nov 2024

Signformer is all you need: Towards Edge AI for Sign Language

Eta Yang

SLR

311

19 Nov 2024

Selective Attention: Enhancing Transformer through Principled Context Control

Neural Information Processing Systems (NeurIPS), 2024

Xuechen Zhang

Xiangyu Chang

Mingchen Li

Amit K. Roy-Chowdhury

Jiasi Chen

Samet Oymak

260

19 Nov 2024

BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization

BigData Congress [Services Society] (BSS), 2024

Md. Nazmus Sadat Samin

226

16 Nov 2024

Empowering Meta-Analysis: Leveraging Large Language Models for Scientific Synthesis

BigData Congress [Services Society] (BSS), 2024

Jawad Ibn Ahad

Rafeed Mohammad Sultan

197

16 Nov 2024

Xmodel-1.5: An 1B-scale Multilingual LLM

361

15 Nov 2024

Hysteresis Activation Function for Efficient Inference

471

15 Nov 2024

Unraveling the Gradient Descent Dynamics of Transformers

Neural Information Processing Systems (NeurIPS), 2024

322

12 Nov 2024

FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

304

12 Nov 2024

More Expressive Attention with Negative Weights

428

11 Nov 2024

Scaling Laws for Precision

International Conference on Learning Representations (ICLR), 2024

387

07 Nov 2024

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

J.K. Liu

...

484

07 Nov 2024

Character-level Tokenizations as Powerful Inductive Biases for RNA Foundational Models

Adrián Morales-Pastor

Bertran Miquel-Oliver

Álvaro Ciudad

Alexis Molina

AI4CE

253

05 Nov 2024