arXiv: 2002.05202
GLU Variants Improve Transformer

12 February 2020
Noam M. Shazeer

Papers citing "GLU Variants Improve Transformer"

50 / 904 papers shown
Rethinking Performance Gains in Image Dehazing Networks
Yuda Song
Yang Zhou
Hui Qian
Xin Du
SSeg
169
70
0
23 Sep 2022
Automatic Label Sequence Generation for Prompting Sequence-to-sequence Models
International Conference on Computational Linguistics (COLING), 2022
Zichun Yu
Tianyu Gao
Zhengyan Zhang
Yankai Lin
Zhiyuan Liu
Maosong Sun
Jie Zhou
VLM, LRM
115
2
0
20 Sep 2022
MUST-VQA: MUltilingual Scene-text VQA
Emanuele Vivoli
Ali Furkan Biten
Andrés Mafla
Dimosthenis Karatzas
Lluís Gómez
248
8
0
14 Sep 2022
Transformers with Learnable Activation Functions
Findings (Findings), 2022
Haishuo Fang
Ji-Ung Lee
N. Moosavi
Iryna Gurevych
AI4CE
274
11
0
30 Aug 2022
Multiple Instance Neuroimage Transformer
Ayush Singla
Qingyu Zhao
Daniel K. Do
Yuyin Zhou
K. Pohl
Ehsan Adeli
ViT, MedIm
166
12
0
19 Aug 2022
MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth
Chenjie Cao
Xinlin Ren
Yanwei Fu
402
81
0
04 Aug 2022
giMLPs: Gate with Inhibition Mechanism in MLPs
Cheng Kang
Jindřich Prokop
Lei Tong
Huiyu Zhou
Yong Hu
Daniel Novak
163
0
0
01 Aug 2022
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yi Tay
Mostafa Dehghani
Samira Abnar
Hyung Won Chung
W. Fedus
J. Rao
Sharan Narang
Vinh Q. Tran
Dani Yogatama
Donald Metzler
AI4CE
248
121
0
21 Jul 2022
Long Range Language Modeling via Gated State Spaces
International Conference on Learning Representations (ICLR), 2022
Harsh Mehta
Ankit Gupta
Ashok Cutkosky
Behnam Neyshabur
Mamba
527
332
0
27 Jun 2022
On the Parameterization and Initialization of Diagonal State Space Models
Neural Information Processing Systems (NeurIPS), 2022
Albert Gu
Ankit Gupta
Karan Goel
Christopher Ré
413
473
0
23 Jun 2022
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Neural Information Processing Systems (NeurIPS), 2022
Linxi Fan
Guanzhi Wang
Yunfan Jiang
Ajay Mandlekar
Yuncong Yang
Haoyi Zhu
Andrew Tang
De-An Huang
Yuke Zhu
Anima Anandkumar
LM&Ro
498
495
0
17 Jun 2022
Rank Diminishing in Deep Neural Networks
Neural Information Processing Systems (NeurIPS), 2022
Ruili Feng
Kecheng Zheng
Yukun Huang
Deli Zhao
Michael I. Jordan
Zhengjun Zha
227
38
0
13 Jun 2022
Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
James Lee-Thorp
Joshua Ainslie
MoE
220
13
0
24 May 2022
BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla
Findings (Findings), 2022
Abhik Bhattacharjee
Tahmid Hasan
Wasi Uddin Ahmad
Rifat Shahriyar
AIMat, LM&MA
358
44
0
23 May 2022
Life after BERT: What do Other Muppets Understand about Language?
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Vladislav Lialin
Kevin Zhao
Namrata Shivagunde
Anna Rumshisky
363
7
0
21 May 2022
A Generalist Agent
Scott E. Reed
Konrad Zolna
Emilio Parisotto
Sergio Gomez Colmenarejo
Alexander Novikov
...
Yutian Chen
R. Hadsell
Oriol Vinyals
Mahyar Bordbar
Nando de Freitas
LM&Ro, LLMAG, AI4CE
444
976
0
12 May 2022
Supplementary Material: Implementation and Experiments for GAU-based Model
Zhenjie Liu
125
0
0
12 May 2022
UL2: Unifying Language Learning Paradigms
International Conference on Learning Representations (ICLR), 2022
Yi Tay
Mostafa Dehghani
Vinh Q. Tran
Xavier Garcia
Jason W. Wei
...
Tal Schuster
H. Zheng
Denny Zhou
N. Houlsby
Donald Metzler
AI4CE
566
359
0
10 May 2022
Boosting Adversarial Transferability of MLP-Mixer
Haoran Lyu
Yajie Wang
Yu-an Tan
Huipeng Zhou
Yuhang Zhao
Quan-xin Zhang
AAML
180
1
0
26 Apr 2022
What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
International Conference on Machine Learning (ICML), 2022
Thomas Wang
Adam Roberts
Daniel Hesslow
Teven Le Scao
Hyung Won Chung
Iz Beltagy
Julien Launay
Colin Raffel
285
215
0
12 Apr 2022
Simple Baselines for Image Restoration
European Conference on Computer Vision (ECCV), 2022
Liangyu Chen
Xiaojie Chu
Xinming Zhang
Jian Sun
916
1,241
0
10 Apr 2022
ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation
Computer Vision and Pattern Recognition (CVPR), 2022
Jianan Wang
Guansong Lu
Hang Xu
Zhenguo Li
Chunjing Xu
Yanwei Fu
246
18
0
09 Apr 2022
PaLM: Scaling Language Modeling with Pathways
Journal of Machine Learning Research (JMLR), 2022
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM, LRM
1.2K
7,457
0
05 Apr 2022
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Mor Geva
Avi Caciularu
Ke Wang
Yoav Goldberg
KELM
621
462
0
28 Mar 2022
Error Correction Code Transformer
Neural Information Processing Systems (NeurIPS), 2022
Yoni Choukroun
Lior Wolf
214
80
0
27 Mar 2022
Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions
Konstantinos Kogkalidis
M. Moortgat
219
10
0
23 Mar 2022
IT5: Text-to-text Pretraining for Italian Language Understanding and Generation
International Conference on Language Resources and Evaluation (LREC), 2022
Gabriele Sarti
Malvina Nissim
AILaw
246
51
0
07 Mar 2022
TransKD: Transformer Knowledge Distillation for Efficient Semantic Segmentation
R. Liu
Kailun Yang
Alina Roitberg
Kailai Li
Kunyu Peng
Huayao Liu
Yaonan Wang
Rainer Stiefelhagen
ViT
273
55
0
27 Feb 2022
Transformer Quality in Linear Time
International Conference on Machine Learning (ICML), 2022
Weizhe Hua
Zihang Dai
Hanxiao Liu
Quoc V. Le
478
299
0
21 Feb 2022
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Barret Zoph
Irwan Bello
Sameer Kumar
Nan Du
Yanping Huang
J. Dean
Noam M. Shazeer
W. Fedus
MoE
422
298
0
17 Feb 2022
VRT: A Video Restoration Transformer
IEEE Transactions on Image Processing (IEEE TIP), 2022
Christos Sakaridis
Jingyun Liang
Yuchen Fan
Lucas Beerens
Rakesh Ranjan
Yawei Li
Radu Timofte
Luc Van Gool
ViT
363
339
0
28 Jan 2022
LaMDA: Language Models for Dialog Applications
R. Thoppilan
Daniel De Freitas
Jamie Hall
Noam M. Shazeer
Apoorv Kulshreshtha
...
Blaise Aguera-Arcas
Claire Cui
M. Croak
Ed H. Chi
Quoc Le
ALM
379
1,784
0
20 Jan 2022
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Nan Du
Yanping Huang
Andrew M. Dai
Simon Tong
Dmitry Lepikhin
...
Kun Zhang
Quoc V. Le
Yonghui Wu
Zhiwen Chen
Claire Cui
ALM, MoE
694
1,056
0
13 Dec 2021
ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
V. Aribandi
Yi Tay
Tal Schuster
J. Rao
H. Zheng
...
Jianmo Ni
Jai Gupta
Kai Hui
Sebastian Ruder
Donald Metzler
MoE
307
230
0
22 Nov 2021
A Multi-attribute Controllable Generative Model for Histopathology Image Synthesis
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2021
Jiarong Ye
Yuan Xue
Peter Liu
R. Zaino
K. Cheng
Xiaolei Huang
MedIm
116
10
0
10 Nov 2021
Geometric Transformer for End-to-End Molecule Properties Prediction
Yoni Choukroun
Lior Wolf
AI4CE, ViT
247
19
0
26 Oct 2021
NormFormer: Improved Transformer Pretraining with Extra Normalization
Sam Shleifer
Jason Weston
Myle Ott
AI4CE
275
86
0
18 Oct 2021
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization
Róbert Csordás
Kazuki Irie
Jürgen Schmidhuber
AI4CE
409
64
0
14 Oct 2021
Primer: Searching for Efficient Transformers for Language Modeling
David R. So
Wojciech Mańke
Hanxiao Liu
Zihang Dai
Noam M. Shazeer
Quoc V. Le
VLM
401
184
0
17 Sep 2021
SANSformers: Self-Supervised Forecasting in Electronic Health Records with Attention-Free Models
Yogesh Kumar
Alexander Ilin
H. Salo
S. Kulathinal
M. Leinonen
Pekka Marttinen
AI4TS, MedIm
304
0
0
31 Aug 2021
Sequence-to-Sequence Piano Transcription with Transformers
International Society for Music Information Retrieval Conference (ISMIR), 2021
Curtis Hawthorne
Ian Simon
Rigel Swavely
Ethan Manilow
Jesse Engel
331
98
0
19 Jul 2021
MedGPT: Medical Concept Prediction from Clinical Narratives
Z. Kraljevic
Anthony Shek
D. Bean
R. Bendayan
J. Teo
Richard J. B. Dobson
LM&MA, AI4TS, MedIm
202
48
0
07 Jul 2021
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
Yi Tay
Vinh Q. Tran
Sebastian Ruder
Jai Gupta
Hyung Won Chung
Dara Bahri
Zhen Qin
Simon Baumgartner
Cong Yu
Donald Metzler
342
187
0
23 Jun 2021
Revisiting Deep Learning Models for Tabular Data
Neural Information Processing Systems (NeurIPS), 2021
Yu. V. Gorishniy
Ivan Rubachev
Valentin Khrulkov
Artem Babenko
LMTD
523
1,069
0
22 Jun 2021
Distributed Deep Learning in Open Collaborations
Neural Information Processing Systems (NeurIPS), 2021
Michael Diskin
Alexey Bukhtiyarov
Max Ryabinin
Lucile Saulnier
Quentin Lhoest
...
Denis Mazur
Ilia Kobelev
Yacine Jernite
Thomas Wolf
Gennady Pekhimenko
FedML
278
73
0
18 Jun 2021
Memory-efficient Transformers via Top-$k$ Attention
Ankit Gupta
Guy Dar
Shaya Goodman
David Ciprut
Jonathan Berant
MQ
245
77
0
13 Jun 2021
A Survey of Transformers
AI Open (AO), 2021
Tianyang Lin
Yuxin Wang
Xiangyang Liu
Xipeng Qiu
ViT
441
1,380
0
08 Jun 2021
Pay Attention to MLPs
Neural Information Processing Systems (NeurIPS), 2021
Hanxiao Liu
Zihang Dai
David R. So
Quoc V. Le
AI4CE
574
796
0
17 May 2021
The Power of Scale for Parameter-Efficient Prompt Tuning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Brian Lester
Rami Al-Rfou
Noah Constant
VPVLM
1.4K
4,984
0
18 Apr 2021
Do Transformer Modifications Transfer Across Implementations and Applications?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Sharan Narang
Hyung Won Chung
Yi Tay
W. Fedus
Thibault Févry
...
Wei Li
Nan Ding
Jake Marcus
Adam Roberts
Colin Raffel
215
134
0
23 Feb 2021