GLU Variants Improve Transformer

12 February 2020
Noam M. Shazeer
arXiv: 2002.05202 (HuggingFace: 4 upvotes)

Papers citing "GLU Variants Improve Transformer"

50 / 905 papers shown
Maximum Score Routing For Mixture-of-Experts
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Bowen Dong
Yilong Fan
Yutao Sun
Zhenyu Li
Tengyu Pan
Xun Zhou
Jianyong Wang
MoE
120
2
0
18 Aug 2025
CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems
Xuran Liu
Nan Xue
Rui Bao
Yaping Sun
Zhiyong Chen
Meixia Tao
Xiaodong Xu
Shuguang Cui
128
0
0
15 Aug 2025
Efficient Patent Searching Using Graph Transformers
Krzysztof Daniell
Igor Buzhinsky
Sebastian Björkqvist
MedIm
145
1
0
14 Aug 2025
FuXi-β: Towards a Lightweight and Fast Large-Scale Generative Recommendation Model
Yufei Ye
Wei Guo
Hao Wang
Hong Zhu
Yuyang Ye
Yong Liu
Huifeng Guo
Ruiming Tang
Defu Lian
Tong Xu
149
2
0
14 Aug 2025
MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks
Yushen Xu
Xiaosong Li
Zhenyu Kuang
Xiaoqi Cheng
Haishu Tan
Huafeng Li
116
0
0
11 Aug 2025
AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning
Shihao Yuan
Yahui Liu
Yang Yue
Jingyuan Zhang
Wangmeng Zuo
Qi Wang
Fuzheng Zhang
Guorui Zhou
EGVM, VLM
148
11
0
09 Aug 2025
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI
Sandhini Agarwal
Lama Ahmad
Jason Ai
Sam Altman
...
D. Sculley
Harshit Sikchi
Kendal Simon
K. Singhal
Yang Song
LRM, VLM
137
282
0
08 Aug 2025
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
Xiquan Li
Junxi Liu
Yuzhe Liang
Zhikang Niu
Wenxi Chen
Xie Chen
260
2
0
08 Aug 2025
MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
Xiaodong Chen
Mingming Ha
Zhenzhong Lan
Jing Zhang
Jianguo Li
MoE
128
1
0
07 Aug 2025
Channel-Wise MLPs Improve the Generalization of Recurrent Convolutional Networks
Nathan Breslow
AI4CE
68
0
0
06 Aug 2025
Markov Chain Estimation with In-Context Learning
Simon Lepage
Jérémie Mary
David Picard
110
1
0
05 Aug 2025
H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction
Heng Jia
Linchao Zhu
Na Zhao
3DGS
190
0
0
05 Aug 2025
Parameter-Efficient Routed Fine-Tuning: Mixture-of-Experts Demands Mixture of Adaptation Modules
Yilun Liu
Yunpu Ma
Yuetian Lu
Shuo Chen
Zifeng Ding
Volker Tresp
MoE
124
0
0
04 Aug 2025
LOST: Low-rank and Sparse Pre-training for Large Language Models
Jiaxi Li
Lu Yin
Li Shen
Jinjin Xu
Liwu Xu
Tianjin Huang
Wenwu Wang
Shiwei Liu
Xilu Wang
155
2
0
04 Aug 2025
Learning Dynamics of Meta-Learning in Small Model Pretraining
David Demitri Africa
Yuval Weiss
P. Buttery
Richard Diehl Martinez
AI4CE
214
2
0
04 Aug 2025
MHARFedLLM: Multimodal Human Activity Recognition Using Federated Large Language Model
Asmit Bandyopadhyay
Rohit Basu
Tanmay Sen
Swagatam Das
AI4CE
119
1
0
03 Aug 2025
ChEmbed: Enhancing Chemical Literature Search Through Domain-Specific Text Embeddings
Ali Shiraee Kasmaee
Mohammad Khodadad
Mehdi Astaraki
Mohammad Arshi Saloot
Nicholas Sherck
H. Mahyar
Soheila Samiee
158
3
0
03 Aug 2025
Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler
Aleksandr Dremov
Alexander Hägele
Atli Kosson
Martin Jaggi
208
4
0
02 Aug 2025
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
Gabriel Mongaras
Eric C. Larson
113
1
0
31 Jul 2025
GovRelBench: A Benchmark for Government Domain Relevance
Haiquan Wang
Yi Chen
Shang Zeng
Yun Bian
Zhe Cui
188
0
0
29 Jul 2025
Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs
Shuo Yang
Zheyu Zhang
Bardh Prenkaj
Gjergji Kasneci
201
4
0
25 Jul 2025
DIFFA: Large Language Diffusion Models Can Listen and Understand
Jiaming Zhou
Hongjie Chen
Shiwan Zhao
Jian Kang
Jie Li
...
Haoqin Sun
Hui Wang
Aobo Kong
Yong Qin
X. Li
226
3
0
24 Jul 2025
Adaptive Neural Quantum States: A Recurrent Neural Network Perspective
Jake McNaughton
Mohamed Hibat-Allah
83
0
0
24 Jul 2025
GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures
Jake R. Patock
Nicole Catherine Lewis
Kevin McCoy
Christina Gomez
Canling Chen
Lorenzo Luzi
VLM
144
0
0
24 Jul 2025
Technical Report of TeleChat2, TeleChat2.5 and T1
Zihan Wang
Xinzhang Liu
Yitong Yao
Chao Wang
Yu Zhao
...
Bingkai Yang
Shuangyong Song
Yongxiang Li
Zhongjiang He
Xuelong Li
AI4TS, LRM
428
6
0
24 Jul 2025
The Early Bird Identifies the Worm: You Can't Beat a Head Start in Long-Term Body Re-ID (ECHO-BID)
Thomas M. Metz
Matthew Q. Hill
A. O’Toole
241
1
0
23 Jul 2025
Scaling Linear Attention with Sparse State Expansion
Yuqi Pan
Yongqi An
Zheng Li
Yuhong Chou
Ruijie Zhu
Xiaohui Wang
Mingxuan Wang
Jinqiao Wang
Guoqi Li
298
0
0
22 Jul 2025
Supernova: Achieving More with Less in Transformer Architectures
Andrei-Valentin Tanase
Elena Pelican
164
0
0
21 Jul 2025
Diffusion Beats Autoregressive in Data-Constrained Settings
Mihir Prabhudesai
Mengning Wu
Amir Zadeh
Katerina Fragkiadaki
Deepak Pathak
DiffM
343
24
0
21 Jul 2025
Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts
Sungmin Yun
Seonyong Park
Hwayong Nam
Younjoo Lee
Gunjun Lee
...
Jongmin Kim
Hyungyo Kim
Juhwan Cho
Seungmin Baek
Jung Ho Ahn
MoE
219
5
0
21 Jul 2025
Latent Denoising Makes Good Visual Tokenizers
Jiawei Yang
Tianhong Li
Lijie Fan
Yonglong Tian
Yue Wang
196
13
0
21 Jul 2025
Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training
Minhak Song
Beomhan Baek
Kwangjun Ahn
Chulhee Yun
CLL, AI4CE
293
2
0
14 Jul 2025
Scaling Laws for Optimal Data Mixtures
Mustafa Shukor
Louis Béthune
Dan Busbridge
David Grangier
Enrico Fini
Alaaeldin El-Nouby
Pierre Ablin
210
11
0
12 Jul 2025
BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
Chenyang Song
Weilin Zhao
Xu Han
Chaojun Xiao
Yingfa Chen
Yuxuan Li
Zhiyuan Liu
Maosong Sun
MoE
263
0
0
11 Jul 2025
Memory Mosaics at scale
Jianyu Zhang
Léon Bottou
CLL
344
3
0
04 Jul 2025
Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models
Lauren Hyoseo Yoon
Yisong Yue
Been Kim
380
0
0
01 Jul 2025
FlatFusion: Delving into Details of Sparse Transformer-based Camera-LiDAR Fusion for Autonomous Driving
Yutao Zhu
Xiaosong Jia
Xinyu Yang
Junchi Yan
ViT
276
11
0
01 Jul 2025
Hierarchical Reasoning Model
Guan Wang
Jin Li
Yuhao Sun
Xing Chen
Changling Liu
Yue Wu
Meng Lu
Sen Song
Yasin Abbasi Yadkori
LRM
519
45
0
26 Jun 2025
SKOLR: Structured Koopman Operator Linear RNN for Time-Series Forecasting
Yitian Zhang
Liheng Ma
Antonios Valkanas
Boris N. Oreshkin
Mark Coates
AI4TS
263
3
0
17 Jun 2025
Self-supervised Representation Learning with Local Aggregation for Image-based Profiling
Siran Dai
Qianqian Xu
Peisong Wen
Yang Liu
Qingming Huang
306
2
0
17 Jun 2025
Load Balancing Mixture of Experts with Similarity Preserving Routers
Nabil Omi
S. Sen
Ali Farhadi
MoE
287
7
0
16 Jun 2025
Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems
Tuan Nguyen
Long-Vu Hoang
Huy-Dat Tran
224
3
0
16 Jun 2025
GTA: Grouped-head latenT Attention
Luoyang Sun
Cheng Deng
Jiwen Jiang
Xinjian Wu
Haifeng Zhang
Lei Chen
Lionel M. Ni
Ning Yang
177
1
0
15 Jun 2025
Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
Teodora Srećković
Jonas Geiping
Antonio Orvieto
MoE
203
5
0
14 Jun 2025
BSA: Ball Sparse Attention for Large-scale Geometries
Catalin E. Brita
Hieu Nguyen
Lohithsai Yadala Chanchu
Domonkos Nagy
Maksim Zhdanov
215
0
0
14 Jun 2025
One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers
Diana Abagyan
Alejandro Salamanca
Andres Felipe Cruz-Salinas
Kris Cao
Hangyu Lin
Acyr Locatelli
Marzieh Fadaee
Ahmet Üstün
Sara Hooker
CLL
376
5
0
12 Jun 2025
DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yuchen Feng
Bowen Shen
Naibin Gu
Jiaxuan Zhao
Peng Fu
Zheng Lin
Weiping Wang
MoMe, MoE
210
4
0
11 Jun 2025
ABC-FHE: A Resource-Efficient Accelerator Enabling Bootstrappable Parameters for Client-Side Fully Homomorphic Encryption
Design Automation Conference (DAC), 2025
Sungwoong Yune
Hyojeong Lee
Adiwena Putra
Hyunjun Cho
Cuong Duong Manh
Jaeho Jeon
Joo-Young Kim
325
43
0
10 Jun 2025
CausalPFN: Amortized Causal Effect Estimation via In-Context Learning
Vahid Balazadeh
Hamidreza Kamkari
Valentin Thomas
Benson Li
Junwei Ma
Jesse C. Cresswell
Rahul G. Krishnan
CML
201
6
0
09 Jun 2025
Learning Distribution-Wise Control in Representation Space for Language Models
Chunyuan Deng
Ruidi Chang
Hanjie Chen
274
2
0
07 Jun 2025