
GLU Variants Improve Transformer

Noam M. Shazeer · 12 February 2020 · arXiv:2002.05202

Papers citing "GLU Variants Improve Transformer"

Showing 50 of 904 citing papers.
  • dots.llm1 Technical Report [MoE] · 06 Jun 2025
    Bi Huo, Bin Tu, Cheng Qin, Da Zheng, Debing Zhang, ..., Yuqiu Ji, Ze Wen, Zhenhai Liu, Zichao Li, Zilong Liao
  • Scaling Transformers for Discriminative Recommendation via Generative Pretraining · 04 Jun 2025
    Chunqi Wang, Bingchao Wu, Z. Chen, Lei Shen, Bing Wang, Xiaoyi Zeng
  • Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights [MoE, ALM] · 03 Jun 2025
    Jakub Krajewski, Marcin Chochowski, Daniel Korzekwa
  • How Programming Concepts and Neurons Are Shared in Code Language Models (ACL 2025) · 01 Jun 2025
    Amir Hossein Kargaran, Yihong Liu, François Yvon, Hinrich Schütze
  • Equivalent Linear Mappings of Large Language Models · 30 May 2025
    James R. Golden
  • Differential Gated Self-Attention · 29 May 2025
    Elpiniki Maria Lygizou, Mónika Farsang, Radu Grosu
  • Continuous Chain of Thought Enables Parallel Exploration and Reasoning [LRM] · 29 May 2025
    Halil Alperen Gozeten, M. E. Ildiz, Xuechen Zhang, Hrayr Harutyunyan, A. S. Rawat, Samet Oymak
  • Exploring Scaling Laws for EHR Foundation Models · 29 May 2025
    Sheng Zhang, Qin Liu, Naoto Usuyama, Cliff Wong, Tristan Naumann, Hoifung Poon
  • RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination · 28 May 2025
    Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong
  • Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems · 28 May 2025
    Christopher Ormerod
  • FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference [VLM] · 28 May 2025
    Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, Yoon Kim
  • HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer [VLM] · 28 May 2025
    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, ..., Rui Tian, Siyu Wang, Bo Zhao, Ting Yao, Tao Mei
  • Learning in Compact Spaces with Approximately Normalized Transformer · 28 May 2025
    Jörg Franke, Urs Spiegelhalter, Marianna Nezhurina, J. Jitsev, Katharina Eggensperger, Michael Hefenbrock
  • In Search of Adam's Secret Sauce · 27 May 2025
    Antonio Orvieto, Robert Gower
  • How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective · 27 May 2025
    Shimao Zhang, Z. Lai, Xiang Liu, Shuaijie She, Xiao Liu, Yeyun Gong, Shujian Huang, Jiajun Chen
  • Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders [MoE] · 27 May 2025
    James Oldfield, Shawn Im, Yixuan Li, M. Nicolaou, Ioannis Patras, Grigorios G. Chrysos
  • LlamaSeg: Image Segmentation via Autoregressive Mask Generation [VLM] · 26 May 2025
    Jiru Deng, Tengjin Weng, Tianyu Yang, Tong Lu, Zhiheng Li, Wenhao Jiang
  • Understanding Transformer from the Perspective of Associative Memory · 26 May 2025
    Shu Zhong, Mingyu Xu, Tenglong Ao, Guang Shi
  • Burst Image Super-Resolution via Multi-Cross Attention Encoding and Multi-Scan State-Space Decoding (Image and Vision Computing, 2025) [SupR] · 26 May 2025
    Tengda Huang, Yu Zhang, Tianren Li, Yufu Qu, Fulin Liu, Zhenzhong Wei
  • Towards Fully FP8 GEMM LLM Training at Scale [MQ] · 26 May 2025
    Alejandro Hernández Cano, Dhia Garbaya, Imanol Schlag, Martin Jaggi
  • LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models · 25 May 2025
    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, ..., Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li
  • Why Do Some Inputs Break Low-Bit LLM Quantization? [MQ] · 24 May 2025
    Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
  • How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation · 24 May 2025
    Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
  • COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection · 23 May 2025
    Jaewon Cheon, Pilsung Kang
  • PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training · 23 May 2025
    Matan Haroush, Daniel Soudry
  • QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design [VLM] · 22 May 2025
    Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen
  • Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs [MoMe, KELM, CLL] · 22 May 2025
    Zeping Yu, Sophia Ananiadou
  • LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning [MLLM, VLM] · 22 May 2025
    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, J. Wen, Chongxuan Li
  • Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN [LRM] · 22 May 2025
    Yao Xu, Mingyu Xu, Fangyu Lei, Wangtao Sun, Xiangrong Zeng, Bingning Wang, Guang Liu, Shizhu He, Jun Zhao, Kang Liu
  • MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation [SSeg] · 21 May 2025
    Maike Behrendt, Stefan Sylvius Wagner, Stefan Harmeling
  • BanglaByT5: Byte-Level Modelling for Bangla · 21 May 2025
    Pramit Bhattacharyya, Arnab Bhattacharya
  • Guarded Query Routing for Large Language Models [RALM] · 20 May 2025
    Richard Šléher, William Brach, Tibor Sloboda, Kristián Košťál, Lukas Galke
  • This Time is Different: An Observability Perspective on Time Series Foundation Models [AI4TS, AI4CE] · 20 May 2025
    Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, ..., Zongzhe Xu, Viktoriya Zhukova, David Asker, Ameet Talwalkar, Othmane Abou-Amal
  • Scaling Law for Quantization-Aware Training [MQ] · 20 May 2025
    Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, ..., Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo
  • Output Scaling: YingLong-Delayed Chain of Thought in a Large Pretrained Time Series Forecasting Model [AI4TS, AI4CE, LRM] · 20 May 2025
    Qingsong Wen, Tian Zhou, Jinyang Gao, Bolin Ding, Jingren Zhou
  • Systematic Generalization in Language Models Scales with Information Entropy (ACL 2025) · 19 May 2025
    Sondre Wold, Lucas Georges Gabriel Charpentier, Étienne Simon
  • A3: An Analytical Low-Rank Approximation Framework for Attention [OffRL, MQ] · 19 May 2025
    Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao
  • Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training · 19 May 2025
    Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
  • Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space · 19 May 2025
    Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang
  • PiT: Progressive Diffusion Transformer · 19 May 2025
    Jiafu Wu, Yabiao Wang, Jian Li, Jinlong Peng, Yun Cao, Chengjie Wang, Jiangning Zhang
  • Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation (SIGIR 2025) · 17 May 2025
    Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
  • Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model [ALM] · 17 May 2025
    Shen Li, Renfen Hu, Lijun Wang
  • Chain-of-Model Learning for Language Model [LRM, AI4CE] · 17 May 2025
    Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, ..., Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu
  • MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production [MoE] · 16 May 2025
    Cheng Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Jing Liu, ..., Yanghua Peng, Xuanzhe Liu, Xin Jin, Xin Liu
  • Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput [VLM] · 14 May 2025
    Bo Zhang, Shuo Li, Runhe Tian, Yang Yang, Jixin Tang, Jinhao Zhou, Lin Ma
  • Large Language Models for Computer-Aided Design: A Survey [3DV, AI4CE] · 13 May 2025
    Licheng Zhang, Bach Le, Naveed Akhtar, Siew-Kei Lam, Tuan Ngo
  • DELPHYNE: A Pre-Trained Model for General and Financial Time Series [AI4TS] · 12 May 2025
    Xueying Ding, Aakriti Mittal, Achintya Gopal
  • Circuit Partitioning Using Large Language Models for Quantum Compilation and Simulations · 12 May 2025
    Pranav Sinha, Sumit Kumar Jha, Sunny Raj
  • Comet: Accelerating Private Inference for Large Language Model by Predicting Activation Sparsity (IEEE S&P 2025) · 12 May 2025
    Guang Yan, Yuhui Zhang, Zimu Guo, Lutan Zhao, Xiaojun Chen, Chen Wang, Wenhao Wang, Dan Meng, Rui Hou
  • Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [MoE] · 10 May 2025
    Zihan Qiu, Zhaoxiang Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, ..., Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin