Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free (arXiv:2505.06708)

10 May 2025
Zihan Qiu, Zhaoxiang Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Shangshang Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
Topics: MoE

Papers citing "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free"

30 of 30 papers shown
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han
Topics: MQ
01 Dec 2025
TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
Rui Qian, Haozhi Cao, Tianchen Deng, Tianxin Hu, Weixiang Guo, Shenghai Yuan, Lihua Xie
Topics: 3DGS
29 Nov 2025
TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, Jianxin Wu
Topics: MQ
28 Nov 2025
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, D. Yin, Xing Sun
25 Nov 2025
Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
Quentin G. Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, ..., Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge
Topics: MoE, VLM, LRM
21 Nov 2025
CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement
Pan Yang, Cheng Deng, J. Yang, Han Zhao, Yun-Hai Liu, Yuling Chen, Xiaoli Ruan, Yanping Chen
Topics: CoGe
20 Nov 2025
TNT: Improving Chunkwise Training for Test-Time Memorization
Zeman Li, Ali Behrouz, Yuan Deng, Peilin Zhong, Praneeth Kacham, Mahdi Karami, Meisam Razaviyayn, Vahab Mirrokni
10 Nov 2025
TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
Yuxiang Chen, Xiaoming Xu, Pengle Zhang, Michael Beyer, Martin Rapp, Jun Zhu, Jianfei Chen
Topics: MQ
31 Oct 2025
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, J. Hu, ..., Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du
30 Oct 2025
Knocking-Heads Attention
Zhanchao Zhou, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Jianguo Li
27 Oct 2025
Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
Anand, Umberto Cappellazzo, Stavros Petridis, Maja Pantic
26 Oct 2025
A Hierarchical Quantized Tokenization Framework for Task-Adaptive Graph Representation Learning
Yang Xiang, Li Fan, Chenke Yin, Chengtao Ji
14 Oct 2025
Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers
Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li
10 Oct 2025
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang
Topics: LRM
09 Oct 2025
Revisiting Long-context Modeling from Context Denoising Perspective
Zecheng Tang, Baibei Ji, Juntao Li, Lijun Wu, Haijia Gui, Min Zhang
07 Oct 2025
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Sangmin Bae, Bilge Acun, Haroun Habeeb, S. Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, Carole-Jean Wu
06 Oct 2025
Effective Model Pruning: Measure The Redundancy of Model Components
Yixuan Wang, Dan Guralnik, Saiedeh Akbari, Warren E. Dixon
30 Sep 2025
TTT3R: 3D Reconstruction as Test-Time Training
Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
Topics: 3DV
30 Sep 2025
SparseDoctor: Towards Efficient Chat Doctor with Mixture of Experts Enhanced Large Language Models
Jianbin Zhang, Yulin Zhu, Wai Lun Lo, Richard Tai-Chiu Hsung, Harris Sik-Ho Tsang, Kai Zhou
Topics: MoE, LM&MA
15 Sep 2025
Unveiling Super Experts in Mixture-of-Experts Large Language Models
Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, Kehong Yuan
Topics: MoE
31 Jul 2025
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
Gabriel Mongaras, Eric C. Larson
31 Jul 2025
GMSA: Enhancing Context Compression via Group Merging and Layer Semantic Alignment
Jiwei Tang, Zhicheng Zhang, Shunlong Wu, Jingheng Ye, Lichen Bai, ..., Tingwei Lu, Jiaqi Chen, Lin Hai, Hai-Tao Zheng, Hong-Gee Kim
18 May 2025
Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber
Topics: MoE, VLM
01 May 2025
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Zayd Muhammad Kawakibi Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
29 Apr 2025
Numerical Error Analysis of Large Language Models
Stanislav Budzinskiy, Wenyi Fang, Longbin Zeng, Philipp Petersen
13 Mar 2025
LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bin Wang, Xin Wu
Topics: VLM
11 Feb 2025
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zihan Qiu, Zeyu Huang, Jian Xu, Kaiyue Wen, Zhaoxiang Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin
Topics: MoE
21 Jan 2025
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Bo Shen, ..., Zhan Qin, Zhenhua Fan, Zhihang Yu, Z. L. Jiang, Zijia Wu
Topics: MoE
14 Jan 2025
MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chong Chen, Chuan Guo, Junli Cao, J. Ren, Sergey Tulyakov
Topics: VGen
14 Oct 2024
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads
Ali Khaleghi Rahimian, Manish Kumar Govind, Subhajit Maity, Dominick Reilly, Christian Kummerle, Srijan Das, A. Dutta
27 Jun 2024