Transformer Based Linear Attention with Optimized GPU Kernel Implementation

24 October 2025
Armin Gerami
R. Duraiswami
ArXiv (abs) · PDF · HTML · GitHub (12875★)
Main: 12 pages · 5 figures · 3 tables · Bibliography: 3 pages · Appendix: 4 pages
Abstract

The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between N tokens, each embedded in a D-dimensional head, with a time complexity of O(N^2 D). Given the success of Transformers, improving their runtime during both training and inference is a popular research area. One such approach is the linear attention (LA) mechanism, which offers a linear time complexity of O(N D^2) and has demonstrated accuracy comparable to regular attention. In practice, however, LA lags behind its theoretical efficiency. We propose a novel method for LA's forward and backward passes, along with a highly optimized CUDA implementation. Our approach outperforms the state of the art by 3.3× in speed and reduces memory consumption by 3.6×. We validate these improvements in both single-layer and end-to-end settings by training a 1.4-billion-parameter language model, which demonstrates expressivity similar to regular attention on major reasoning benchmarks.
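The complexity gap the abstract describes comes from re-associating the attention product: regular attention materializes the N×N score matrix, whereas linear attention replaces softmax with a kernel feature map φ and computes φ(Q)(φ(K)ᵀV), a D×D summary independent of N. A minimal NumPy sketch of this idea follows; the feature map used here (a shifted ReLU) is an illustrative assumption, not the paper's choice, and it says nothing about the optimized CUDA kernels that are the paper's actual contribution.

```python
import numpy as np

def regular_attention(Q, K, V):
    # Softmax attention: builds the full N x N score matrix -> O(N^2 D).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Linear attention: a positive feature map phi replaces softmax, so the
    # product re-associates as phi(Q) @ (phi(K)^T V) -> O(N D^2).
    # phi here is a hypothetical stand-in for the paper's kernel.
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                 # D x D summary, cost independent of N
    Z = Qp @ Kp.sum(axis=0)       # per-query normalization term
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
N, D = 8, 4
Q, K, V = rng.normal(size=(3, N, D))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Because the D×D summary `KV` can be accumulated once and reused for every query, doubling the sequence length doubles the work instead of quadrupling it, which is where LA's theoretical advantage comes from.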
