Generating Long Sequences with Sparse Transformers
R. Child, Scott Gray, Alec Radford, Ilya Sutskever
arXiv:1904.10509 · 23 April 2019

Papers citing "Generating Long Sequences with Sparse Transformers"

Showing 50 of 1,283 citing papers (page 3 of 26)

Lag-Relative Sparse Attention In Long Context Training
Manlai Liang, Wanyi Huang, Mandi Liu, Huaijun Li, Jinlong Li
RALM · 13 Jun 2025

On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention
Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, Aditya Akella
11 Jun 2025

SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, ..., Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
VLM, LRM · 10 Jun 2025

AstroCompress: A benchmark dataset for multi-purpose compression of astronomical data
International Conference on Learning Representations (ICLR), 2025
Tuan Truong, Rithwik Sudharsan, Jianlong Wu, Peter Xiangyuan Ma, Ruihan Yang, Stephan Mandt, Joshua S. Bloom
10 Jun 2025

Spark Transformer: Reactivating Sparsity in FFN and Attention
Chong You, Kan Wu, Zhipeng Jia, Lin Chen, Srinadh Bhojanapalli, ..., Felix X. Yu, Prateek Jain, David Culler, Henry M. Levy, Sanjiv Kumar
07 Jun 2025

MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv, Shengyu Zhang, Fan Wu, Fei Wu
VLM · 06 Jun 2025

DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng
06 Jun 2025

Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
05 Jun 2025

Beyond Text Compression: Evaluating Tokenizers Across Scales
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili
03 Jun 2025

COGNATE: Acceleration of Sparse Tensor Programs on Emerging Hardware using Transfer Learning
Chamika Sudusinghe, Gerasimos Gerogiannis, Damitha Sandeepa Lenadora, Charles Block, Josep Torrellas, Charith Mendis
31 May 2025

SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
Xiaodong Ji, Hailin Zhang, Fangcheng Fu, Huang Leng
30 May 2025

INSIGHT: A Survey of In-Network Systems for Intelligent, High-Efficiency AI and Topology Optimization
Aleksandr Algazinov, Joydeep Chandra, Matt Laing
30 May 2025

Transformers Are Universally Consistent
Sagar Ghosh, Kushal Bose, Swagatam Das
30 May 2025

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Jang-Hyun Kim, Jinuk Kim, S. Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song
MQ, VLM · 29 May 2025

AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity
Yu Zhang, Dong Guo, Fang Wu, Guoliang Zhu, Dian Ding, Yiming Zhang
29 May 2025

VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara
VLM · 28 May 2025

Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu
VGen · 28 May 2025

ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction
Adeela Islam, Stefano Fiorini, Stuart James, Pietro Morerio, Alessio Del Bue
DiffM · 27 May 2025

Vision Transformers with Self-Distilled Registers
Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, Andrew F. Luo
27 May 2025

Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers
Yukun Zhang, Xueqing Zhou
AI4TS · 27 May 2025

CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge
Gabriele Lagani, Fabrizio Falchi, Claudio Gennaro, Giuseppe Amato
26 May 2025

MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin
26 May 2025

How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation
Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
24 May 2025

MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention
Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano
24 May 2025

Why Do Some Inputs Break Low-Bit LLM Quantization?
Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
MQ · 24 May 2025

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models
Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, Tat-Seng Chua
KELM, LRM · 23 May 2025

Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models
Benjamin Walker, Lingyi Yang, Nicola Muca Cirone, C. Salvi, Terry Lyons
AI4TS · 23 May 2025

Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse
Josh Alman, Zhao Song
22 May 2025

Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization
Joonho Yang, Seunghyun Yoon, Hwan Chang, Byeongjeong Kim, Hwanhee Lee
HILM · 21 May 2025

Low-Cost FlashAttention with Fused Exponential and Multiplication Hardware Operators
IEEE Computer Society Annual Symposium on VLSI (VLSI), 2025
K. Alexandridis, Vasileios Titopoulos, G. Dimitrakopoulos
20 May 2025

FLASH-D: FlashAttention with Hidden Softmax Division
K. Alexandridis, Vasileios Titopoulos, G. Dimitrakopoulos
20 May 2025

Fast RoPE Attention: Combining the Polynomial Method and Fast Fourier Transform
Josh Alman, Zhao Song
17 May 2025

Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency
Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis
16 May 2025

MID-L: Matrix-Interpolated Dropout Layer with Layer-wise Neuron Selection
Pouya Shaeri, Ariane Middel
16 May 2025

ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention
Jintian Shao, Hongyi Huang, Beiwen Zhang, ZhiYu Wu, You Shan, MingKai Zheng
15 May 2025

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
Pencuo Zeren, Qiuming Luo, Rui Mao, Chang Kong
13 May 2025

Lost in Transmission: When and Why LLMs Fail to Reason Globally
Tobias Schnabel, Kiran Tomlinson, Adith Swaminathan, Jennifer Neville
LRM · 13 May 2025

Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Hyowon Wi, Jeongwhan Choi, Noseong Park
13 May 2025

Fused3S: Fast Sparse Attention on Tensor Cores
International Conference on Supercomputing (ICS), 2025
Zitong Li, Aparna Chandramowlishwaran
GNN · 12 May 2025

A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting
Lhuqita Fazry
VLM · 11 May 2025

Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition
Andrew Kiruluta, Eric Lundy, Priscilla Burity
09 May 2025

Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution
Xingyu Zhou, Wei Long, Jingbo Lu, Shiyin Jiang, Weiyi You, Haifeng Wu, Shuhang Gu
04 May 2025

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber
MoE, VLM · 01 May 2025

Polysemy of Synthetic Neurons Towards a New Type of Explanatory Categorical Vector Spaces
Michael Pichat, William Pogrund, Paloma Pichat, Judicael Poumay, Armanouche Gasparian, Samuel Demarchi, Martin Corbet, Alois Georgeon, Michael Veillet-Guillem
MILM · 30 Apr 2025

From Attention to Atoms: Spectral Dictionary Learning for Fast, Interpretable Language Models
Andrew Kiruluta
29 Apr 2025

Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
Ruifeng Ren, Yong Liu
26 Apr 2025

The Rise of Small Language Models in Healthcare: A Comprehensive Survey
Muskan Garg, Shaina Raza, Shebuti Rayana, Xingyi Liu, Sunghwan Sohn
LM&MA, AILaw · 23 Apr 2025

Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, ..., Bing Xu, Haicheng Wu, Wen-mei W. Hwu, Xuan Li, Humphrey Shi
23 Apr 2025

Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu
Mamba · 23 Apr 2025

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, ..., Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yue Yang, Lili Qiu
22 Apr 2025