Generating Long Sequences with Sparse Transformers

23 April 2019
Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever
ArXiv (abs) · PDF · HTML
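
As context for the listing below: the cited paper factorizes dense causal self-attention into sparse patterns whose cost grows roughly as O(n·√n) rather than O(n²). The snippet below is a minimal NumPy sketch of the strided pattern the paper describes (each position attends to a recent local window plus every stride-th earlier position); the sequence length and stride values here are illustrative choices, not the paper's settings.

```python
import numpy as np

def strided_sparse_mask(seq_len: int, stride: int) -> np.ndarray:
    """Boolean causal mask combining the two strided-attention components
    from the paper: a local window over the previous `stride` positions,
    and a periodic component hitting every `stride`-th earlier position."""
    i = np.arange(seq_len)[:, None]   # query positions (rows)
    j = np.arange(seq_len)[None, :]   # key positions (columns)
    causal = j <= i                   # no attention to future positions
    local = (i - j) < stride          # recent-window component
    periodic = (i - j) % stride == 0  # every stride-th position component
    return causal & (local | periodic)

# Illustrative values only; the paper picks the stride near sqrt(seq_len).
mask = strided_sparse_mask(seq_len=16, stride=4)
print(mask.astype(int))
```

In the paper the two components are typically split across separate attention heads rather than merged into a single mask as done here; merging them keeps the sketch compact while preserving the overall connectivity.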

Papers citing "Generating Long Sequences with Sparse Transformers"

Showing 50 of 1,282 citing papers.

PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
03 Dec 2025

Nexus: Higher-Order Attention Mechanisms in Transformers
Hanting Chen, Chong Zhu, Kai Han, Yuchuan Tian, Yuchen Liang, Tianyu Guo, Xinghao Chen, Dacheng Tao, Yunhe Wang
03 Dec 2025

HTTM: Head-wise Temporal Token Merging for Faster VGGT
Weitian Wang, Lukas Meiner, Rai Shubham, Cecilia De La Parra, Akash Kumar
26 Nov 2025

Length-MAX Tokenizer for Language Models
Dong Dong, Weijie Su
Tags: VLM
25 Nov 2025

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, D. Yin, Xing Sun
25 Nov 2025

Re-Key-Free, Risky-Free: Adaptable Model Usage Control
Zihan Wang, Zhongkui Ma, Xinguo Feng, Chuan Yan, Dongge Liu, Ruoxi Sun, Derui Wang, Minhui Xue, Guangdong Bai
Tags: AAML
24 Nov 2025

Rethinking Vision Transformer Depth via Structural Reparameterization
Chengwei Zhou, Vipin Chaudhary, Gourav Datta
Tags: ViT
24 Nov 2025

DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams
Ginés Carreto Picón, Peng Yuan Zhou, Qi Zhang, Alexandros Iosifidis
Tags: AI4TS
21 Nov 2025

Joint Semantic-Channel Coding and Modulation for Token Communications
Jingkai Ying, Zhijin Qin, Yulong Feng, Liejun Wang, Xiaoming Tao
19 Nov 2025

Attention Via Convolutional Nearest Neighbors
Mingi Kang, Jeová Farias Sales Rocha Neto
18 Nov 2025

QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention
Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Wenjun Huang, Tamoghno Das, Suyeon Jang, Mohsen Imani
Tags: VLM
17 Nov 2025

TIMERIPPLE: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
Wenxuan Miao, Yulin Sun, Aiyue Chen, Jing Lin, Yiwu Yao, Yiming Gan, Jieru Zhao, Jingwen Leng, Mingyi Guo, Yu Feng
15 Nov 2025

KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference
H. Zhang, Chunwei Xia, Zheng Wang
Tags: SyDa
14 Nov 2025

Galactification: painting galaxies onto dark matter only simulations using a transformer-based model
Shivam Pandey, Christopher C. Lovell, Chirag Modi, Benjamin Dan Wandelt
Tags: 3DGS
11 Nov 2025

Learning to Focus: Focal Attention for Selective and Scalable Transformers
Dhananjay Ram, Wei Xia, Stefano Soatto
10 Nov 2025

CG-TTRL: Context-Guided Test-Time Reinforcement Learning for On-Device Large Language Models
Peyman Hosseini, Ondrej Bohdal, Taha Ceritli, Ignacio Castro, Matthew Purver, Mete Ozay, Umberto Michieli
Tags: OffRL
09 Nov 2025

How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy
Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang
08 Nov 2025

Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving
Hui Zeng, Daming Zhao, Pengfei Yang, WenXuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai
08 Nov 2025

BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models
Chandra Vamsi Krishna Alla, Harish Naidu Gaddam, Manohar Kommi
Tags: RALM
07 Nov 2025

Attention and Compression is all you need for Controllably Efficient Language Models
Jatin Prakash, N. Jethani, Rajesh Ranganath
Tags: MQ, VLM
07 Nov 2025

Neural Beamforming with Doppler-Aware Sparse Attention for High Mobility Environments
Cemil Vahapoglu, Timothy J. O'Shea, Wan Liu, S. Ulukus
05 Nov 2025

AILA--First Experiments with Localist Language Models
Joachim Diederich
05 Nov 2025

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Haggstrom, ..., Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, R. Prabhakar
Tags: VLM
05 Nov 2025

SALS: Sparse Attention in Latent Space for KV cache Compression
Junlin Mu, Hantao Huang, J. Zhang, Minghui Yu, Tao Wang, Yidong Li
28 Oct 2025

Large language model-based task planning for service robots: A review
Shaohan Bian, Ying Zhang, Guohui Tian, Zhiqiang Miao, Edmond Q. Wu, Simon X. Yang, C. Hua
Tags: LLMAG, LM&Ro
27 Oct 2025

Transformers from Compressed Representations
Juan Carlos León Alcázar, Mattia Soldan, Mohammad Saatialsoruji, Alejandro Pardo, Hani Itani, Juan C. Pérez, Bernard Ghanem
26 Oct 2025

Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows
Billy Dickson, Zoran Tiganj
Tags: CLL
25 Oct 2025

Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity
Pratik Poudel
Tags: KELM
23 Oct 2025

Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Mutian He, Philip N. Garner
Tags: CLL
23 Oct 2025

GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data
Yudong Li, Hao Li, Xianxu Hou, Linlin Shen
21 Oct 2025

Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
20 Oct 2025

Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
Zhoutong Wu, Y. Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, Zhouchen Lin
19 Oct 2025

FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni
18 Oct 2025

Stability of Transformers under Layer Normalization
Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Krishna Kumar, Markos A. Katsoulakis
10 Oct 2025

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavarm
Tags: LRM
10 Oct 2025

Learning What to Remember: Adaptive Probabilistic Memory Retention for Memory-Efficient Language Models
S M Rafiuddin, Muntaha Nujat Khan
Tags: RALM, KELM
09 Oct 2025

Artificial Hippocampus Networks for Efficient Long-Context Modeling
Yunhao Fang, Weihao Yu, Shu Zhong, Qinghao Ye, Xuehan Xiong, Lai Wei
08 Oct 2025

Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors
Vasileios Titopoulos, K. Alexandridis, G. Dimitrakopoulos
08 Oct 2025

The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
Alexander Fichtl, Jeremias Bohn, Josefin Kelber, Edoardo Mosca, Georg Groh
06 Oct 2025

Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
Yue Pan, Zihan Xia, Po-Kai Hsu, Lanxiang Hu, Hyungyo Kim, ..., Minxuan Zhou, Nam Sung Kim, Shimeng Yu, Tajana Rosing, Mingu Kang
Tags: MoE
06 Oct 2025

Emergent Coordination in Multi-Agent Language Models
Christoph Riedl
Tags: LLMAG
05 Oct 2025

Towards Sampling Data Structures for Tensor Products in Turnstile Streams
Zhao Song, Shenghao Xie, Samson Zhou
04 Oct 2025

Accelerating Attention with Basis Decomposition
Jialin Zhao
02 Oct 2025

Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation
Beijia Lu, Ziyi Chen, Jing Xiao, Jun-Yan Zhu
Tags: DiffM, VGen
02 Oct 2025

SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing
Jiaye Tan, Haonan Luo, Linfeng Song, Shuaiqi Chen, Yishan Lyu, ..., Haoran Zhang, Jiaming Bai, Haoran Cheng, Q. Vera Liao, Hao-Wen Dong
01 Oct 2025

TASP: Topology-aware Sequence Parallelism
Y. Wang, Ke Hong, Xiuhong Li, Yuanchao Xu, Wenxun Wang, Guohao Dai, Y. Wang
30 Sep 2025

HilbertA: Hilbert Attention for Image Generation with Diffusion Models
Shaoyi Zheng, Wenbo Lu, Yuxuan Xia, Haomin Liu, Shengjie Wang
30 Sep 2025

InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation
Weilin Zhao, Z. Zhou, Zhou Su, Chaojun Xiao, Yuxuan Li, ..., Ruoyao Xiao, Yuxiang Huang, Ao Sun, Xu Han, Zhiyuan Liu
Tags: VLM
29 Sep 2025

FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers
Liang Qiao, Yue Dai, Y. Huang, Hongyu Kan, Jun Shi, Hong An
29 Sep 2025

Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents
Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang-Bin Wang, An Zhang
Tags: LLMAG, KELM, RALM, OffRL, CLL, LRM
27 Sep 2025