Generating Long Sequences with Sparse Transformers
arXiv:1904.10509 · 23 April 2019
Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Papers citing "Generating Long Sequences with Sparse Transformers" (showing 50 of 1,283)

Lag-Relative Sparse Attention In Long Context Training
Manlai Liang, Wanyi Huang, Mandi Liu, Huaijun Li, Jinlong Li
RALM · 13 Jun 2025

On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention
Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, Aditya Akella
11 Jun 2025

SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, ..., Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
VLM, LRM · 10 Jun 2025

AstroCompress: A benchmark dataset for multi-purpose compression of astronomical data
International Conference on Learning Representations (ICLR), 2025
Tuan Truong, Rithwik Sudharsan, Jianlong Wu, Peter Xiangyuan Ma, Ruihan Yang, Stephan Mandt, Joshua S. Bloom
10 Jun 2025

Spark Transformer: Reactivating Sparsity in FFN and Attention
Chong You, Kan Wu, Zhipeng Jia, Lin Chen, Srinadh Bhojanapalli, ..., Felix X. Yu, Prateek Jain, David Culler, Henry M. Levy, Sanjiv Kumar
07 Jun 2025

MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv, Shengyu Zhang, Fan Wu, Fei Wu
VLM · 06 Jun 2025

DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng
06 Jun 2025

Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
05 Jun 2025

Beyond Text Compression: Evaluating Tokenizers Across Scales
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili
03 Jun 2025

COGNATE: Acceleration of Sparse Tensor Programs on Emerging Hardware using Transfer Learning
Chamika Sudusinghe, Gerasimos Gerogiannis, Damitha Sandeepa Lenadora, Charles Block, Josep Torrellas, Charith Mendis
31 May 2025

SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
Xiaodong Ji, Hailin Zhang, Fangcheng Fu, Huang Leng
30 May 2025

INSIGHT: A Survey of In-Network Systems for Intelligent, High-Efficiency AI and Topology Optimization
Aleksandr Algazinov, Joydeep Chandra, Matt Laing
30 May 2025

Transformers Are Universally Consistent
Sagar Ghosh, Kushal Bose, Swagatam Das
30 May 2025

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Jang-Hyun Kim, Jinuk Kim, S. Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song
MQ, VLM · 29 May 2025

AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity
Yu Zhang, Dong Guo, Fang Wu, Guoliang Zhu, Dian Ding, Yiming Zhang
29 May 2025

VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara
VLM · 28 May 2025

Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu
VGen · 28 May 2025

ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction
Adeela Islam, Stefano Fiorini, Stuart James, Pietro Morerio, Alessio Del Bue
DiffM · 27 May 2025

Vision Transformers with Self-Distilled Registers
Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, Andrew F. Luo
27 May 2025

Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers
Yukun Zhang, Xueqing Zhou
AI4TS · 27 May 2025

CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge
Gabriele Lagani, Fabrizio Falchi, Claudio Gennaro, Giuseppe Amato
26 May 2025

MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin
26 May 2025

How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation
Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
24 May 2025

MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention
Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano
24 May 2025

Why Do Some Inputs Break Low-Bit LLM Quantization?
Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
MQ · 24 May 2025

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models
Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, Tat-Seng Chua
KELM, LRM · 23 May 2025

Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models
Benjamin Walker, Lingyi Yang, Nicola Muca Cirone, C. Salvi, Terry Lyons
AI4TS · 23 May 2025

Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse
Josh Alman, Zhao Song
22 May 2025

Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization
Joonho Yang, Seunghyun Yoon, Hwan Chang, Byeongjeong Kim, Hwanhee Lee
HILM · 21 May 2025

Low-Cost FlashAttention with Fused Exponential and Multiplication Hardware Operators
IEEE Computer Society Annual Symposium on VLSI (VLSI), 2025
K. Alexandridis, Vasileios Titopoulos, G. Dimitrakopoulos
20 May 2025

FLASH-D: FlashAttention with Hidden Softmax Division
K. Alexandridis, Vasileios Titopoulos, G. Dimitrakopoulos
20 May 2025

Fast RoPE Attention: Combining the Polynomial Method and Fast Fourier Transform
Josh Alman, Zhao Song
17 May 2025

Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency
Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis
16 May 2025

MID-L: Matrix-Interpolated Dropout Layer with Layer-wise Neuron Selection
Pouya Shaeri, Ariane Middel
16 May 2025

ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention
Jintian Shao, Hongyi Huang, Beiwen Zhang, ZhiYu Wu, You Shan, MingKai Zheng
15 May 2025

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
Pencuo Zeren, Qiuming Luo, Rui Mao, Chang Kong
13 May 2025

Lost in Transmission: When and Why LLMs Fail to Reason Globally
Tobias Schnabel, Kiran Tomlinson, Adith Swaminathan, Jennifer Neville
LRM · 13 May 2025

Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Hyowon Wi, Jeongwhan Choi, Noseong Park
13 May 2025

Fused3S: Fast Sparse Attention on Tensor Cores
International Conference on Supercomputing (ICS), 2025
Zitong Li, Aparna Chandramowlishwaran
GNN · 12 May 2025

A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting
Lhuqita Fazry
VLM · 11 May 2025

Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition
Andrew Kiruluta, Eric Lundy, Priscilla Burity
09 May 2025

Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution
Xingyu Zhou, Wei Long, Jingbo Lu, Shiyin Jiang, Weiyi You, Haifeng Wu, Shuhang Gu
04 May 2025

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber
MoE, VLM · 01 May 2025

Polysemy of Synthetic Neurons Towards a New Type of Explanatory Categorical Vector Spaces
Michael Pichat, William Pogrund, Paloma Pichat, Judicael Poumay, Armanouche Gasparian, Samuel Demarchi, Martin Corbet, Alois Georgeon, Michael Veillet-Guillem
MILM · 30 Apr 2025

From Attention to Atoms: Spectral Dictionary Learning for Fast, Interpretable Language Models
Andrew Kiruluta
29 Apr 2025

Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
Ruifeng Ren, Yong Liu
26 Apr 2025

The Rise of Small Language Models in Healthcare: A Comprehensive Survey
Muskan Garg, Shaina Raza, Shebuti Rayana, Xingyi Liu, Sunghwan Sohn
LM&MA, AILaw · 23 Apr 2025

Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, ..., Bing Xu, Haicheng Wu, Wen-mei W. Hwu, Xuan Li, Humphrey Shi
23 Apr 2025

Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu
Mamba · 23 Apr 2025

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, ..., Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yue Yang, Lili Qiu
22 Apr 2025

Page 3 of 26