arXiv:2012.09852 (v2, latest)
SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
International Symposium on High-Performance Computer Architecture (HPCA), 2021
17 December 2020
Hanrui Wang, Zhekai Zhang, Song Han
Papers citing "SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning" (50 of 189 papers shown)
FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation
Dong Liu, Jiayi Zhang, Yanxuan Yu, Ben Lengerich, Ying Nian Wu
30 Mar 2026

ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity
Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu
02 Dec 2025

CAMformer: Associative Memory is All You Need
Tergel Molom-Ochir, Benjamin Morris, Mark Horton, Chiyue Wei, Cong Guo, ..., Peter Liu, Shan X. Wang, Deliang Fan, Hai Helen Li, Yiran Chen
24 Nov 2025

KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference
H. Zhang, Chunwei Xia, Zheng Wang
14 Nov 2025

QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations
Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan
10 Nov 2025
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
IACR Cryptology ePrint Archive (IACR ePrint), 2025
Dinghong Song, Yuan Feng, Y. Wang, S. Chen, Cyril Guyot, F. Blagojevic, Hyeran Jeon, Pengfei Su, Dong Li
29 Oct 2025

Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Mutian He, Philip N. Garner
23 Oct 2025

SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference
Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, Y. Liu
20 Oct 2025

Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
Zhoutong Wu, Y. Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, Zhouchen Lin
19 Oct 2025

Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow
IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I), 2025
Ching-Lin Hsiung, Tian-Sheuan Chang
16 Oct 2025
Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
Tianhua Xia, Sai Qian Zhang
16 Oct 2025

APCE: Adaptive Progressive Context Expansion for Long Context Processing
Baisub Lee, Sanghyun Byun, Mohanad Odema, Jung Guack, Jacob Song, Woo Seong Chung
14 Oct 2025

Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2024
Tuowei Wang, Kun Li, Zixu Hao, Donglin Bai, Ju Ren, Yaoxue Zhang, Ting Cao, M. Yang
12 Oct 2025

Embodied AI: From LLMs to World Models
Tongtong Feng, Xin Wang, Yu Jiang, Wenwu Zhu
24 Sep 2025

Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning
Wenda Qin, Andrea Burns, Bryan A. Plummer, Margrit Betke
18 Sep 2025
LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Yujun Lin, Zhekai Zhang, Song Han
15 Sep 2025

SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, Huanrui Yang
11 Sep 2025

KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang, Xubin He, Sian Jin
30 Aug 2025

Spatio-Temporal Pruning for Compressed Spiking Large Language Models
Yi Jiang, Malyaban Bal, Brian Matejek, Susmit Jha, Adam D. Cobb, Abhronil Sengupta
23 Aug 2025

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu
22 Aug 2025
DPad: Efficient Diffusion Language Models with Suffix Dropout
Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Xue Yang, Yiran Chen
19 Aug 2025

Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation
Qirui Li, Guangcong Zheng, Qi Zhao, Jie Li, Bin Dong, Jing Lin, Xi Li
18 Aug 2025

Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints
Sandeep Reddy, Kabir Khan, Rohit Patil, Ananya Chakraborty, Faizan A. Khan, Swati Kulkarni, Arjun Verma, Neha Singh
14 Aug 2025

KLLM: Fast LLM Inference with K-Means Quantization
Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Helen Li
30 Jul 2025

Early Attentive Sparsification Accelerates Neural Speech Transcription
Zifei Xu, Sayeh Sharify, Hesham Mostafa, T. Webb, W. Yazar, Xin Wang
18 Jun 2025
AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
Yifeng Gu, Zicong Jiang, Jianxiu Jin, K. Guo, Ziyang Zhang, Xiangmin Xu
04 Jun 2025

Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection
Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda
31 May 2025

Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jie Ou, Jinyu Guo, Shuaihong Jiang, Zhaokun Wang, Libo Qin, Shunyu Yao, Wenhong Tian
19 May 2025

Bishop: Sparsified Bundling Spiking Transformers on Heterogeneous Cores with Error-Constrained Pruning
International Symposium on Computer Architecture (ISCA), 2025
Boxun Xu, Yuxuan Yin, Vikram Iyer, Peng Li
18 May 2025

Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks
International Symposium on Computer Architecture (ISCA), 2025
Chiyue Wei, Bowen Duan, Cong Guo, Jing Zhang, Qingyue Song, Hai "Helen" Li, Yiran Chen
16 May 2025
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Zihan Qiu, Zhaoxiang Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, ..., Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
10 May 2025

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Zayd Muhammad Kawakibi Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
29 Apr 2025

Efficient Pretraining Length Scaling
Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, Xun Zhou
21 Apr 2025

TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Information Fusion (Inf. Fusion), 2025
Xiaolun Jing, Genke Yang, Jian Chu
07 Apr 2025

Saliency-driven Dynamic Token Pruning for Large Language Models
Yao Tao, Yehui Tang, Yun Wang, Mingjian Zhu, Hailin Hu, Yunhe Wang
06 Apr 2025
Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
International Symposium on Computer Architecture (ISCA), 2025
Minsu Kim, Seongmin Hong, RyeoWook Ko, S. Choi, Hunjong Lee, Junsoo Kim, Joo-Young Kim, Jongse Park
24 Mar 2025

AxBERT: An Interpretable Chinese Spelling Correction Method Driven by Associative Knowledge Network
Fanyu Wang, Hangyu Zhu, Zhenping Xie
04 Mar 2025

Attention Condensation via Sparsity Induced Regularized Training
Eli Sason, Darya Frolova, Boris Nazarov, Felix Goldberd
03 Mar 2025

CipherPrune: Efficient and Scalable Private Transformer Inference
International Conference on Learning Representations (ICLR), 2025
Yancheng Zhang, Jinbao Xue, Mengxin Zheng, Mimi Xie, Mingzhe Zhang, Lei Jiang, Qian Lou
24 Feb 2025

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025
Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez Luna, Huawei Li, Xiaowei Li, Ying Wang, O. Mutlu
21 Feb 2025
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
12 Feb 2025

Breaking Down Bias: On The Limits of Generalizable Pruning Strategies
Conference on Fairness, Accountability and Transparency (FAccT), 2025
Sibo Ma, Alejandro Salinas, Peter Henderson, Julian Nyarko
11 Feb 2025

Ditto: Accelerating Diffusion Model via Temporal Value Similarity
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Sungbin Kim, Hyunwuk Lee, Wonho Cho, Mincheol Park, Won Woo Ro
20 Jan 2025

MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
Wenxuan Zeng, Ye Dong, Jinjin Zhou, Jin Tan, Tao Wei, Runsheng Wang, Meng Li
12 Jan 2025

EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Jaehoon Heo, Adiwena Putra, Jieon Yoon, Sungwoong Yune, Hangyeol Lee, Ji-Hoon Kim, Joo-Young Kim
10 Jan 2025
Multimodal joint prediction of traffic spatial-temporal data with graph sparse attention mechanism and bidirectional temporal convolutional network
Advanced Engineering Informatics (AEI), 2024
Dongran Zhang, Jiangnan Yan, K. Polat, A. Alhudhaif, Jun Li
31 Dec 2024

Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen
18 Dec 2024

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
04 Dec 2024

SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors
Design, Automation and Test in Europe (DATE), 2024
M. Rakka, Jiajian Li, Guohao Dai, A. Eltawil, M. Fouda, Fadi J. Kurdahi
26 Nov 2024

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
Yu Zhang, Ming Wang, Lancheng Zou, Wulong Liu, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu
25 Nov 2024