ResearchTrend.AI
© 2026 ResearchTrend.AI, All rights reserved.

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
International Symposium on High-Performance Computer Architecture (HPCA), 2020
17 December 2020
Hanrui Wang, Zhekai Zhang, Song Han
arXiv: 2012.09852

Papers citing "SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"

Showing 50 of 189 citing papers.
FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation
Dong Liu, Jiayi Zhang, Yanxuan Yu, Ben Lengerich, Ying Nian Wu
30 Mar 2026

ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity
Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu
02 Dec 2025

CAMformer: Associative Memory is All You Need
Tergel Molom-Ochir, Benjamin Morris, Mark Horton, Chiyue Wei, Cong Guo, ..., Peter Liu, Shan X. Wang, Deliang Fan, Hai Helen Li, Yiran Chen
24 Nov 2025

KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference
H. Zhang, Chunwei Xia, Zheng Wang
14 Nov 2025

QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations
Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan
10 Nov 2025

AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
IACR Cryptology ePrint Archive (IACR ePrint), 2025
Dinghong Song, Yuan Feng, Y. Wang, S. Chen, Cyril Guyot, F. Blagojevic, Hyeran Jeon, Pengfei Su, Dong Li
29 Oct 2025

Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Mutian He, Philip N. Garner
23 Oct 2025

SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference
Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, Y. Liu
20 Oct 2025

Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
Zhoutong Wu, Y. Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, Zhouchen Lin
19 Oct 2025

Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow
IEEE Transactions on Circuits and Systems Part 1: Regular Papers (TCAS-I), 2025
Ching-Lin Hsiung, Tian-Sheuan Chang
16 Oct 2025

Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
Tianhua Xia, Sai Qian Zhang
16 Oct 2025

APCE: Adaptive Progressive Context Expansion for Long Context Processing
Baisub Lee, Sanghyun Byun, Mohanad Odema, Jung Guack, Jacob Song, Woo Seong Chung
14 Oct 2025

Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2024
Tuowei Wang, Kun Li, Zixu Hao, Donglin Bai, Ju Ren, Yaoxue Zhang, Ting Cao, M. Yang
12 Oct 2025

Embodied AI: From LLMs to World Models
Tongtong Feng, Xin Wang, Yu Jiang, Wenwu Zhu
24 Sep 2025

Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning
Wenda Qin, Andrea Burns, Bryan A. Plummer, Margrit Betke
18 Sep 2025

LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Yujun Lin, Zhekai Zhang, Song Han
15 Sep 2025

SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, Huanrui Yang
11 Sep 2025

KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang, Xubin He, Sian Jin
30 Aug 2025

Spatio-Temporal Pruning for Compressed Spiking Large Language Models
Yi Jiang, Malyaban Bal, Brian Matejek, Susmit Jha, Adam D. Cobb, Abhronil Sengupta
23 Aug 2025

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu
22 Aug 2025

DPad: Efficient Diffusion Language Models with Suffix Dropout
Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Xue Yang, Yiran Chen
19 Aug 2025

Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation
Qirui Li, Guangcong Zheng, Qi Zhao, Jie Li, Bin Dong, Jing Lin, Xi Li
18 Aug 2025

Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints
Sandeep Reddy, Kabir Khan, Rohit Patil, Ananya Chakraborty, Faizan A. Khan, Swati Kulkarni, Arjun Verma, Neha Singh
14 Aug 2025

KLLM: Fast LLM Inference with K-Means Quantization
Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Helen Li
30 Jul 2025

Early Attentive Sparsification Accelerates Neural Speech Transcription
Zifei Xu, Sayeh Sharify, Hesham Mostafa, T. Webb, W. Yazar, Xin Wang
18 Jun 2025

AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
Yifeng Gu, Zicong Jiang, Jianxiu Jin, K. Guo, Ziyang Zhang, Xiangmin Xu
04 Jun 2025

Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection
Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda
31 May 2025

Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jie Ou, Jinyu Guo, Shuaihong Jiang, Zhaokun Wang, Libo Qin, Shunyu Yao, Wenhong Tian
19 May 2025

Bishop: Sparsified Bundling Spiking Transformers on Heterogeneous Cores with Error-Constrained Pruning
International Symposium on Computer Architecture (ISCA), 2025
Boxun Xu, Yuxuan Yin, Vikram Iyer, Peng Li
18 May 2025

Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks
International Symposium on Computer Architecture (ISCA), 2025
Chiyue Wei, Bowen Duan, Cong Guo, Jing Zhang, Qingyue Song, Hai "Helen" Li, Yiran Chen
16 May 2025

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Zihan Qiu, Zhaoxiang Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, ..., Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
10 May 2025

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Zayd Muhammad Kawakibi Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
29 Apr 2025

Efficient Pretraining Length Scaling
Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, Xun Zhou
21 Apr 2025

TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Information Fusion (Inf. Fusion), 2025
Xiaolun Jing, Genke Yang, Jian Chu
07 Apr 2025

Saliency-driven Dynamic Token Pruning for Large Language Models
Yao Tao, Yehui Tang, Yun Wang, Mingjian Zhu, Hailin Hu, Yunhe Wang
06 Apr 2025

06 Apr 2025
Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache QuantizationInternational Symposium on Computer Architecture (ISCA), 2025
Minsu Kim
Seongmin Hong
RyeoWook Ko
S. Choi
Hunjong Lee
Junsoo Kim
Joo-Young Kim
Jongse Park
365
13
0
24 Mar 2025
AxBERT: An Interpretable Chinese Spelling Correction Method Driven by Associative Knowledge Network
AxBERT: An Interpretable Chinese Spelling Correction Method Driven by Associative Knowledge Network
Fanyu Wang
Hangyu Zhu
Zhenping Xie
259
0
0
04 Mar 2025
Attention Condensation via Sparsity Induced Regularized Training
Eli Sason, Darya Frolova, Boris Nazarov, Felix Goldberd
03 Mar 2025

CipherPrune: Efficient and Scalable Private Transformer Inference
International Conference on Learning Representations (ICLR), 2025
Yancheng Zhang, Jinbao Xue, Mengxin Zheng, Mimi Xie, Mingzhe Zhang, Lei Jiang, Qian Lou
24 Feb 2025

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025
Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez Luna, Huawei Li, Xiaowei Li, Ying Wang, O. Mutlu
21 Feb 2025

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
12 Feb 2025

Breaking Down Bias: On The Limits of Generalizable Pruning Strategies
Conference on Fairness, Accountability and Transparency (FAccT), 2025
Sibo Ma, Alejandro Salinas, Peter Henderson, Julian Nyarko
11 Feb 2025

Ditto: Accelerating Diffusion Model via Temporal Value Similarity
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Sungbin Kim, Hyunwuk Lee, Wonho Cho, Mincheol Park, Won Woo Ro
20 Jan 2025

MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
Wenxuan Zeng, Ye Dong, Jinjin Zhou, Jin Tan, Tao Wei, Runsheng Wang, Meng Li
12 Jan 2025

EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Jaehoon Heo, Adiwena Putra, Jieon Yoon, Sungwoong Yune, Hangyeol Lee, Ji-Hoon Kim, Joo-Young Kim
10 Jan 2025

Multimodal joint prediction of traffic spatial-temporal data with graph sparse attention mechanism and bidirectional temporal convolutional network
Advanced Engineering Informatics (AEI), 2024
Dongran Zhang, Jiangnan Yan, K. Polat, A. Alhudhaif, Jun Li
31 Dec 2024

Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen
18 Dec 2024

18 Dec 2024
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong
Zhuoming Liu
Yin Li
Liwei Wang
490
36
0
04 Dec 2024
SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors
Design, Automation and Test in Europe (DATE), 2024
M. Rakka, Jiajian Li, Guohao Dai, A. Eltawil, M. Fouda, Fadi J. Kurdahi
26 Nov 2024

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
Yu Zhang, Ming Wang, Lancheng Zou, Wulong Liu, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu
25 Nov 2024

Page 1 of 4