SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
Hanrui Wang, Zhekai Zhang, Song Han
17 December 2020 (arXiv:2012.09852)

Papers citing "SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning" (50 of 160 shown)
• Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification
  Jungmin Yun, Mihyeon Kim, Youngbin Kim (03 Jun 2024)

• The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving
  Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan (18 May 2024) [LLMAG]

• A Survey on Transformers in NLP with Focus on Efficiency
  Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti (15 May 2024) [MedIm]

• QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
  Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han (07 May 2024)

• A Survey on Efficient Inference for Large Language Models
  Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, ..., Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu-Xiang Wang (22 Apr 2024)
• Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration
  Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao (18 Apr 2024)

• Self-Selected Attention Span for Accelerating Large Language Model Inference
  Tian Jin, W. Yazar, Zifei Xu, Sayeh Sharify, Xin Eric Wang (14 Apr 2024) [LRM]

• Lightweight Deep Learning for Resource-Constrained Environments: A Survey
  Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng (08 Apr 2024)

• Towards Pareto Optimal Throughput in Small Language Model Serving
  Pol G. Recasens, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral (04 Apr 2024)

• CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference
  Ruqi Liao, Chuqing Zhao, Jin Li, Weiqi Feng (02 Apr 2024)

• ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
  Youpeng Zhao, Di Wu, Jun Wang (26 Mar 2024)

• Accelerating ViT Inference on FPGA through Static and Dynamic Pruning
  Dhruv Parikh, Shouyi Li, Bingyi Zhang, Rajgopal Kannan, Carl E. Busart, Viktor Prasanna (21 Mar 2024)
• Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
  Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, E. Ponti (14 Mar 2024)
• Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
  Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath (14 Mar 2024)

• CHAI: Clustered Head Attention for Efficient LLM Inference
  Saurabh Agarwal, Bilge Acun, Basil Homer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu (12 Mar 2024)

• Model Compression and Efficient Inference for Large Language Models: A Survey
  Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He (15 Feb 2024) [MQ]
• HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference
  Yashas Samaga, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli (14 Feb 2024)
• PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition
  Jinghui Lu, Ziwei Yang, Yanjie Wang, Xuejing Liu, Brian Mac Namee, Can Huang (07 Feb 2024) [MoE]

• Compressing Deep Reinforcement Learning Networks with a Dynamic Structured Pruning Method for Autonomous Driving
  Wensheng Su, Zhenni Li, Minrui Xu, Jiawen Kang, Dusit Niyato, Shengli Xie (07 Feb 2024)

• ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters
  Shiwei Liu, Guanchen Tao, Yifei Zou, Derek Chow, Zichen Fan, Kauna Lei, Bangfei Pan, Dennis Sylvester, Gregory Kielian, Mehdi Saligane (31 Jan 2024)

• A Survey on Hardware Accelerators for Large Language Models
  C. Kachris (18 Jan 2024)

• A Temporal-Spectral Fusion Transformer with Subject-Specific Adapter for Enhancing RSVP-BCI Decoding
  Xujin Li, Wei Wei, Shuang Qiu, Huiguang He (12 Jan 2024)

• FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs
  Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, ..., Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang (08 Jan 2024)

• A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models
  Harsh Sharma, Pratyush Dhingra, J. Doppa, Ümit Y. Ogras, P. Pande (18 Dec 2023)

• LLM in a flash: Efficient Large Language Model Inference with Limited Memory
  Keivan Alizadeh-Vahid, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, C. C. D. Mundo, Mohammad Rastegari, Mehrdad Farajtabar (12 Dec 2023)

• A Hardware Evaluation Framework for Large Language Model Inference
  Hengrui Zhang, August Ning, R. Prabhakar, D. Wentzlaff (05 Dec 2023) [ELM]

• PLUM: Improving Inference Efficiency By Leveraging Repetition-Sparsity Trade-Off
  Sachit Kuhar, Yash Jain, Alexey Tumanov (04 Dec 2023) [MQ]

• Transformer-QEC: Quantum Error Correction Code Decoding with Transferable Transformers
  Hanrui Wang, Pengyu Liu, Kevin Shao, Dantong Li, Jiaqi Gu, David Z. Pan, Yongshan Ding, Song Han (27 Nov 2023)

• REST: Retrieval-Based Speculative Decoding
  Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He (14 Nov 2023) [RALM]

• Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads
  Hongxiang Fan, Stylianos I. Venieris, Alexandros Kouris, Nicholas D. Lane (17 Oct 2023)
• Efficient Streaming Language Models with Attention Sinks
  Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis (29 Sep 2023) [AI4TS, RALM]
• LLMCad: Fast and Scalable On-device Large Language Model Inference
  Daliang Xu, Wangsong Yin, Xin Jin, Y. Zhang, Shiyun Wei, Mengwei Xu, Xuanzhe Liu (08 Sep 2023)

• Mobile Foundation Model as Firmware
  Jinliang Yuan, Chenchen Yang, Dongqi Cai, Shihe Wang, Xin Yuan, ..., Di Zhang, Hanzi Mei, Xianqing Jia, Shangguang Wang, Mengwei Xu (28 Aug 2023)

• Discrete Prompt Compression with Reinforcement Learning
  Hoyoun Jung, Kyung-Joong Kim (17 Aug 2023)

• A Survey of Techniques for Optimizing Transformer Inference
  Krishna Teja Chitty-Venkata, Sparsh Mittal, M. Emani, V. Vishwanath, Arun Somani (16 Jul 2023)

• ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers
  Gamze Islamoglu, Moritz Scherer, G. Paulin, Tim Fischer, Victor J. B. Jung, Angelo Garofalo, Luca Benini (07 Jul 2023) [MQ]

• Accelerating Transducers through Adjacent Token Merging
  Yuang Li, Yu-Huan Wu, Jinyu Li, Shujie Liu (28 Jun 2023)

• Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference
  Junyan Li, Li Lyna Zhang, Jiahang Xu, Yujing Wang, Shaoguang Yan, ..., Ting Cao, Hao-Lun Sun, Weiwei Deng, Qi Zhang, Mao Yang (26 Jun 2023)
• H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
  Zhenyu (Allen) Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, ..., Yuandong Tian, Christopher Ré, Clark W. Barrett, Zhangyang Wang, Beidi Chen (24 Jun 2023) [VLM]
• S$^3$: Increasing GPU Utilization during Generative Inference for Higher Throughput
  Yunho Jin, Chun-Feng Wu, David Brooks, Gu-Yeon Wei (09 Jun 2023)
• Faster Causal Attention Over Large Sequences Through Sparse Flash Attention
  Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret (01 Jun 2023) [LRM]
• AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
  Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han (01 Jun 2023) [EDL, MQ]

• CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
  Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang (27 May 2023) [VLM]

• Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
  Hongjie Wang, Bhishma Dedhia, N. Jha (27 May 2023) [ViT, VLM]

• SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models
  Zekun Wang, Jingchang Chen, Wangchunshu Zhou, Haichao Zhu, Jiafeng Liang, Liping Shan, Ming Liu, Dongliang Xu, Qing Yang, Bing Qin (24 May 2023) [VLM]

• NeuralMatrix: Compute the Entire Neural Networks with Linear Matrix Operations for Efficient Inference
  Ruiqi Sun, Siwei Ye, Jie Zhao, Xin He, Yiran Li, An Zou (23 May 2023)

• Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
  Yijia Zhang, Lingran Zhao, Shijie Cao, Wenqiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu (21 May 2023) [MQ]

• Boost Vision Transformer with GPU-Friendly Sparsity and Quantization
  Chong Yu, Tao Chen, Zhongxue Gan, Jiayuan Fan (18 May 2023) [MQ, ViT]

• SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification
  Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, ..., Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia (16 May 2023) [LRM]

• Tomography of Quantum States from Structured Measurements via quantum-aware transformer
  Hailan Ma, Zhenhong Sun, Daoyi Dong, Chunlin Chen, H. Rabitz (09 May 2023)