Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference (arXiv:2406.10774)

16 June 2024
Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

Papers citing "Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference"

Showing 50 of 53 citing papers.
Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM. Zehao Fan, Garrett Gagnon, Zhenyu Liu, Liu Liu. 09 May 2025.
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference. Y. Chen, J. Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, ..., Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang. 05 May 2025.
Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions. Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sébastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan. 01 May 2025. [KELM, MU]
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs. Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, E. Ponti. 24 Apr 2025.
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments. Junyoung Park, Dalton Jones, Matt Morse, Raghavv Goel, Mingu Lee, Chris Lott. 21 Apr 2025.
Efficient Pretraining Length Scaling. Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, Xun Zhou. 21 Apr 2025.
AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference. Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, ..., Man Lung Yiu, Huan Li, Qiaomu Shen, Rui Mao, Bo Tang. 14 Apr 2025.
Adaptive Computation Pruning for the Forgetting Transformer. Zhixuan Lin, J. Obando-Ceron, Xu Owen He, Aaron C. Courville. 09 Apr 2025.
AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design. Yanbiao Liang, Huihong Shi, Haikuo Shao, Zhongfeng Wang. 07 Apr 2025.
Cognitive Memory in Large Language Models. Lianlei Shan, Shixian Luo, Zezhou Zhu, Yu Yuan, Yong Wu. 03 Apr 2025. [LLMAG, KELM]
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching. Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri. 01 Apr 2025.
SQuat: Subspace-orthogonal KV Cache Quantization. Hao Wang, Ligong Han, Kai Xu, Akash Srivastava. 31 Mar 2025. [MQ]
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs. Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han. 20 Mar 2025.
LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference. G. Wang, Shubhangi Upasani, Chen Henry Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, Urmish Thakker. 11 Mar 2025.
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving. Qihui Zhou, Peiqi Yin, Pengfei Zuo, James Cheng. 01 Mar 2025. [CLL]
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference. Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou. 28 Feb 2025.
SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention. Hong Yankun, Li Xing, Zhen Hui-Ling, Yu Xianzhi, Liu Wulong, Yuan Mingxuan. 24 Feb 2025. [MQ]
Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity. Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Li, Tiancheng Hu, Zhixia Liu, Xusheng Chen, Tao Xie, Yizhou Shan. 16 Feb 2025. [LRM]
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU. Heejun Lee, G. Park, Jaduk Suh, Sung Ju Hwang. 13 Feb 2025.
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding. Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli. 12 Feb 2025.
Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective. Yuan Feng, Junlin Lv, Y. Cao, Xike Xie, S. Kevin Zhou. 06 Feb 2025.
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache. Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, K. K., Amir Gholami. 05 Feb 2025. [MQ]
Can LLMs Maintain Fundamental Abilities under KV Cache Compression? Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu. 04 Feb 2025.
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. K. Zhou. 28 Jan 2025. [VLM]
KVDirect: Distributed Disaggregated LLM Inference. Shiyang Chen, Rain Jiang, Dezhi Yu, Jinlai Xu, Mengyuan Chao, Fanlong Meng, Chenyu Jiang, Wei Xu, Hang Liu. 28 Jan 2025.
MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference. Wenxuan Zeng, Ye Dong, Jinjin Zhou, Junming Ma, Jin Tan, Runsheng Wang, Meng Li. 12 Jan 2025.
Tensor Product Attention Is All You Need. Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Q. Gu, Andrew Chi-Chih Yao. 11 Jan 2025.
AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference. Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu. 04 Jan 2025.
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, ..., K. Zhang, C. L. P. Chen, Fan Yang, Y. Yang, Lili Qiu. 03 Jan 2025.
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, ..., Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze. 02 Jan 2025.
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition. Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, ..., Shaozuo Yu, Sitong Wu, Eric Lo, Shu-Lin Liu, Jiaya Jia. 12 Dec 2024. [AuLLM]
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression. Guangda Liu, C. Li, Jieru Zhao, Chenqi Zhang, M. Guo. 04 Dec 2024.
Unifying KV Cache Compression for Large Language Models with LeanKV. Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C. S. Lui, Haibo Chen. 04 Dec 2024. [MQ]
Squeezed Attention: Accelerating Long Context Length LLM Inference. Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, K. K., Amir Gholami. 14 Nov 2024.
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection. Wei Yu Wu, Zhuoshi Pan, Chao Wang, L. Chen, Y. Bai, Kun Fu, Z. Wang, Hui Xiong. 05 Nov 2024. [LLMAG]
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration. Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu. 29 Oct 2024. [VLM]
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference. Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen. 28 Oct 2024. [VLM]
EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models. Junhao Hu, Wenrui Huang, H. Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie. 20 Oct 2024. [RALM, LLMAG]
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models. Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, ..., Yao Dou, J. Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang. 14 Oct 2024.
HSR-Enhanced Sparse Attention Acceleration. Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao-quan Song. 14 Oct 2024.
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention. Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia. 07 Oct 2024.
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management. Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan. 01 Oct 2024.
CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs. Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie. 19 Sep 2024.
NanoFlow: Towards Optimal Large Language Model Serving Throughput. Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, ..., Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci. 22 Aug 2024.
A Tighter Complexity Analysis of SparseGPT. Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao-quan Song. 22 Aug 2024.
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding. Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, Beidi Chen. 20 Aug 2024. [LRM]
Post-Training Sparse Attention with Double Sparsity. Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng. 11 Aug 2024.
Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks. Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang. 11 Jul 2024. [MoMe]
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, ..., Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, L. Qiu. 02 Jul 2024.
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression. Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, ..., Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang. 21 Jun 2024. [MQ]