Loki: Low-Rank Keys for Efficient Sparse Attention

4 June 2024

Prajwal Singhania

Siddharth Singh

Papers citing "Loki: Low-Rank Keys for Efficient Sparse Attention"

Title
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs Shibo Jie Yehui Tang Kai Han Zhi-Hong Deng Jing Han 87 0 0 20 Mar 2025
Attention Condensation via Sparsity Induced Regularized Training Eli Sason Darya Frolova Boris Nazarov Felix Goldberd 75 0 0 03 Mar 2025
SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention Hong Yankun Li Xing Zhen Hui-Ling Yu Xianzhi Liu Wulong Yuan Mingxuan MQ 78 0 0 24 Feb 2025
Tensor Product Attention Is All You Need Yifan Zhang Yifeng Liu Huizhuo Yuan Zhen Qin Yang Yuan Q. Gu Andrew Chi-Chih Yao 67 8 0 11 Jan 2025
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval Di Liu Meng Chen Baotong Lu Huiqiang Jiang Zhenhua Han ... K. Zhang C. L. P. Chen Fan Yang Y. Yang Lili Qiu 39 29 0 03 Jan 2025
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference Hanshi Sun Li-Wen Chang Wenlei Bao Size Zheng Ningxin Zheng Xin Liu Harry Dong Yuejie Chi Beidi Chen VLM 85 16 0 28 Oct 2024
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding Jian Chen Vashisth Tiwari Ranajoy Sadhukhan Zhuoming Chen Jinyuan Shi Ian En-Hsu Yen Avner May Tianqi Chen Beidi Chen LRM 31 21 0 20 Aug 2024
ThinK: Thinner Key Cache by Query-Driven Pruning Yuhui Xu Zhanming Jie Hanze Dong Lei Wang Xudong Lu Aojun Zhou Amrita Saha Caiming Xiong Doyen Sahoo 58 14 0 30 Jul 2024
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention Ramya Prabhu Ajay Nayak Jayashree Mohan R. Ramjee Ashish Panwar VLM 46 24 0 07 May 2024
ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation Z. Yao Xiaoxia Wu Cheng-rong Li Stephen Youn Yuxiong He MQ 63 56 0 15 Mar 2023