arXiv:2502.14837
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
21 February 2025
Tao Ji, B. Guo, Y. Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui
Papers citing "Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs" (5 papers shown)
Hardware-Efficient Attention for Fast Decoding
Ted Zadouri, Hubert Strauss, Tri Dao · 27 May 2025
Understanding Differential Transformer Unchains Pretrained Self-Attentions
Chaerin Kong, Jiho Jang, Nojun Kwak · 22 May 2025
A3: an Analytical Low-Rank Approximation Framework for Attention
Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao · 19 May 2025
Tags: OffRL, MQ
FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
Pencuo Zeren, Qiuming Luo, Rui Mao, Chang Kong · 13 May 2025
Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation
Yi Lu, Wanxu Zhao, Xin Zhou, Chenxin An, Cong Wang, ..., Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang · 26 Apr 2025