
- **Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling**. Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
- **KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning**. International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2024.
- **Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference**. Neural Information Processing Systems (NeurIPS), 2024.
- **Eigen Attention: Attention in Low-Rank Space for KV Cache Compression**. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
- **NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time**. Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
- **Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers**. Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Hai Yu, Jiaqing Liu, Yukun Ma, Chong Zhang.