FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference. International Conference on Learning Representations (ICLR), 2025.
Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing. International Conference on Learning Representations (ICLR), 2025.
Squeezed Attention: Accelerating Long Context Length LLM Inference. Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations. International Conference on Computational Linguistics (COLING), 2024.