v1v2v3 (latest)

QET: Enhancing Quantized LLM Parameters and KV cache Compression through Element Substitution and Residual Clustering

4 July 2024

Abstract

Matrix quantization compresses matrix elements into a more compact form to reduce storage requirements, with dequantization enabling reconstruction for use. We define the Quantization Error Minimization (QEM) problem as minimizing the difference between the original and quantized matrices while ensuring the quantized matrix remains within fixed memory constraints. This technique is crucial in applications like Large Language Model (LLM) weight compression and KV cache compression, where large matrix sizes demand efficient storage solutions. As modern LLMs like GPT-4 and BERT continue to grow, effective matrix compression is increasingly important. These models contain billions of parameters in matrix form, making efficient weight quantization essential for both storage and computational efficiency. Similarly, KV caches, storing intermediate inference results, are matrix-based and benefit significantly from optimized compression techniques. To address the QEM problem in the context of LLM weight and KV cache compression, we propose Quantum Entanglement Trees (QET). QET leverages the local structure of matrix elements by iteratively swapping elements to create a locally ordered matrix, which is then grouped and quantized column by column. To enhance QET, we introduce two optimizations: residual quantization to further reduce Mean Squared Error (MSE) and masking with batch processing to accelerate the algorithm. Our experiments demonstrate that QET can reduce MSE to 12.3% of its original value at the same compression ratio, outperforming leading baseline methods. Our contributions include framing the QEM problem specifically for LLM and KV cache compression, developing the QET algorithm, and implementing optimizations that improve accuracy and processing speed.

View on arXiv

Comments on this paper