
Compressing Large Language Models using Low Rank and Low Precision Decomposition

Abstract

The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. CALDERA obtains this decomposition by formulating it as an optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that LlaMa-2 7B/70B and LlaMa-3 8B models compressed using CALDERA outperform existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera.
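To make the low-rank, low-precision decomposition concrete, below is a minimal illustrative sketch (not the CALDERA algorithm itself, which optimizes against calibration data $\mathbf{X}$ and quantizes $\mathbf{L}$ and $\mathbf{R}$ as well): it alternates between round-to-nearest quantization of the residual and a truncated-SVD fit of the low-rank factors. All function and parameter names here are hypothetical.

```python
# Illustrative sketch only: naive alternating refinement of W ~= Q + L @ R,
# using round-to-nearest uniform quantization and a truncated SVD.
# NOT the CALDERA implementation; names and parameters are hypothetical.
import numpy as np

def quantize_rtn(mat, bits=2):
    """Round-to-nearest uniform quantization onto a (2^bits)-level grid."""
    levels = 2 ** bits - 1
    lo, hi = mat.min(), mat.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((mat - lo) / scale) * scale + lo

def low_rank_plus_low_precision(W, rank=64, bits=2, iters=10):
    """Alternate between quantizing the residual and refitting low-rank factors."""
    L = np.zeros((W.shape[0], rank))
    R = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        Q = quantize_rtn(W - L @ R, bits=bits)           # low-precision backbone
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = U[:, :rank] * S[:rank]                       # best rank-r fit of the residual
        R = Vt[:rank, :]
    return Q, L, R

W = np.random.randn(512, 512)
Q, L, R = low_rank_plus_low_precision(W, rank=32, bits=2)
print("relative error:", np.linalg.norm(W - (Q + L @ R)) / np.linalg.norm(W))
```

In the paper's formulation, the Frobenius-norm objective is additionally weighted by the calibration data $\mathbf{X}$, so the fit is accurate on the directions the layer actually sees at inference time rather than uniformly over all of $\mathbf{W}$.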
