
ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms

Main: 8 pages · Appendix: 4 pages · Bibliography: 4 pages · 4 figures · 4 tables
Abstract

Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, exploiting the computational invariance $\mathbf{y} = \mathbf{W}\mathbf{x} = (\mathbf{W}\mathbf{Q}^T)(\mathbf{Q}\mathbf{x})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms (Hadamard matrices, which achieve the optimal worst-case coherence $\mu = 1/\sqrt{n}$) that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than a one-size-fits-all approach. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard matrices, whose discrete $\{+1, -1\}$ entries are non-differentiable and thus preclude gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint preserves the theoretical guarantees on outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions that are more amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU, a negligible one-time cost. For LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves a perplexity of 15.4 versus 37.3 for QuIP. \href{this https URL}{Code} is available.
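To make the butterfly parameterization concrete, the sketch below (not the authors' implementation; `butterfly_apply` and its angle layout are hypothetical choices for illustration) builds an orthogonal transform from $\log_2 n$ stages of disjoint Givens rotations, giving $\frac{n \log n}{2}$ angles in total, and checks both orthogonality and the computational invariance $\mathbf{y} = \mathbf{W}\mathbf{x} = (\mathbf{W}\mathbf{Q}^T)(\mathbf{Q}\mathbf{x})$.

```python
import numpy as np

def butterfly_apply(x, thetas):
    """Apply an orthogonal butterfly transform to x (shape (..., n), n a power of 2).

    thetas has shape (log2(n), n // 2): one Givens rotation angle per butterfly
    pair per stage, i.e. (n * log2(n)) / 2 parameters in total. Each stage is a
    product of disjoint 2x2 rotations, so the overall map is orthogonal by
    construction, and applying it costs O(n log n).
    """
    n = x.shape[-1]
    stages = int(np.log2(n))
    y = np.array(x, dtype=float)
    for s in range(stages):
        stride = 1 << s           # distance between paired coordinates at stage s
        out = y.copy()
        pair = 0
        for start in range(0, n, 2 * stride):
            for j in range(stride):
                i, k = start + j, start + j + stride
                c, t = np.cos(thetas[s, pair]), np.sin(thetas[s, pair])
                out[..., i] = c * y[..., i] - t * y[..., k]
                out[..., k] = t * y[..., i] + c * y[..., k]
                pair += 1
        y = out
    return y

# Toy check of orthogonality and computational invariance (sizes are illustrative).
rng = np.random.default_rng(0)
n, m = 8, 4
thetas = rng.uniform(-np.pi, np.pi, size=(int(np.log2(n)), n // 2))

# Materialize the dense rotation: row i of butterfly_apply(I) is Q @ e_i, so Q is its transpose.
Q = butterfly_apply(np.eye(n), thetas).T
assert np.allclose(Q @ Q.T, np.eye(n))                     # orthogonality by construction

W, x = rng.standard_normal((m, n)), rng.standard_normal(n)
assert np.allclose(W @ x, (W @ Q.T) @ (Q @ x))             # y = Wx = (WQ^T)(Qx)
```

In a learnable setting, the same per-stage angles would be the trainable parameters (e.g. as a tensor in an autodiff framework), optimized on a small calibration set while orthogonality is maintained automatically because only rotation angles are updated.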
