61

Efficient 11-bit tensor approximations

Abstract

We present a spatially efficient decomposition of matrices and arbitrary-order tensors as linear combinations of tensor products of {1,1}\{-1, 1\}-valued vectors. For any matrix ARm×nA \in \mathbb{R}^{m \times n}, ARw=SwCwTw=j=1wcjsjtjA - R_w = S_w C_w T_w^\top = \sum_{j=1}^w c_j \cdot \mathbf{s}_j \mathbf{t}_j^\top is a {\it ww-width signed cut decomposition of AA}. Here Cw="diag"(cw)C_w = "diag"(\mathbf{c}_w) for some cwRw,\mathbf{c}_w \in \mathbb{R}^w, and Sw,TwS_w, T_w, and the vectors sj,tj\mathbf{s}_j, \mathbf{t}_j are {1,1}\{-1, 1\}-valued. To store (Sw,Tw,Cw)(S_w, T_w, C_w), we may pack w(m+n)w \cdot (m + n) bits, and require only ww floating point numbers. As a function of ww, RwF\|R_w\|_F exhibits exponential decay when applied to #f32 matrices with i.i.d. N(0,1)\mathcal N (0, 1) entries. Choosing ww so that (Sw,Tw,Cw)(S_w, T_w, C_w) has the same memory footprint as a \textit{f16} or \textit{bf16} matrix, the relative error is comparable. Our algorithm yields efficient signed cut decompositions in 2020 lines of pseudocode. It reflects a simple modification from a celebrated 1999 paper [1] of Frieze and Kannan. As a first application, we approximate the weight matrices in the open \textit{Mistral-7B-v0.1} Large Language Model to a 50%50\% spatial compression. Remarkably, all 226226 remainder matrices have a relative error <6%<6\% and the expanded model closely matches \textit{Mistral-7B-v0.1} on the {\it huggingface} leaderboard [2]. Benchmark performance degrades slowly as we reduce the spatial compression from 50%50\% to 25%25\%. We optimize our open source \textit{rust} implementation [3] with \textit{simd} instructions on \textit{avx2} and \textit{avx512} architectures. We also extend our algorithm from matrices to tensors of arbitrary order and use it to compress a picture of the first author's cat Angus.

View on arXiv
Comments on this paper