
Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank Approximation

Abstract

A common technique for compressing a neural network is to compute the $k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix $A\in\mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (or embedding layer). Here, $d$ is the number of neurons in the layer, $n$ is the number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory instead of $O(nd)$. This $\ell_2$-approximation minimizes the sum of the entries of the matrix $A - A_{k,2}$, each raised to the power $p=2$, over all matrices $A_{k,2}\in\mathbb{R}^{n\times d}$ of rank $k$. While it can be computed efficiently via SVD, the $\ell_2$-approximation is known to be very sensitive to outliers ("far-away" rows). Hence, machine learning often relies on the $\ell_1$-norm instead, e.g. in Lasso regression, $\ell_1$-regularization, and the $\ell_1$-SVM. This paper suggests replacing the $k$-rank $\ell_2$ approximation with an $\ell_p$ approximation, for $p\in[1,2]$. We then provide practical and provable approximation algorithms to compute it for any $p\geq 1$, based on modern techniques in computational geometry. Extensive experimental results on the GLUE benchmark for compressing BERT, DistilBERT, XLNet, and RoBERTa confirm this theoretical advantage. For example, our approach achieves $28\%$ compression of RoBERTa's embedding layer with only a $0.63\%$ additive drop in accuracy (without fine-tuning), averaged over all tasks in GLUE, compared to an $11\%$ drop using the existing $\ell_2$-approximation. Open code is provided for reproducing and extending our results.
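To make the compression setup concrete, the sketch below shows the $\ell_2$ baseline that the abstract describes: factoring a layer matrix $A$ into two rank-$k$ factors via truncated SVD, so storage drops from $O(nd)$ to $O((n+d)k)$. This is a minimal illustration of the baseline, not the paper's $\ell_p$ algorithm; the function name `rank_k_l2_approximation` and the sizes `n`, `d`, `k` are hypothetical choices for the example.

```python
# Minimal sketch of the l2 (SVD) baseline the paper compares against.
# NOT the paper's provable l_p algorithm; names and sizes are illustrative.
import numpy as np

def rank_k_l2_approximation(A: np.ndarray, k: int):
    """Return factors (L, R) such that L @ R is the best rank-k
    l2 (Frobenius) approximation of A, computed via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    L = U[:, :k] * s[:k]   # shape (n, k): left factor scaled by singular values
    R = Vt[:k, :]          # shape (k, d): right factor
    return L, R

# Hypothetical layer: d input neurons, n neurons in the next layer.
n, d, k = 1024, 768, 64
A = np.random.randn(n, d)
L, R = rank_k_l2_approximation(A, k)

original_params = n * d           # O(nd) storage for the dense layer
compressed_params = (n + d) * k   # O((n+d)k) storage for the two factors
print(f"stored parameters: {compressed_params} vs {original_params} "
      f"({compressed_params / original_params:.1%} of the original)")

# The paper's proposal replaces the l2 objective above, which is sensitive
# to outlier rows, with an l_p objective for p in [1, 2]; see the authors'
# released code for their algorithm.
```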
