
Deep Learning Meets Projective Clustering

Abstract

A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$, compute its rank-$j$ approximation $A_j$ via SVD, and then factor $A_j$ into a pair of matrices that correspond to smaller fully-connected layers which replace the original embedding layer. Geometrically, the rows of $A$ represent points in $\mathbb{R}^d$, and the rows of $A_j$ represent their projections onto the $j$-dimensional subspace that minimizes the sum of squared distances ("errors") to the points. In practice, the rows of $A$ may be spread around $k>1$ subspaces, so factoring $A$ based on a single subspace may lead to large errors that turn into large drops in accuracy. Inspired by \emph{projective clustering} from computational geometry, we suggest replacing this single subspace with a set of $k$ subspaces, each of dimension $j$, that minimizes the sum of squared distances from every point (row in $A$) to its \emph{closest} subspace. Based on this approach, we provide a novel architecture that replaces the original embedding layer with a set of $k$ small layers that operate in parallel and are then recombined with a single fully-connected layer. Extensive experimental results on the GLUE benchmark yield networks that are both more accurate and smaller than those obtained by the standard matrix factorization (SVD). For example, we further compress DistilBERT by reducing the size of its embedding layer by $40\%$ while incurring only a $0.5\%$ average drop in accuracy over all nine GLUE tasks, compared to a $2.8\%$ drop using the existing SVD approach. On RoBERTa we achieve $43\%$ compression of the embedding layer with less than a $0.8\%$ average drop in accuracy, compared to a $3\%$ drop previously. Open code for reproducing and extending our results is provided.
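To make the geometric idea concrete, the following is a minimal NumPy sketch (not the authors' released code) that contrasts the two objectives from the abstract: the single rank-$j$ SVD baseline versus a Lloyd-style alternating minimization for projective clustering, which repeatedly assigns each row of $A$ to its closest $j$-dimensional subspace and refits each subspace by SVD of its assigned rows. Function names, the restart strategy, and the iteration count are illustrative assumptions.

```python
import numpy as np

def svd_rank_j(A, j):
    # Baseline: best single rank-j approximation of A via truncated SVD.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :j] * s[:j]) @ Vt[:j]

def projective_clustering(A, k, j, iters=20, seed=0):
    # Lloyd-style heuristic for the (k, j)-projective clustering objective:
    # alternate between (1) fitting a j-dimensional subspace to each cluster
    # by SVD and (2) reassigning each row to its nearest subspace.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    labels = rng.integers(0, k, size=n)   # random initial assignment
    bases = [None] * k
    for _ in range(iters):
        for c in range(k):
            rows = A[labels == c]
            if rows.shape[0] == 0:        # re-seed an empty cluster
                rows = A[rng.integers(0, n, size=j)]
            _, _, Vt = np.linalg.svd(rows, full_matrices=False)
            bases[c] = Vt[:j]             # orthonormal basis, shape (<=j, d)
        # Squared distance of every row to each subspace (residual norm).
        dists = np.stack(
            [((A - (A @ V.T) @ V) ** 2).sum(axis=1) for V in bases], axis=1
        )
        labels = dists.argmin(axis=1)
    # Reconstruct each row by projecting onto its chosen subspace.
    A_hat = np.empty_like(A)
    for c in range(k):
        V = bases[c]
        A_hat[labels == c] = (A[labels == c] @ V.T) @ V
    return A_hat, labels
```

When the rows genuinely cluster around $k$ subspaces, the per-cluster fit can have much lower total squared error than the single-subspace SVD baseline; since the heuristic depends on the random initialization, running a few restarts and keeping the best solution is a common safeguard.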
