MEMORY-VQ: Compression for Tractable Internet-Scale Memory

North American Chapter of the Association for Computational Linguistics (NAACL), 2023

28 August 2023

Santiago Ontañón

Sumit Sanghai

Joshua Ainslie

RALM

Abstract

Retrieval augmentation is a powerful but expensive method to make language models more knowledgeable about the world. Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference. However, storing these pre-computed representations greatly increases storage requirements. We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance. Our method uses a vector quantization variational autoencoder (VQ-VAE) to compress token representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a memory model that achieves a 16x compression rate with comparable performance on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even for extremely large retrieval corpora.
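To make the storage argument concrete, the following is a minimal sketch of vector quantization applied to a single token representation. It is an illustration of the general idea, not the paper's implementation: the codebooks are fixed by hand rather than learned with a VQ-VAE, splitting the vector into sub-vectors is an assumption made here for illustration, and all names (`nearest_code`, `quantize`, `dequantize`, `compression_ratio`) and sizes are hypothetical.

```python
import math


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(vector, codebooks):
    """Encode vector as one integer code per sub-vector.

    Assumption: the vector is split into len(codebooks) equal-sized
    sub-vectors, each with its own codebook.
    """
    sub_dim = len(vector) // len(codebooks)
    return [
        nearest_code(vector[i * sub_dim:(i + 1) * sub_dim], codebook)
        for i, codebook in enumerate(codebooks)
    ]


def dequantize(codes, codebooks):
    """Rebuild an approximate vector by concatenating the selected codes."""
    approx = []
    for code, codebook in zip(codes, codebooks):
        approx.extend(codebook[code])
    return approx


def compression_ratio(dim, bits_per_value, num_subspaces, codebook_size):
    """Bits for the raw vector divided by bits for its integer codes."""
    original_bits = dim * bits_per_value
    code_bits = num_subspaces * math.ceil(math.log2(codebook_size))
    return original_bits / code_bits


# Toy, hand-written codebooks (hypothetical; a real system would learn them).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 1.0]],
]
token_vector = [0.9, 1.2, 0.1, -0.2]

codes = quantize(token_vector, codebooks)   # stored instead of the vector
approx = dequantize(codes, codebooks)       # reconstructed at inference time
```

Only the integer codes are stored per token, while the codebooks are shared across the whole corpus, which is where the storage saving comes from. The ratio computed by `compression_ratio` depends entirely on the chosen dimensions and codebook size; the toy numbers here are not the configuration behind the 16x figure reported in the abstract.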
