ReALLM: A general framework for LLM compression and fine-tuning

21 May 2024
Louis Leconte
Lisa Bedin
Van Minh Nguyen
Eric Moulines
    MQ
arXiv:2405.13155 · PDF · HTML
Abstract

We introduce ReALLM, a novel approach for compression and memory-efficient adaptation of pre-trained language models that encompasses most post-training quantization and fine-tuning methods for a budget of < 4 bits. Pre-trained matrices are decomposed into a high-precision low-rank component and a vector-quantized latent representation (obtained with an autoencoder). During the fine-tuning step, only the low-rank components are updated. Our results show that pre-trained matrices exhibit different patterns, and ReALLM adapts the shape of the encoder (small/large embedding, high/low-bit VQ, etc.) to each matrix. ReALLM represents each matrix with a small embedding on $b$ bits and a neural decoder model $\mathcal{D}_\phi$ whose weights are stored on $b_\phi$ bits. Decompressing a matrix requires only the embedding and a single forward pass through the decoder. Our weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of 3 bits without any training. With a budget of 2 bits, ReALLM achieves state-of-the-art performance after fine-tuning on a small calibration dataset.
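To make the decomposition concrete, here is a minimal PyTorch sketch (not the authors' code) of the idea: each weight matrix is reconstructed as a frozen term $\mathcal{D}_\phi(e)$ decoded from a small embedding, plus a trainable high-precision low-rank correction. The names (ReALLMLinear, decoder, embedding) are illustrative assumptions, and the actual bit-level quantization of the embedding and decoder weights is omitted.

```python
import torch
import torch.nn as nn

class ReALLMLinear(nn.Module):
    """Hypothetical layer: frozen decoder-reconstructed weight + trainable low-rank term."""

    def __init__(self, decoder: nn.Module, embedding: torch.Tensor,
                 out_features: int, in_features: int, rank: int = 16):
        super().__init__()
        # Frozen compressed state: a small embedding e (stored on b bits in the
        # paper; a plain tensor here) and a shared decoder D_phi (weights on
        # b_phi bits in the paper). Neither receives gradients.
        self.decoder = decoder.requires_grad_(False)
        self.register_buffer("embedding", embedding)
        # High-precision low-rank component A @ B: the only trainable part.
        # A starts at zero so the layer initially equals the decompressed weight.
        self.A = nn.Parameter(torch.zeros(out_features, rank))
        self.B = nn.Parameter(torch.randn(rank, in_features) * 0.01)

    def weight(self) -> torch.Tensor:
        # Decompression: one embedding, one decoder forward pass.
        with torch.no_grad():
            w_base = self.decoder(self.embedding)  # D_phi(e), shape (out, in)
        return w_base + self.A @ self.B

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight().t()


# Toy usage: a linear "decoder" that maps a 64-dim embedding to a 128x256 matrix.
decoder = nn.Sequential(nn.Linear(64, 128 * 256), nn.Unflatten(-1, (128, 256)))
layer = ReALLMLinear(decoder, torch.randn(64), out_features=128, in_features=256)
y = layer(torch.randn(4, 256))   # shape (4, 128); gradients flow only into A and B
```

As in LoRA-style methods, zero-initializing the low-rank factor means fine-tuning starts exactly from the decompressed weights, and only the low-rank factors accumulate gradients on the calibration data.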
