Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by approximately 3% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe demonstrates particular strength in preserving linguistic capabilities, showing an approximately 5% improvement over the baselines on language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating the preservation of multilingual capabilities.
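The abstract describes the method only at a high level; the sketch below illustrates one plausible reading of an iterative "prune a block, then briefly recover" loop using Hugging Face Transformers. The cosine-similarity importance score, the calibration text, the checkpoint name, and the recovery hyperparameters are placeholder assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of an iterative prune-then-recover loop in the spirit of IteRABRe.
# Everything below is an assumption for illustration: the abstract does not specify
# the block-importance criterion, the calibration data, or the recovery recipe.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # hypothetical checkpoint choice
NUM_BLOCKS_TO_DROP = 8                  # hypothetical compression target

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()

# Tiny placeholder calibration batch; the paper reports a 2.5M-token recovery budget.
calib = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

def most_redundant_block(model, batch):
    """Assumed criterion: the block whose output is most similar to its input."""
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True, use_cache=False)
    hs = out.hidden_states  # num_layers + 1 tensors: embeddings, then each block's output
    scores = [
        F.cosine_similarity(hs[i].flatten(1), hs[i + 1].flatten(1), dim=-1).mean().item()
        for i in range(len(hs) - 1)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def recover(model, batch, steps=10, lr=1e-5):
    """Placeholder recovery: a few causal-LM fine-tuning steps after each removal."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        loss = model(**batch, labels=batch["input_ids"], use_cache=False).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()

for _ in range(NUM_BLOCKS_TO_DROP):
    idx = most_redundant_block(model, calib)
    del model.model.layers[idx]                      # drop one decoder block
    model.config.num_hidden_layers = len(model.model.layers)
    recover(model, calib)                            # brief recovery before the next round
```

In this reading, each round removes a single decoder block and immediately runs a short recovery phase before the next importance measurement, so later pruning decisions are made on the partially healed model rather than the original one.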
@article{wibowo2025_2503.06291,
  title   = {IteRABRe: Iterative Recovery-Aided Block Reduction},
  author  = {Haryo Akbarianto Wibowo and Haiyue Song and Hideki Tanaka and Masao Utiyama and Alham Fikri Aji and Raj Dabre},
  journal = {arXiv preprint arXiv:2503.06291},
  year    = {2025}
}