Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by approximately 3% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe demonstrates particular strength in preserving linguistic capabilities, showing an approximately 5% improvement over the baselines on language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating the preservation of multilingual capabilities.
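The abstract describes the method only at a high level; the sketch below illustrates one plausible reading of an iterative "prune a block, then briefly recover" loop using Hugging Face Transformers. The cosine-similarity importance score, the calibration text, the checkpoint name, and the recovery hyperparameters are placeholder assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of an iterative prune-then-recover loop in the spirit of IteRABRe.
# Everything below is an assumption for illustration: the abstract does not specify
# the block-importance criterion, the calibration data, or the recovery recipe.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # hypothetical checkpoint choice
NUM_BLOCKS_TO_DROP = 8                  # hypothetical compression target

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()

# Tiny placeholder calibration batch; the paper reports a 2.5M-token recovery budget.
calib = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

def most_redundant_block(model, batch):
    """Assumed criterion: the block whose output is most similar to its input."""
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True, use_cache=False)
    hs = out.hidden_states  # num_layers + 1 tensors: embeddings, then each block's output
    scores = [
        F.cosine_similarity(hs[i].flatten(1), hs[i + 1].flatten(1), dim=-1).mean().item()
        for i in range(len(hs) - 1)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def recover(model, batch, steps=10, lr=1e-5):
    """Placeholder recovery: a few causal-LM fine-tuning steps after each removal."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        loss = model(**batch, labels=batch["input_ids"], use_cache=False).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()

for _ in range(NUM_BLOCKS_TO_DROP):
    idx = most_redundant_block(model, calib)
    del model.model.layers[idx]                      # drop one decoder block
    model.config.num_hidden_layers = len(model.model.layers)
    recover(model, calib)                            # brief recovery before the next round
```

In this reading, each round removes a single decoder block and immediately runs a short recovery phase before the next importance measurement, so later pruning decisions are made on the partially healed model rather than the original one.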
@article{wibowo2025_2503.06291,
  title   = {IteRABRe: Iterative Recovery-Aided Block Reduction},
  author  = {Haryo Akbarianto Wibowo and Haiyue Song and Hideki Tanaka and Masao Utiyama and Alham Fikri Aji and Raj Dabre},
  journal = {arXiv preprint arXiv:2503.06291},
  year    = {2025}
}