
Compressing Large Language Models with Automated Sub-Network Search

Abstract

Large Language Models (LLMs) demonstrate exceptional reasoning abilities, enabling strong generalization across diverse tasks such as commonsense reasoning and instruction following. However, as LLMs scale, inference costs become increasingly prohibitive and accumulate significantly over a model's life cycle. In this paper, we consider model compression for LLMs to reduce model size while improving downstream task performance. We phrase this as a neural architecture search problem that automatically prunes structural components, such as attention heads, neurons, and layers, by searching for the Pareto-optimal set of sub-networks that balance performance and on-device latency. Compared to state-of-the-art structural pruning approaches and fine-tuned smaller sub-networks extracted from the pre-trained model, our method achieves an average improvement of up to 9.85% on 11 diverse downstream tasks, while improving on-device latency by up to 22%.
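
The search described in the abstract can be viewed as multi-objective optimization over structural pruning choices (how many layers, attention heads, and FFN neurons to keep). The sketch below is not the authors' implementation; it is a minimal, hypothetical illustration of the idea using random sampling of sub-network configurations, placeholder accuracy and latency objectives, and a Pareto-front filter. All names and numbers (SubNetwork, evaluate, the dimension defaults) are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass

# Hypothetical structural search space: each candidate keeps a number of
# transformer layers, attention heads per layer, and FFN neurons per layer.
@dataclass(frozen=True)
class SubNetwork:
    num_layers: int
    num_heads: int
    ffn_dim: int

def sample_subnetwork(max_layers=32, max_heads=32, max_ffn=11008):
    """Randomly sample one pruning configuration (assumed search space)."""
    return SubNetwork(
        num_layers=random.randint(max_layers // 2, max_layers),
        num_heads=random.choice(range(8, max_heads + 1, 4)),
        ffn_dim=random.choice([max_ffn // 4, max_ffn // 2, max_ffn]),
    )

def evaluate(net: SubNetwork):
    """Placeholder objectives: real code would evaluate the pruned model on
    downstream tasks and measure its on-device latency."""
    size = net.num_layers * (net.num_heads * 128 + net.ffn_dim)
    full = 32 * (32 * 128 + 11008)
    accuracy = 0.5 + 0.5 * size / full + random.gauss(0, 0.02)  # noisy proxy
    latency = 1e-6 * size                                       # proxy seconds
    return accuracy, latency

def pareto_front(scored):
    """Keep candidates not dominated in (maximize accuracy, minimize latency)."""
    front = []
    for i, (_, acc_i, lat_i) in enumerate(scored):
        dominated = any(
            acc_j >= acc_i and lat_j <= lat_i and (acc_j > acc_i or lat_j < lat_i)
            for j, (_, acc_j, lat_j) in enumerate(scored)
            if j != i
        )
        if not dominated:
            front.append(scored[i])
    return front

if __name__ == "__main__":
    candidates = [sample_subnetwork() for _ in range(50)]
    scored = [(c, *evaluate(c)) for c in candidates]
    for net, acc, lat in sorted(pareto_front(scored), key=lambda t: t[2]):
        print(f"{net}  acc~{acc:.3f}  latency~{lat:.4f}s")
```

In practice, the random sampling would be replaced by a proper multi-objective search strategy and the proxy objectives by measured downstream accuracy and real on-device latency, as the paper targets.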

@article{sukthanker2025_2410.06479,
  title={Compressing Large Language Models with Automated Sub-Network Search},
  author={Rhea Sanjay Sukthanker and Benedikt Staffler and Frank Hutter and Aaron Klein},
  journal={arXiv preprint arXiv:2410.06479},
  year={2025}
}