85
v1v2 (latest)

A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

Main:9 Pages
5 Figures
Bibliography:2 Pages
18 Tables
Appendix:7 Pages
Abstract

Post-training pruning substantially reduces inference costs but often causes severe quality degradation without adapting the remaining weights. For LLMs, such retraining is commonly considered impractical due to large computational costs, motivating increasingly sophisticated pruning criteria to compensate by selecting better sparsity patterns. In this work, we revisit post-pruning adaptation and study local reconstruction: adapting only a small pruned submodel at a time using a small calibration set by matching intermediate activations of the dense model. We conduct a large-scale study across model families and scales (up to 72B parameters) and establish three central results. First, local reconstruction is an effective adaptation mechanism for LLMs, matching post-pruning PEFT while using over an order of magnitude less data and compute. Second, we identify a broad "free lunch" regime in reconstruction granularity: across a wide range of submodel sizes, final quality remains essentially unchanged, allowing granularity to be chosen based on memory constraints. Finally, with reconstruction, the pruning criterion becomes less critical: performance gaps between sophisticated methods and simple baselines shrink with model size, making simple methods competitive again. Collectively, our results challenge the prevailing narrative that post-pruning adaptation is impractical for LLMs.

View on arXiv
Comments on this paper