Large language models (LLMs) have significantly advanced natural language processing, yet their heavy resource demands create severe barriers to hardware accessibility and drive high energy consumption. This paper presents a focused, high-level review of post-training quantization (PTQ) techniques designed to optimize the inference efficiency of LLMs for end-users, including details on the main quantization schemes, granularities, and their trade-offs. The aim is to provide a balanced overview of both the theory and the applications of post-training quantization.
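To make the abstract's core technique concrete, the snippet below is a minimal, illustrative Python sketch of symmetric round-to-nearest ("absmax") INT8 quantization, a common PTQ baseline, shown at two granularities (per-tensor and per-channel). It is an assumption-level example for orientation, not code from the paper.

import numpy as np

def quantize_absmax_int8(w, axis=None):
    """Symmetric round-to-nearest INT8 quantization.

    axis=None selects per-tensor granularity (one scale for the whole tensor);
    axis=1 selects per-channel granularity (one scale per output row).
    """
    scale = np.abs(w).max(axis=axis, keepdims=axis is not None) / 127.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

# Compare per-tensor vs. per-channel quantization error on a random weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
for axis, name in [(None, "per-tensor"), (1, "per-channel")]:
    q, s = quantize_absmax_int8(w, axis=axis)
    err = np.abs(w - dequantize(q, s)).max()
    print(f"{name}: max abs error = {err:.5f}")

Finer granularity (per-channel rather than per-tensor) typically lowers quantization error at the cost of storing more scale factors, which is one instance of the scheme/granularity trade-offs the review surveys.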
@article{jørgensen2025_2505.08620,
  title   = {Resource-Efficient Language Models: Quantization for Fast and Accessible Inference},
  author  = {Tollef Emil Jørgensen},
  journal = {arXiv preprint arXiv:2505.08620},
  year    = {2025}
}