HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration
Main: 7 pages · 14 figures · 2 tables · Bibliography: 2 pages
Abstract
Quantization is critical for deploying large language models (LLMs) efficiently. Yet conventional methods remain hardware-agnostic: they optimize only for bit-width constraints and ignore intrinsic circuit characteristics such as the timing behavior and energy profiles of Multiply-Accumulate (MAC) units. This disconnect from circuit-level behavior prevents quantization from exploiting available timing margins and energy-saving opportunities, reducing the overall efficiency of deployment on modern accelerators.
