
HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration

Main: 7 pages · 14 figures · 2 tables · Bibliography: 2 pages
Abstract

Quantization is critical for deploying large language models (LLMs) efficiently. Yet conventional methods remain hardware-agnostic: they constrain only bit-width and do not account for intrinsic circuit characteristics such as the timing behavior and energy profiles of Multiply-Accumulate (MAC) units. This disconnect from circuit-level behavior leaves available timing margins and energy-saving opportunities unexploited, reducing overall deployment efficiency on modern accelerators.
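To make the contrast concrete, the following is a minimal sketch of the kind of hardware-agnostic, bit-width-only quantization the abstract refers to: a per-tensor symmetric uniform quantizer that sees nothing but the bit budget. The function names and the per-tensor scaling choice are illustrative assumptions, not HALO's method, which additionally accounts for MAC-unit timing and energy.

```python
import numpy as np

def quantize_uniform(weights: np.ndarray, bits: int = 4):
    """Bit-width-only symmetric quantization (hardware-agnostic baseline).

    Picks a single per-tensor scale from the maximum magnitude and rounds
    each weight to the nearest point on a signed integer grid. No circuit
    property (critical-path delay, MAC energy) influences the codebook.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(weights)) / qmax     # one scale for the tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map integer codes back to approximate float weights."""
    return q.astype(np.float32) * scale

# Round-trip a small weight vector through 4-bit quantization.
w = np.array([0.12, -0.5, 0.33, -0.07], dtype=np.float32)
q, s = quantize_uniform(w, bits=4)
w_hat = dequantize(q, s)
```

Because rounding is to the nearest grid point, the per-element reconstruction error of such a quantizer is bounded by half the scale; what it cannot do is prefer integer codes whose MAC operations are faster or cheaper in energy, which is the gap hardware-aware schemes target.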
