HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration
Main: 7 pages · 14 figures · 2 tables · Bibliography: 2 pages
Abstract
Quantization is critical for deploying large language models (LLMs) efficiently. Yet conventional methods remain hardware-agnostic: they optimize only for bit-width constraints and ignore intrinsic circuit characteristics such as the timing behavior and energy profiles of Multiply-Accumulate (MAC) units. This disconnect from circuit-level behavior prevents quantization from exploiting available timing margins and energy-saving opportunities, reducing the overall efficiency of deployment on modern accelerators.
