v1v2 (latest)

Context-Driven Performance Modeling for Causal Inference Operators on Neural Processing Units

29 September 2025

ArXiv (abs)PDF HTML Github

Main:10 Pages

8 Figures

Bibliography:1 Pages

10 Tables

Abstract

The proliferation of large language models has driven demand for long-context inference on resource-constrained edge platforms. However, deploying these models on Neural Processing Units (NPUs) presents significant challenges due to architectural mismatch: the quadratic complexity of standard attention conflicts with NPU memory and compute patterns. This paper presents a comprehensive performance analysis of causal inference operators on a modern NPU, benchmarking quadratic attention against sub-quadratic alternatives including structured state-space models and causal convolutions. Our analysis reveals a spectrum of critical bottlenecks: quadratic attention becomes severely memory-bound with catastrophic cache inefficiency, while sub-quadratic variants span from compute-bound on programmable vector cores to memory-bound by data movement. These findings provide essential insights for co-designing hardware-aware models and optimization strategies to enable efficient long-context inference on edge platforms.

View on arXiv

Comments on this paper