Bench360: Benchmarking Local LLM Inference from 360°

12 November 2025
Linus Stuhlmann
Mauricio Fadel Argerich
Jonathan Fürst
arXiv:2511.16682 (abs · PDF · HTML) · GitHub (4★)
Main: 11 pages · Bibliography: 4 pages · Appendix: 3 pages · 6 figures · 9 tables
Abstract

Running large language models (LLMs) locally is becoming increasingly common. While the growing availability of small open-source models and inference engines has lowered the entry barrier, users now face an overwhelming number of configuration choices. Identifying an optimal configuration -- balancing functional and non-functional requirements -- requires substantial manual effort. While several benchmarks target LLM inference, they are designed for narrow evaluation goals and are not user-focused. They fail to integrate relevant system and task-specific metrics into a unified, easy-to-use benchmark that supports multiple inference engines, usage scenarios, and quantization levels. To address this gap, we present Bench360 -- Benchmarking Local LLM Inference from 360°. Bench360 allows users to easily define their own custom tasks, along with datasets and relevant task-specific metrics, and then automatically benchmarks selected LLMs, inference engines, and quantization levels across different usage scenarios (single stream, batch, and server). Bench360 tracks a wide range of metrics, including (1) system metrics -- such as Computing Performance (e.g., latency, throughput), Resource Usage (e.g., energy per query), and Deployment (e.g., cold start time) -- and (2) task-specific metrics such as ROUGE, F1 score, or accuracy. We demonstrate Bench360 on four common LLM tasks -- General Knowledge & Reasoning, QA, Summarization, and Text-to-SQL -- across three hardware platforms and four state-of-the-art inference engines. Our results reveal several interesting trade-offs between task performance and system-level efficiency, highlighting the differences between inference engines and models. Most importantly, there is no single best setup for local inference, which strongly motivates the need for a framework such as Bench360.
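To make the framework's scope concrete, below is a minimal Python sketch of the kind of measurement loop a benchmark like this automates for the single-stream scenario, paired with a simple task-specific metric. All names here (`run_single_stream`, `exact_match_accuracy`, the `generate` callable) are hypothetical illustrations of the idea, not Bench360's actual API.

```python
# Hypothetical sketch (not Bench360's actual API): measure per-query latency
# and throughput for a single-stream scenario, then score task quality.
import time


def run_single_stream(generate, prompts):
    """Send prompts one at a time; record per-query latency and throughput.

    `generate` is any callable mapping a prompt string to a completion
    string, e.g. a thin wrapper around an inference engine's client.
    """
    latencies, outputs = [], []
    for prompt in prompts:
        start = time.perf_counter()
        outputs.append(generate(prompt))
        latencies.append(time.perf_counter() - start)

    total_time = sum(latencies)
    return {
        "outputs": outputs,
        "mean_latency_s": total_time / len(prompts),
        "throughput_qps": len(prompts) / total_time,
    }


def exact_match_accuracy(predictions, references):
    """A simple task-specific metric, as a stand-in for ROUGE/F1/accuracy."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Toy "engine" so the sketch runs without a real LLM backend.
    def echo_engine(prompt):
        return prompt.upper()

    result = run_single_stream(echo_engine, ["hello", "world"])
    print(result["mean_latency_s"], result["throughput_qps"])
    print(exact_match_accuracy(result["outputs"], ["HELLO", "WORLD"]))
```

Batch and server scenarios would vary the same loop (concurrent requests, sustained load), which is why a unified harness that sweeps engines, models, and quantization levels saves the manual effort the abstract describes.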
