Ensemble-Guided Distillation for Compact and Robust Acoustic Scene Classification on Edge Devices

15 December 2025

Hossein Sharify

Behnam Raoufi

Mahdy Ramezani

Khosrow Hajsadeghi

Saeed Bagheri Shouraki


Main: 5 pages, 3 figures, 3 tables

Abstract

We present a compact, quantization-ready acoustic scene classification (ASC) framework that couples an efficient student network with a learned teacher ensemble and knowledge distillation. The student backbone stacks depthwise-separable "expand-depthwise-project" blocks with global response normalization, which stabilizes training and improves robustness to device and noise variability, and a global pooling head produces class logits for efficient edge inference. To inject richer inductive bias, we assemble a diverse set of teacher models and learn two complementary fusion heads: z1, which predicts per-teacher mixture weights using a student-style backbone, and z2, a lightweight MLP that performs per-class logit fusion. The student is distilled from the ensemble using temperature-scaled soft targets combined with hard labels, so that a single compact model approximates the ensemble's decision geometry. Evaluated on the TAU Urban Acoustic Scenes 2022 Mobile benchmark, our approach achieves state-of-the-art (SOTA) results under matched edge-deployment constraints, demonstrating strong performance and practicality for mobile ASC.
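The distillation objective described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the weighting factor `alpha`, the temperature value, and the T^2 scaling of the soft-target term (a common convention in knowledge distillation) are all assumptions made for illustration. The `weight_scores` argument is a hypothetical stand-in for the output of the z1 head; the z2 per-class MLP fusion is not sketched, since the abstract does not specify its form.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_by_teacher_weights(teacher_logits, weight_scores):
    """z1-style fusion (assumed form): one softmax-normalized weight per teacher."""
    weights = softmax(weight_scores)
    num_classes = len(teacher_logits[0])
    return [
        sum(w * t[c] for w, t in zip(weights, teacher_logits))
        for c in range(num_classes)
    ]

def distillation_loss(student_logits, ensemble_logits, label,
                      temperature=2.0, alpha=0.5):
    """Soft-target KL term plus hard-label cross-entropy (alpha is an assumption)."""
    p_t = softmax(ensemble_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_t, p_s) if pt > 0)
    # Cross-entropy against the hard label at temperature 1
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

# Two hypothetical teachers, three classes, equal mixture scores
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
fused = fuse_by_teacher_weights(teachers, weight_scores=[0.0, 0.0])
loss = distillation_loss([1.0, 0.2, -0.3], fused, label=0)
```

With equal mixture scores the fused logits are the plain average of the teacher logits. When the student's logits match the fused logits exactly, the soft-target term vanishes and only the hard-label cross-entropy remains.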
