ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

Junyu Wang
Tianrui Wang
Meng Ge
Longbiao Wang
Jianwu Dang
Main: 4 pages
2 figures
Bibliography: 1 page
3 tables
Abstract

In recent advances in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach. However, its attention mechanism often allocates a portion of the attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which mitigates ineffective attention allocation by combining dual-softmax operations with appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.

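The abstract does not spell out the exact formulation, but the description of "dual-softmax operations" with a tuned differential coefficient suggests an attention head that computes two softmax attention maps and subtracts one from the other. Below is a minimal PyTorch sketch under that assumption; the class and parameter names (DifferentialAttention, lambda_init) are illustrative and not taken from the paper.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentialAttention(nn.Module):
    """Single-head differential attention sketch: softmax(Q1K1^T) - lambda * softmax(Q2K2^T)."""

    def __init__(self, dim: int, lambda_init: float = 0.5):
        super().__init__()
        # Two query/key projections produce two separate attention maps.
        self.q1 = nn.Linear(dim, dim, bias=False)
        self.q2 = nn.Linear(dim, dim, bias=False)
        self.k1 = nn.Linear(dim, dim, bias=False)
        self.k2 = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Differential coefficient; learnable so it can be tuned during training.
        self.lam = nn.Parameter(torch.tensor(lambda_init))
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, spectrogram patches, dim)
        a1 = F.softmax(self.q1(x) @ self.k1(x).transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(self.q2(x) @ self.k2(x).transpose(-2, -1) * self.scale, dim=-1)
        # Subtracting the second map suppresses attention weights assigned to irrelevant tokens.
        attn = a1 - self.lam * a2
        return self.out(attn @ self.v(x))


if __name__ == "__main__":
    x = torch.randn(2, 128, 256)        # dummy (batch, patches, dim) embeddings
    layer = DifferentialAttention(dim=256)
    print(layer(x).shape)               # torch.Size([2, 128, 256])

In this sketch the subtraction cancels attention mass that both maps place on uninformative tokens, which matches the abstract's stated goal of reducing ineffective attention allocation.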
@article{wang2025_2507.02666,
  title={ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning},
  author={Junyu Wang and Tianrui Wang and Meng Ge and Longbiao Wang and Jianwu Dang},
  journal={arXiv preprint arXiv:2507.02666},
  year={2025}
}