DECAR: Deep Clustering for learning general-purpose Audio
Representations
In this paper, we introduce DECAR (DEep Clustering for learning general-purpose Audio Representations), a self-supervised pre-training approach for learning general-purpose audio representations. Our system is based on clustering: it uses an offline clustering step to produce pseudo-labels and trains the network with a classification loss supervised by these pseudo-labels. We build on recent advances in self-supervised learning for computer vision and design a lightweight, easy-to-use, self-supervised pre-training scheme for learning audio representations. We pre-train DECAR embeddings on a balanced subset of the large-scale AudioSet dataset and on FSD50K, and evaluate our representations on the LAPE Benchmark, which consists of 11 downstream classification tasks spanning speech, music, animal sounds, and acoustic scenes. Experimental results show that DECAR is competitive with the state of the art under both the linear evaluation and transfer learning evaluation paradigms across all downstream tasks in LAPE, and outperforms other prior art in the literature with just 15% of the total amount of data available for pre-training. Furthermore, we conduct ablation studies identifying key design choices, and we make all our code and pre-trained models publicly available.
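The clustering-based scheme the abstract describes can be sketched as follows: cluster embeddings offline to obtain pseudo-labels, then fit a classifier head on those pseudo-labels. This is a minimal numpy sketch of that loop under stated assumptions; the k-means routine, the linear softmax head, and all function names here are illustrative stand-ins, not DECAR's actual architecture or training recipe.

```python
import numpy as np

def kmeans_pseudo_labels(feats, k, iters=10, seed=0):
    """Offline clustering step: assign each embedding a pseudo-label (its cluster id)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from random samples (fancy indexing copies, so no aliasing).
    centroids = feats[rng.choice(len(feats), k, replace=False)].astype(float)
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance of every point to every centroid.
        dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = feats[labels == j].mean(0)
    return labels

def classifier_step(feats, pseudo_labels, W, b, lr=0.1):
    """One gradient step of softmax cross-entropy on the pseudo-labels
    (a stand-in for the classification loss used during pre-training).
    Updates W and b in place; returns the current mean loss."""
    logits = feats @ W + b
    logits -= logits.max(1, keepdims=True)          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(1, keepdims=True)
    onehot = np.eye(W.shape[1])[pseudo_labels]
    grad = (probs - onehot) / len(feats)
    W -= lr * feats.T @ grad
    b -= lr * grad.sum(0)
    return -np.log(probs[np.arange(len(feats)), pseudo_labels] + 1e-12).mean()

# Toy run: two synthetic "embedding" blobs stand in for audio features.
rng = np.random.default_rng(1)
feats = np.concatenate([rng.normal(0.0, 0.5, (50, 8)),
                        rng.normal(3.0, 0.5, (50, 8))])
pseudo = kmeans_pseudo_labels(feats, k=2)           # offline clustering
W, b = np.zeros((8, 2)), np.zeros(2)
losses = [classifier_step(feats, pseudo, W, b) for _ in range(20)]
```

In the full method this alternates at a larger scale: the encoder's embeddings are re-clustered periodically, and the network (not just a linear head) is trained against the refreshed pseudo-labels.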