PoolingVQ: A VQVAE Variant for Reducing Audio Redundancy and Boosting Multi-Modal Fusion in Music Emotion Analysis
Multimodal music emotion analysis leverages both audio and MIDI modalities to improve performance. While mainstream approaches focus on complex feature extraction networks, we propose that shortening audio feature sequences to mitigate redundancy, especially given MIDI's far more compact representation, can effectively boost task performance. To this end, we develop PoolingVQ, which combines a Vector Quantized Variational Autoencoder (VQVAE) with spatial pooling to compress audio feature sequences directly through codebook-guided local aggregation, and we devise a two-stage co-attention approach to fuse the audio and MIDI information. Experimental results on the public datasets EMOPIA and VGMIDI demonstrate that our multimodal framework achieves state-of-the-art performance, with PoolingVQ yielding an effective improvement. Our proposed method's code is available at Anonymous GitHub.
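A minimal sketch of the core idea described in the abstract, assuming PoolingVQ roughly amounts to temporal pooling followed by a standard VQ-VAE codebook lookup with a straight-through estimator; the class name and hyperparameters (pool_factor, codebook_size, beta) are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingVQSketch(nn.Module):
    """Hypothetical sketch: shorten an audio feature sequence, then quantize it."""
    def __init__(self, dim=256, codebook_size=512, pool_factor=4, beta=0.25):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=pool_factor, stride=pool_factor)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.beta = beta  # commitment-loss weight, as in standard VQ-VAE

    def forward(self, x):  # x: (batch, time, dim) audio features
        # Shorten the sequence along time to reduce redundancy.
        z = self.pool(x.transpose(1, 2)).transpose(1, 2)
        # Nearest-codeword lookup by squared Euclidean distance.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        idx = d.argmin(-1)
        q = self.codebook(idx)
        # Standard VQ-VAE losses and straight-through gradient.
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()
        return q, idx, vq_loss

# Example: a 400-frame audio feature sequence is compressed to 100 quantized frames.
feats = torch.randn(2, 400, 256)
quantized, codes, loss = PoolingVQSketch()(feats)
print(quantized.shape)  # torch.Size([2, 100, 256])
```

The compressed, quantized audio sequence would then be closer in length to the MIDI representation, which is the property the abstract credits for easier cross-modal fusion.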