Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs

9 March 2025

Umberto Cappellazzo

Minsu Kim

Stavros Petridis

Main: 6 pages, Bibliography: 2 pages, Appendix: 1 page, 7 figures, 4 tables

Abstract

Audio-Visual Speech Recognition (AVSR) leverages audio and visual modalities to improve robustness in noisy environments. Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including AVSR. However, speech representations are long, which makes them computationally expensive for LLMs to process. Prior methods compress inputs before feeding them to LLMs, but high compression often harms accuracy. To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which flexibly adapts audio-visual token allocation under varying compute constraints. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities with a single architecture, avoiding the need for separate models. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules. Evaluations on major AVSR datasets show that Llama-MTSK matches or outperforms models trained at fixed compression levels.
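
The idea of encoding one input at several granularities can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how tokens are compressed, so average pooling over consecutive tokens is assumed here, and the function names, the compression rates (1, 2, 4), and the toy input are all hypothetical.

```python
from typing import Dict, List, Sequence

Token = List[float]


def average_pool(tokens: Sequence[Token], rate: int) -> List[Token]:
    """Merge every `rate` consecutive tokens into one by averaging them.

    Assumption: average pooling stands in for whatever compression the
    paper actually uses; the abstract does not specify it.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled: List[Token] = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]  # last group may be shorter
        dim = len(group[0])
        pooled.append(
            [sum(tok[d] for tok in group) / len(group) for d in range(dim)]
        )
    return pooled


def multi_scale(tokens: Sequence[Token], rates: Sequence[int]) -> Dict[int, List[Token]]:
    """Return one compressed token sequence per compression rate."""
    return {rate: average_pool(tokens, rate) for rate in rates}


# Hypothetical input: 12 tokens with 2 features each.
tokens = [[float(i), float(2 * i)] for i in range(12)]

# Hypothetical compression rates; rate 1 keeps the sequence unchanged.
scales = multi_scale(tokens, rates=(1, 2, 4))
token_counts = {rate: len(seq) for rate, seq in scales.items()}
print(token_counts)
```

In a Matryoshka-style setup, a single model would be trained on all of these sequences, so that at inference time the compression rate can be chosen to fit the available compute: a higher rate means fewer tokens reach the LLM.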
