OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs

27 March 2025

Abstract

The use of omni-LLMs (large language models that accept any modality as input), particularly for multimodal cognitive state tasks involving speech, is understudied. We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks, IEMOCAP and MELD, and find that zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs in text-only and text-plus-audio settings. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs that focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare acoustic prompting to minimal prompting and full chain-of-thought prompting. We perform a context window analysis on IEMOCAP and MELD and find that using context helps, especially on IEMOCAP. We conclude with an error analysis of the acoustic reasoning outputs generated by the omni-LLMs.

@article{murzaku2025_2503.21480,
  title={OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs},
  author={John Murzaku and Owen Rambow},
  journal={arXiv preprint arXiv:2503.21480},
  year={2025}
}