Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

26 January 2024

Venkatesh Ravichandran

Abstract

We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy that draws further on LLM-encoded knowledge to understand the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models for more natural, conversational interaction between humans and speech-enabled AI agents.
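
The fusion idea in the abstract can be illustrated with a minimal sketch. This is not the paper's architecture, which the abstract does not specify: it assumes a simple concatenation of a per-step acoustic embedding with a per-step LLM embedding, followed by a linear classifier. The label set, the embedding sizes, the function name `fuse_and_classify`, and the random stand-in embeddings and weights are all hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical label set; the abstract only names turn-taking and backchanneling.
LABELS = ("continue", "turn_taking", "backchannel")


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_and_classify(acoustic_emb, text_emb, weights, bias):
    """Concatenate the two modalities per time step and apply a linear classifier.

    Concatenation fusion is an assumption made for this sketch; the paper may
    fuse the two models differently.
    """
    fused = np.concatenate([acoustic_emb, text_emb], axis=-1)
    logits = fused @ weights + bias
    return softmax(logits)


# Random stand-ins for real model outputs (5 time steps).
rng = np.random.default_rng(0)
acoustic_emb = rng.normal(size=(5, 8))   # assumed acoustic embedding size: 8
text_emb = rng.normal(size=(5, 16))      # assumed LLM embedding size: 16
weights = rng.normal(size=(8 + 16, len(LABELS)))  # untrained, illustrative only
bias = np.zeros(len(LABELS))

probs = fuse_and_classify(acoustic_emb, text_emb, weights, bias)
predicted = [LABELS[i] for i in probs.argmax(axis=-1)]
print(probs.shape, predicted)
```

In a real system the embeddings would come from the trained acoustic model and the LLM, and the classifier weights would be learned; here they are random, so the predicted labels carry no meaning.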
