From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

13 March 2025
Kshitij Ambilduke
Ben Peters
Sonal Sannigrahi
Anil Keshwani
Tsz Kin Lam
Bruno Martins
Marcely Zanon Boito
André F. T. Martins
Abstract

Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
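The core idea of "speech discretization" in the abstract is that continuous speech features are quantized into a finite vocabulary of unit tokens, which the LLM can then consume as if they were text in one more language. A minimal sketch of that quantization step, assuming nearest-centroid (k-means-style) assignment and illustrative token names not taken from the paper:

```python
def discretize_speech(frames, centroids):
    """Map continuous speech feature frames to discrete unit tokens by
    nearest-centroid assignment -- a stand-in for the k-means quantization
    commonly used for speech discretization. The <su_N> token format is
    purely illustrative, not the paper's actual vocabulary."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    tokens = []
    for frame in frames:
        # Pick the index of the closest centroid for this frame.
        unit = min(range(len(centroids)), key=lambda i: sq_dist(frame, centroids[i]))
        # Render the unit as a pseudo-token so the discretized speech
        # stream can be interleaved with ordinary text tokens.
        tokens.append(f"<su_{unit}>")
    return tokens

# Toy example: 4 two-dimensional feature frames, 3 cluster centroids.
centroids = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
frames = [(0.1, -0.1), (0.9, 1.1), (2.1, 1.9), (0.0, 0.2)]
print(discretize_speech(frames, centroids))
# -> ['<su_0>', '<su_1>', '<su_2>', '<su_0>']
```

Once speech is in this token form, continued pre-training can treat transcription and speech translation as sequence-to-sequence tasks over mixed unit/text streams, which is what lets the abstract frame discretized speech as "an additional translation language".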

@article{ambilduke2025_2503.10620,
  title={From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM},
  author={Kshitij Ambilduke and Ben Peters and Sonal Sannigrahi and Anil Keshwani and Tsz Kin Lam and Bruno Martins and Marcely Zanon Boito and André F. T. Martins},
  journal={arXiv preprint arXiv:2503.10620},
  year={2025}
}