arXiv:2509.14161

CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

17 September 2025
Brian Yan
Injy Hamed
Shuichiro Shimizu
Vasista Lodagala
William Chen
Olga Iakovenko
Bashar Talafha
Amir Hussein
Alexander Polok
Kalvin Chang
Dominik Klement
Sara Althubaiti
Puyuan Peng
Matthew Wiesner
Thamar Solorio
Ahmed M. Ali
Sanjeev Khudanpur
Shinji Watanabe
Chih-Chen Chen
Zhen Wu
Karim Benharrak
Anuj Diwan
Samuele Cornell
Eunjung Yeo
Kwanghee Choi
Carlos Carvalho
Karen Rosero
Main: 4 pages, 2 figures, 9 tables
Bibliography: 1 page
Abstract

We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of four test sets that together cover 113 unique code-switched language pairs across 52 languages: 1) a set of 14 X-English language pairs with real voices reading synthetically generated code-switched sentences, 2) a set of 16 X-English language pairs with generative text-to-speech, 3) a set of 60 {Arabic, Mandarin, Hindi, Spanish}-X language pairs with generative text-to-speech, and 4) a set of 45 lower-resourced X-English language pairs with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. We hope that CS-FLEURS helps broaden the scope of future code-switched speech research. Dataset link: this https URL.
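
The per-set pair counts in the abstract add up to more than the 113 unique pairs, which suggests that some language pairs appear in more than one test set. The following is a minimal sketch of that arithmetic, using only the figures quoted in the abstract. The `TestSet` class and its field names are hypothetical, introduced here for illustration, and are not part of the released dataset or any loader for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestSet:
    """Hypothetical record for one CS-FLEURS test set (illustration only)."""
    pair_pattern: str   # which language pairs the set covers
    num_pairs: int      # number of language pairs, as stated in the abstract
    speech_source: str  # how the audio was produced


# Figures copied from the abstract; nothing here is read from the dataset.
TEST_SETS = [
    TestSet("X-English", 14, "real voices reading synthetic text"),
    TestSet("X-English", 16, "generative text-to-speech"),
    TestSet("{Arabic, Mandarin, Hindi, Spanish}-X", 60, "generative text-to-speech"),
    TestSet("X-English (lower-resourced)", 45, "concatenative text-to-speech"),
]

UNIQUE_PAIRS = 113  # total unique pairs, as stated in the abstract

# Sum of the per-set counts, counting a pair once for every set it appears in.
total_listed = sum(t.num_pairs for t in TEST_SETS)

# Excess of listed pairs over unique pairs, i.e. repeat appearances across sets.
repeat_appearances = total_listed - UNIQUE_PAIRS

print(f"listed: {total_listed}, unique: {UNIQUE_PAIRS}, repeats: {repeat_appearances}")
```

The difference only gives the number of repeat appearances. The abstract does not say which pairs are shared between sets.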
