ZIPA: A family of efficient models for multilingual phone recognition

We present ZIPA, a family of efficient speech models that advances the state of the art in crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions, together with a novel evaluation set covering unseen languages and sociophonetic variation. Trained on this large-scale data, the ZIPA models, comprising transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverage the efficient Zipformer backbone and outperform existing phone recognition systems with far fewer parameters. Scaling further via noisy student training on 11,000 hours of pseudo-labeled multilingual speech yields additional improvements. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.
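The noisy student training mentioned above follows a general self-training pattern: a trained teacher pseudo-labels unlabeled data, and a new student is trained on the union of labeled and pseudo-labeled examples, typically with input noise, then becomes the next teacher. A minimal sketch of that loop is below; all function names here are illustrative stand-ins, not the paper's actual pipeline, which uses Zipformer-based phone recognizers on real audio.

```python
# Toy sketch of the noisy student self-training loop (illustrative only):
# a teacher labels unlabeled inputs, and a student is trained on the
# combined labeled + pseudo-labeled set with (placeholder) input noise.

def train(dataset):
    """Stand-in for model training; returns a 'model' that memorizes labels."""
    memory = {x: y for x, y in dataset}
    return lambda x: memory.get(x, "UNK")

def add_noise(x):
    """Stand-in for input augmentation (real systems perturb audio features)."""
    return x  # identity here; a real pipeline would augment the input

def noisy_student(labeled, unlabeled, rounds=2):
    teacher = train(labeled)
    for _ in range(rounds):
        pseudo = [(x, teacher(x)) for x in unlabeled]  # teacher pseudo-labels
        student = train([(add_noise(x), y) for x, y in labeled + pseudo])
        teacher = student  # the student becomes the next round's teacher
    return teacher
```

For example, `noisy_student([("a", "A"), ("b", "B")], ["a", "b"])` returns a model that has absorbed both the labeled and pseudo-labeled examples. In the paper's setting, the unlabeled pool is 11,000 hours of multilingual speech.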
@article{zhu2025_2505.23170,
  title={ZIPA: A family of efficient models for multilingual phone recognition},
  author={Jian Zhu and Farhan Samir and Eleanor Chodroff and David R. Mortensen},
  journal={arXiv preprint arXiv:2505.23170},
  year={2025}
}