
K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

Shuhe Li
Chenxu Guo
Jiachen Lian
Cheol Jun Cho
Wenshuo Zhao
Xiner Xu
Ruiyu Jin
Xiaoyu Shi
Xuanru Zhou
Dingkun Zhou
Sam Wang
Grace Wang
Jingze Yang
Jingyi Xu
Ruohan Bao
Xingrui Chen
Elise Brenner
Brandon In
Francesca Pei
Maria Luisa Gorno-Tempini
Gopala Anumanchipalli
Main: 4 pages
Bibliography: 1 page
1 figure
3 tables
Abstract

Evaluating young children's language is challenging for automatic speech recognizers because of high-pitched voices, prolonged sounds, and limited data. We introduce K-Function, a framework that combines accurate sub-word transcription with objective, Large Language Model (LLM)-driven scoring. Its core, the Kids-Weighted Finite State Transducer (K-WFST), merges an acoustic phoneme encoder with a phoneme-similarity model to capture child-specific speech errors while remaining fully interpretable. K-WFST achieves a 1.39% phoneme error rate on MyST and 8.61% on Multitudes, absolute improvements of 10.47 and 7.06 percentage points, respectively, over a greedy-search decoder. An LLM then uses these high-quality transcripts to grade verbal skills, developmental milestones, reading, and comprehension, with results that align closely with human evaluators. Our findings show that precise phoneme recognition is essential for an effective assessment framework, enabling scalable language screening for children.
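
For readers unfamiliar with the metric, the phoneme error rate (PER) quoted above is conventionally the phoneme-level edit distance between reference and hypothesis, divided by the total number of reference phonemes. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names and the toy ARPAbet-style transcripts are assumptions made for illustration.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over phonemes."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Corpus-level PER in percent: total edits / total reference phonemes."""
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference transcripts contain no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total_errors / total_ref


# Hypothetical toy data: "cat" with a vowel substitution, "rabbit" with r -> w.
refs = [["K", "AE", "T"], ["R", "AE", "B", "IH", "T"]]
hyps = [["K", "AH", "T"], ["W", "AE", "B", "IH", "T"]]
per = phoneme_error_rate(refs, hyps)  # 2 edits over 8 reference phonemes

# If the reported gains are absolute (percentage-point) differences, the implied
# greedy-decoder PERs follow by addition. This is an inference from the abstract.
implied_greedy_myst = round(1.39 + 10.47, 2)
implied_greedy_multitudes = round(8.61 + 7.06, 2)
```

Under that reading, the greedy-search baseline would sit at roughly 11.86% PER on MyST and 15.67% on Multitudes; these baseline figures are derived here by arithmetic and are not stated in the abstract.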
