
K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

3 July 2025

Shuhe Li

Chenxu Guo

Jiachen Lian

Cheol Jun Cho

Wenshuo Zhao

Xiner Xu

Ruiyu Jin

Xiaoyu Shi

Xuanru Zhou

Dingkun Zhou

Sam Wang

Grace Wang

Jingze Yang

Jingyi Xu

Ruohan Bao

Xingrui Chen

Elise Brenner

Brandon In

Francesca Pei

Maria Luisa Gorno-Tempini

Gopala Anumanchipalli


Main: 4 pages

1 figure

Bibliography: 1 page

3 tables

Abstract

Evaluating young children's language is challenging for automatic speech recognizers because of high-pitched voices, prolonged sounds, and limited data. We introduce K-Function, a framework that combines accurate sub-word transcription with objective, Large Language Model (LLM)-driven scoring. Its core, the Kids-Weighted Finite State Transducer (K-WFST), merges an acoustic phoneme encoder with a phoneme-similarity model to capture child-specific speech errors while remaining fully interpretable. K-WFST achieves a 1.39% phoneme error rate on MyST and 8.61% on Multitudes, absolute improvements of 10.47 and 7.06 percentage points, respectively, over a greedy-search decoder. An LLM then uses these high-quality transcripts to grade verbal skills, developmental milestones, reading, and comprehension, with results that align closely with human evaluators. Our findings show that precise phoneme recognition is essential for an effective assessment framework, enabling scalable language screening for children.
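For readers unfamiliar with the metric, phoneme error rate (PER) is conventionally computed as the edit distance between the reference and predicted phoneme sequences, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `phoneme_error_rate` and the ARPAbet-style example phonemes are illustrative assumptions.

```python
def phoneme_error_rate(reference, hypothesis):
    """Levenshtein distance between two phoneme sequences, divided by
    the reference length (standard PER definition; an assumption here,
    since the abstract does not spell out the exact scoring protocol)."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis phonemes.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ph in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_ph == hyp_ph else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical example: a child says "cat" with a substituted vowel.
reference = ["K", "AE", "T"]
hypothesis = ["K", "AH", "T"]
per = phoneme_error_rate(reference, hypothesis)
print(f"PER = {per:.2%}")  # one substitution over three phonemes
```

Under this definition, a single substituted phoneme in a three-phoneme word yields a PER of one third, which illustrates how strict a corpus-level figure such as 1.39% is.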
