K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

3 July 2025

Shuhe Li

Chenxu Guo

Jiachen Lian

Cheol Jun Cho

Wenshuo Zhao

Xuanru Zhou

Dingkun Zhou

Sam Wang

Grace Wang

Jingze Yang

Jingyi Xu

Ruohan Bao

Elise Brenner

Brandon In

Francesca Pei

Maria Luisa Gorno-Tempini

Gopala Anumanchipalli


Main: 4 pages, 1 figure, 3 tables. Bibliography: 1 page.

Abstract

Early evaluation of children's language is frustrated by the high pitch, long phones, and sparse data that derail automatic speech recognisers. We introduce K-Function, a unified framework that combines accurate sub-word transcription, objective scoring, and actionable feedback. Its core, Kids-WFST, merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specific errors while remaining fully interpretable. Kids-WFST attains a phoneme error rate of 1.39% on MyST and 8.61% on Multitudes, absolute gains of 10.47 and 7.06 points over a greedy-search decoder. These high-fidelity transcripts power an LLM that grades verbal skills, milestones, reading, and comprehension, aligning with human proctors and supplying tongue-and-lip visualizations plus targeted advice. The results show that precise phoneme recognition completes the diagnostic-feedback loop, paving the way for scalable, clinician-ready language assessment.
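To make the reported metric concrete, the sketch below computes a phoneme error rate in the conventional way: Levenshtein edit distance between reference and hypothesis phoneme sequences, divided by the reference length. This is an assumption about the scoring; the abstract does not spell out the exact protocol. The function names and the toy phoneme sequences are hypothetical. The last lines back out the greedy-decoder error rates implied by the abstract, assuming the "absolute gains" are percentage-point reductions in phoneme error.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by reference length (assumed PER definition)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical toy example: one substitution and one insertion.
ref = ["K", "AE", "T"]
hyp = ["K", "AH", "T", "S"]
per = phoneme_error_rate(ref, hyp)

# Greedy-decoder error rates implied by the abstract's figures,
# assuming the gains are percentage-point reductions.
implied_greedy_myst = round(1.39 + 10.47, 2)
implied_greedy_multitudes = round(8.61 + 7.06, 2)
```

Under that reading, the greedy baseline would sit at roughly 11.86% on MyST and 15.67% on Multitudes.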
