
EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans

Yingjie Zhou, Xilei Zhu, Siyu Ren, Ziyi Zhao, Ziwen Wang, Farong Wen, Yu Zhou, Jiezhang Cao, Xiongkuo Min, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu
Abstract

Speech-driven Talking Human (TH) generation, commonly known as "Talker," remains limited in its ability to drive multiple subjects. Extending this paradigm to the "Multi-Talker," which animates multiple subjects simultaneously, brings richer interactivity and stronger immersion to audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) produced by 15 representative Multi-Talkers from 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among the Multi-Talkers and identify 12 common types of distortion. Furthermore, we introduce EvalTalker, a novel TH quality assessment framework that jointly perceives global quality, human characteristics, and identity consistency, and integrates Qwen-Sync to assess multimodal synchrony. Experimental results demonstrate that EvalTalker achieves superior correlation with subjective scores, providing a robust foundation for future research on high-quality Multi-Talker generation and evaluation.
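The abstract evaluates EvalTalker by its correlation with subjective scores. As an illustrative sketch only (not the authors' released code), the snippet below shows how such agreement with subjective Mean Opinion Scores (MOS) is commonly quantified in quality assessment work, using SROCC and PLCC; the function name and the example numbers are hypothetical.

```python
# Minimal sketch: measuring agreement between a quality model's predictions
# and subjective MOS, as is standard in quality assessment studies.
# This is an assumption about the evaluation protocol, not the paper's code.
import numpy as np
from scipy import stats


def correlation_with_subjective_scores(predicted, mos):
    """Return (SROCC, PLCC) between predicted quality scores and MOS."""
    predicted = np.asarray(predicted, dtype=float)
    mos = np.asarray(mos, dtype=float)
    srocc, _ = stats.spearmanr(predicted, mos)  # rank (monotonic) agreement
    plcc, _ = stats.pearsonr(predicted, mos)    # linear agreement
    return srocc, plcc


if __name__ == "__main__":
    # Hypothetical scores for a handful of Multi-Talker-generated videos.
    preds = [0.62, 0.71, 0.45, 0.88, 0.30]
    mos = [3.1, 3.6, 2.4, 4.2, 1.9]
    srocc, plcc = correlation_with_subjective_scores(preds, mos)
    print(f"SROCC={srocc:.3f}, PLCC={plcc:.3f}")
```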

Comments: 8 pages (main) + 3 pages bibliography, 5 figures, 5 tables