
EmoGene: Audio-Driven Emotional 3D Talking-Head Generation

Abstract

Audio-driven talking-head generation is a key technology for virtual human interaction and film-making. While recent advances have focused on improving image fidelity and lip synchronization, generating accurate emotional expressions remains underexplored. In this paper, we introduce EmoGene, a novel framework for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions. Our approach employs a variational autoencoder (VAE)-based audio-to-motion module to generate facial landmarks, which are concatenated with an emotion embedding in a motion-to-emotion module to produce emotional landmarks. These landmarks drive a Neural Radiance Fields (NeRF)-based emotion-to-video module to render realistic emotional talking-head videos. Additionally, we propose a pose sampling method to generate natural idle-state (non-speaking) videos for silent audio inputs. Extensive experiments demonstrate that EmoGene outperforms previous methods in generating high-fidelity emotional talking-head videos.
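The abstract describes a two-stage landmark pipeline: a VAE maps audio features to neutral facial landmarks, which are then concatenated with an emotion embedding to produce emotional landmarks that condition a NeRF renderer. The PyTorch sketch below illustrates that flow only in outline; all module names, feature dimensions, and interfaces here are illustrative assumptions, not the paper's released code.

```python
# Illustrative sketch of the EmoGene pipeline from the abstract.
# Shapes, layer sizes, and names are assumptions for demonstration.
import torch
import torch.nn as nn

class AudioToMotionVAE(nn.Module):
    """VAE mapping per-frame audio features to neutral facial landmarks."""
    def __init__(self, audio_dim=80, latent_dim=64, landmark_dim=68 * 3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, landmark_dim)
        )

    def forward(self, audio_feat):
        h = self.encoder(audio_feat)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample latent motion code z
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

class MotionToEmotion(nn.Module):
    """Concatenates neutral landmarks with an emotion embedding
    and predicts emotional landmarks."""
    def __init__(self, landmark_dim=68 * 3, num_emotions=8, emb_dim=32):
        super().__init__()
        self.emotion_emb = nn.Embedding(num_emotions, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(landmark_dim + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, landmark_dim),
        )

    def forward(self, landmarks, emotion_id):
        e = self.emotion_emb(emotion_id)           # (B, emb_dim)
        x = torch.cat([landmarks, e], dim=-1)      # fuse motion + emotion
        return self.mlp(x)                         # emotional landmarks

# Usage: the emotional landmarks would then condition the NeRF-based
# emotion-to-video renderer (not sketched here).
audio_feat = torch.randn(4, 80)                    # e.g., 4 frames of mel features
vae, m2e = AudioToMotionVAE(), MotionToEmotion()
neutral_lms, mu, logvar = vae(audio_feat)
emo_lms = m2e(neutral_lms, torch.tensor([3, 3, 3, 3]))  # emotion id 3 (illustrative)
print(emo_lms.shape)                               # torch.Size([4, 204])
```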

@article{wang2025_2410.17262,
  title={EmoGene: Audio-Driven Emotional 3D Talking-Head Generation},
  author={Wenqing Wang and Yun Fu},
  journal={arXiv preprint arXiv:2410.17262},
  year={2025}
}