SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads

7 February 2026

Tan Yu

Qian Qiao

Le Shen

Ke Zhou

Jincheng Hu

Dian Sheng

Bo Hu

Haoming Qin

Jun Gao

Changhai Zhou

Shunshun Yin

Siyuan Liu

Main: 8 pages, Bibliography: 3 pages, 3 figures, 4 tables

Abstract

Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.
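The abstract does not describe how the Temporal Audio Context Cache is built. As a rough illustration of the general idea only (keeping recent audio around so that a short incoming fragment is encoded together with its preceding context), a minimal sketch might look like the following. The class `AudioContextCache`, its method `extend_with_context`, and the helper `frame_budget_ms` are hypothetical names introduced here and are not taken from the paper or its code release; only the 96 FPS figure comes from the abstract.

```python
from collections import deque
from typing import List, Sequence


class AudioContextCache:
    """Hypothetical rolling cache of recent audio samples.

    This is an illustrative assumption, not the paper's implementation.
    """

    def __init__(self, context_samples: int) -> None:
        # A deque with maxlen drops the oldest samples once it is full.
        self._buffer = deque(maxlen=context_samples)

    def extend_with_context(self, chunk: Sequence[float]) -> List[float]:
        """Return cached context followed by the new chunk, then update the cache."""
        window = list(self._buffer) + list(chunk)
        self._buffer.extend(chunk)
        return window


def frame_budget_ms(fps: float) -> float:
    """Per-frame time budget in milliseconds at a given frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    cache = AudioContextCache(context_samples=3)
    print(cache.extend_with_context([1.0, 2.0]))  # no context cached yet
    print(cache.extend_with_context([3.0, 4.0]))  # previous chunk is prepended
    print(cache.extend_with_context([5.0]))       # only the last 3 samples are kept
    # 96 FPS is the Lite variant's reported speed on a single RTX 4090.
    print(round(frame_budget_ms(96), 2))
```

In this sketch a feature extractor would consume the returned window instead of the bare fragment, so even a very short chunk arrives with some history attached. At the reported 96 FPS, each frame has a budget of roughly 10.4 ms.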
