Robust Photo-Realistic Hand Gesture Generation: from Single View to Multiple View

Abstract

High-fidelity hand gesture generation represents a significant challenge in human-centric generation tasks. Existing methods typically employ single-view 3D MANO mesh-rendered images as a prior to enhance gesture generation quality. However, the complexity of hand movements and the inherent limitations of single-view rendering make it difficult to capture complete 3D hand information, particularly when fingers are occluded. The fundamental contradiction lies in the loss of 3D topological relationships through 2D projection and the incomplete spatial coverage inherent to single-view representations. Diverging from single-view prior approaches, we propose a multi-view prior framework, named Multi-Modal UNet-based Feature Encoder (MUFEN), to guide diffusion models in learning comprehensive 3D hand information. Specifically, we extend conventional front-view rendering to include rear, left, right, top, and bottom perspectives, selecting the most information-rich view combination as the training prior to address occlusion completion. This multi-view prior, together with a dedicated dual-stream encoder, significantly improves the model's understanding of complete hand features. Furthermore, we design a bounding box feature fusion module that fuses gesture localization features with multi-modal gesture features, enhancing the location awareness of the MUFEN features with respect to gesture-related regions. Experiments demonstrate that our method achieves state-of-the-art performance in both quantitative metrics and qualitative evaluations.
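
As a rough illustration of the view-selection step described above, the minimal Python sketch below scores each rendered MANO view by how much of the hand it exposes and keeps the highest-scoring views as the prior. The view names, the pixel-coverage score, and the select_views helper are illustrative assumptions for this sketch, not the paper's exact selection criterion.

import numpy as np

VIEWS = ["front", "rear", "left", "right", "top", "bottom"]

def view_information_score(hand_mask: np.ndarray) -> float:
    # Fraction of pixels covered by the rendered hand in one view
    # (a stand-in measure of how much hand content the view exposes).
    return float(hand_mask.sum()) / hand_mask.size

def select_views(masks, k=2):
    # Keep the k views whose renderings expose the most hand content.
    scores = {name: view_information_score(m) for name, m in masks.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in binary hand masks per view (in practice, from a MANO renderer).
    masks = {name: (rng.random((64, 64)) > 0.7).astype(np.uint8) for name in VIEWS}
    print(select_views(masks, k=2))

In this toy version the "information" in a view is simply its visible hand area; a real criterion could additionally weight occluded fingers or joint visibility.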

@article{fu2025_2505.10576,
  title={Robust Photo-Realistic Hand Gesture Generation: from Single View to Multiple View},
  author={Qifan Fu and Xu Chen and Muhammad Asad and Shanxin Yuan and Changjae Oh and Gregory Slabaugh},
  journal={arXiv preprint arXiv:2505.10576},
  year={2025}
}