Hi5: Synthetic Data for Inclusive, Robust, Hand Pose Estimation
- DiffM3DH
Hand pose estimation plays a vital role in capturing subtle nonverbal cues essential for understanding human affect. However, collecting diverse, expressive real-world data remains challenging due to labor-intensive manual annotation that often underrepresents demographic diversity and natural expressions. To address this issue, we introduce a cost-effective approach to generating synthetic data using high-fidelity 3D hand models and a wide range of affective hand poses. Our method includes varied skin tones, genders, dynamic environments, realistic lighting conditions, and diverse naturally occurring gesture animations. The resulting dataset, Hi5, contains 583,000 pose-annotated images, carefully balanced to reflect natural diversity and emotional expressiveness. Models trained exclusively on Hi5 achieve performance comparable to human-annotated datasets, exhibiting superior robustness to occlusions and consistent accuracy across diverse skin tones -- which is crucial for reliably recognizing expressive gestures in affective computing applications. Our results demonstrate that synthetic data effectively addresses critical limitations of existing datasets, enabling more inclusive, expressive, and reliable gesture recognition systems while achieving competitive performance in pose estimation benchmarks. The Hi5 dataset, data synthesis pipeline, source code, and game engine project are publicly released to support further research in synthetic hand-gesture applications.
View on arXiv