Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision
We propose a CNN-based approach for 3D human body pose estimation from single RGB images that addresses the limited generalizability of models trained solely on the starkly limited publicly available 3D pose data. We propose novel CNN supervision techniques that use a regularization structure during training, extending the concept of multi-level skip connections, and that leverage first- and second-order parent relationships along the skeletal kinematic tree to learn better representations. We introduce a new training set for human body pose estimation from monocular images of real humans, with ground truth captured by a multi-camera marker-less motion capture system. It complements existing corpora with greater diversity in pose, human appearance, clothing, occlusion, and viewpoints, and enables an increased scope of augmentation. We also contribute a new benchmark that covers outdoor and indoor scenes. We further combine these contributions with transfer learning from 2D human pose prediction to achieve even better generalization, and improve over the state of the art on standard benchmarks by more than 25%. We argue that the use of transfer learning of representations in tandem with algorithmic and data contributions is crucial for general progress along many different dimensions of the problem.
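As an illustration of what first- and second-order parent relationships along a kinematic tree could look like as regression targets, the following is a minimal sketch. It is not taken from the paper: the six-joint skeleton, the `PARENT` table, and the helper names are all hypothetical, and the abstract does not state how the relative poses are actually formulated or supervised.

```python
# Hypothetical 6-joint skeleton; joint names and layout are illustrative only.
JOINTS = ["pelvis", "spine", "neck", "head", "shoulder", "elbow"]
PARENT = [-1, 0, 1, 2, 2, 4]  # index of each joint's parent; -1 marks the root


def sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def ancestor(joint, order):
    """Walk `order` steps up the kinematic tree, stopping early at the root."""
    for _ in range(order):
        if PARENT[joint] == -1:
            break
        joint = PARENT[joint]
    return joint


def relative_targets(pose, order):
    """Express every joint relative to its `order`-th ancestor."""
    return [sub(pose[j], pose[ancestor(j, order)]) for j in range(len(pose))]


# Assumed root-relative 3D joint positions (arbitrary units).
pose = [
    (0.0, 0.0, 0.0),    # pelvis (root)
    (0.0, 0.5, 0.0),    # spine
    (0.0, 1.0, 0.0),    # neck
    (0.0, 1.25, 0.0),   # head
    (0.25, 1.0, 0.0),   # shoulder
    (0.5, 0.75, 0.0),   # elbow
]

first_order = relative_targets(pose, 1)   # joint relative to its parent
second_order = relative_targets(pose, 2)  # joint relative to its grandparent
```

In this sketch the elbow is expressed relative to the shoulder in `first_order` and relative to the neck in `second_order`; joints with fewer ancestors than the requested order fall back to the root.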