Automatic infant 2D pose estimation from videos: comparing seven deep neural network methods

Automatic markerless estimation of infant posture and motion from ordinary videos carries great potential for movement studies "in the wild", facilitating understanding of motor development and massively increasing the chances of early diagnosis of disorders. Human pose estimation methods in computer vision are developing rapidly, driven by advances in deep learning. However, these methods are trained on datasets that feature adults in different contexts. This work tests and compares seven popular methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet, MediaPipe/BlazePose, OpenPose, and ViTPose) on videos of infants in supine position and in more complex settings. Surprisingly, all methods except DeepLabCut and MediaPipe have competitive performance without additional finetuning, with ViTPose performing best. Next to standard performance metrics (average precision and recall), we introduce errors expressed in units of the neck-mid-hip distance (torso length), and we additionally study missed and redundant detections and the reliability of the methods' internal confidence ratings, which are relevant for downstream tasks. Among the networks with competitive performance, only AlphaPose could run close to real time (27 fps) on our machine. We provide documented Docker containers or instructions for all the methods we used, our analysis scripts, and the processed data at this https URL and this https URL.
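The torso-length normalization mentioned above can be made concrete with a minimal sketch (not the authors' code; the function name, array conventions, and keypoint indices are illustrative assumptions, since the exact joint indices depend on the skeleton format each method outputs):

```python
import numpy as np

def torso_normalized_errors(pred, gt, neck_idx, lhip_idx, rhip_idx):
    """Per-keypoint Euclidean error expressed in torso-length units.

    pred, gt: (K, 2) arrays of predicted / ground-truth (x, y) keypoints.
    The torso length is taken as the distance from the neck to the
    midpoint of the two hips, computed on the ground truth.
    """
    mid_hip = (gt[lhip_idx] + gt[rhip_idx]) / 2.0          # mid-hip point
    torso_len = np.linalg.norm(gt[neck_idx] - mid_hip)     # neck-mid-hip distance
    errors = np.linalg.norm(pred - gt, axis=1)             # pixel errors per keypoint
    return errors / torso_len                              # dimensionless ratios
```

For example, with OpenPose's BODY_25 skeleton the call might be `torso_normalized_errors(pred, gt, neck_idx=1, lhip_idx=12, rhip_idx=9)`; other formats (e.g., COCO, which has no explicit neck joint) would need the neck approximated, e.g., as the shoulder midpoint. Normalizing by torso length makes errors comparable across infants of different sizes and across camera distances.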
@article{gama2025_2406.17382,
  title   = {Automatic infant 2D pose estimation from videos: comparing seven deep neural network methods},
  author  = {Filipe Gama and Matej Misar and Lukas Navara and Sergiu T. Popescu and Matej Hoffmann},
  journal = {arXiv preprint arXiv:2406.17382},
  year    = {2025}
}