Advanced Gesture Recognition for Autism Spectrum Disorder Detection: Integrating YOLOv7, Video Augmentation, and VideoMAE for Naturalistic Video Analysis
Deep learning and contactless sensing technologies have significantly advanced the automated assessment of human behaviors in healthcare. In the context of autism spectrum disorder (ASD), repetitive motor behaviors such as spinning, head banging, and arm flapping are key indicators for diagnosis. This study focuses on distinguishing between children with ASD and typically developed (TD) peers by analyzing videos captured in natural, uncontrolled environments. Using the publicly available Self-Stimulatory Behavior Dataset (SSBD), we address the classification task as a binary problem, ASD vs. TD, based on stereotypical repetitive gestures. We adopt a pipeline integrating YOLOv7-based detection, extensive video augmentations, and the VideoMAE framework, which efficiently captures both spatial and temporal features through a high-ratio masking and reconstruction strategy. Our proposed approach achieves 95% accuracy, 0.93 precision, 0.94 recall, and 0.94 F1 score, surpassing the previous state-of-the-art by a significant margin. These results demonstrate the effectiveness of combining advanced object detection, robust data augmentation, and masked autoencoder-based video modeling for reliable ASD vs. TD classification in naturalistic settings.
View on arXiv