KinMo: Kinematic-aware Human Motion Understanding and Generation

Current human motion synthesis frameworks rely on global action descriptions, creating a modality gap that limits both motion understanding and generation. A single coarse description, such as "run", fails to capture details like variations in speed, limb positioning, and kinematic dynamics, leading to ambiguity between the text and motion modalities. To address this challenge, we introduce KinMo, a unified framework built on a hierarchical, describable motion representation that extends beyond global actions by incorporating kinematic group movements and their interactions. We design an automated annotation pipeline that generates high-quality, fine-grained descriptions for this decomposition, resulting in the KinMo dataset. To leverage these structured descriptions, we propose Hierarchical Text-Motion Alignment, which improves spatial understanding by integrating the additional motion details. Furthermore, we introduce a coarse-to-fine generation procedure that exploits this enhanced spatial understanding to improve motion synthesis. Experimental results show that KinMo significantly improves motion understanding, as demonstrated by stronger text-motion retrieval performance, and enables finer-grained motion generation and editing. Project Page: this https URL
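The abstract does not specify how Hierarchical Text-Motion Alignment is implemented. Below is a minimal PyTorch sketch of one plausible reading: a symmetric InfoNCE-style contrastive objective applied at each level of the described hierarchy (global action, kinematic-group movements, group interactions). The level names, loss weights, and embedding shapes are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def info_nce(motion_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE between a batch of motion and text embeddings.
    motion_emb = F.normalize(motion_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = motion_emb @ text_emb.t() / temperature  # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def hierarchical_alignment_loss(motion_embs, text_embs, weights=(1.0, 0.5, 0.5)):
    # Align motion and text at three assumed hierarchy levels:
    # global action, kinematic-group movements, and group interactions.
    # motion_embs / text_embs are dicts mapping level name -> (B, D) tensor.
    levels = ("global", "group", "interaction")
    return sum(w * info_nce(motion_embs[k], text_embs[k])
               for w, k in zip(weights, levels))

# Toy usage: random features stand in for motion/text encoder outputs.
B, D = 8, 256
motion = {k: torch.randn(B, D) for k in ("global", "group", "interaction")}
text = {k: torch.randn(B, D) for k in ("global", "group", "interaction")}
print(hierarchical_alignment_loss(motion, text).item())

Weighting the global level highest mirrors the paper's coarse-to-fine framing, where fine-grained group descriptions refine, rather than replace, the global action signal; the actual weighting scheme is an assumption here.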
@article{zhang2025_2411.15472,
  title   = {KinMo: Kinematic-aware Human Motion Understanding and Generation},
  author  = {Pengfei Zhang and Pinxin Liu and Hyeongwoo Kim and Pablo Garrido and Bindita Chaudhuri},
  journal = {arXiv preprint arXiv:2411.15472},
  year    = {2025}
}