UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models

Previous studies on robotic manipulation rely on a limited understanding of the underlying 3D motion constraints and affordances. To address these challenges, we propose a comprehensive paradigm, termed UniAff, that integrates 3D object-centric manipulation and task understanding in a unified formulation. Specifically, we construct a dataset labeled with manipulation-related key attributes, comprising 900 articulated objects from 19 categories and 600 tools from 12 categories. Furthermore, we leverage MLLMs to infer object-centric representations for manipulation tasks, including affordance recognition and reasoning about 3D motion constraints. Comprehensive experiments in both simulation and real-world settings indicate that UniAff significantly improves the generalization of robotic manipulation for tools and articulated objects. We hope that UniAff will serve as a general baseline for unified robotic manipulation tasks in the future. Images, videos, dataset, and code are published on the project website at: this https URL
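
To make the "unified formulation" concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of what an object-centric record combining an affordance region, a part pose, and a 3D motion constraint might look like; all class and field names here are illustrative assumptions, not part of the released dataset or code.

# Hypothetical sketch of a unified object-centric manipulation record,
# covering both tools and articulated objects as described in the abstract.
from dataclasses import dataclass
from typing import Literal

import numpy as np


@dataclass
class MotionConstraint:
    """3D motion constraint of a functional part (assumed parameterization)."""
    joint_type: Literal["revolute", "prismatic", "fixed"]
    axis: np.ndarray        # unit direction of rotation/translation, shape (3,)
    origin: np.ndarray      # a point the axis passes through, shape (3,)


@dataclass
class ObjectAffordance:
    """Unified per-object record for a tool or an articulated-object part."""
    category: str                   # e.g. "hammer", "cabinet_drawer"
    affordance_mask: np.ndarray     # H x W binary mask of the functional region
    part_pose: np.ndarray           # 4 x 4 homogeneous pose of the functional part
    constraint: MotionConstraint


def example_drawer(h: int = 480, w: int = 640) -> ObjectAffordance:
    """Build a toy record for a prismatic drawer to show how the fields fit together."""
    mask = np.zeros((h, w), dtype=bool)
    mask[200:260, 300:380] = True   # pretend this is the detected handle region
    return ObjectAffordance(
        category="cabinet_drawer",
        affordance_mask=mask,
        part_pose=np.eye(4),
        constraint=MotionConstraint(
            joint_type="prismatic",
            axis=np.array([0.0, 0.0, 1.0]),   # drawer slides along its local z-axis
            origin=np.zeros(3),
        ),
    )


if __name__ == "__main__":
    rec = example_drawer()
    print(rec.category, rec.constraint.joint_type, rec.constraint.axis)

In such a scheme, an MLLM's output (affordance region plus joint type, axis, and origin) would be parsed into one record per functional part, which a downstream planner could then use to generate constraint-consistent trajectories.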
@article{yu2025_2409.20551,
  title={UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models},
  author={Qiaojun Yu and Siyuan Huang and Xibin Yuan and Zhengkai Jiang and Ce Hao and Xin Li and Haonan Chang and Junbo Wang and Liu Liu and Hongsheng Li and Peng Gao and Cewu Lu},
  journal={arXiv preprint arXiv:2409.20551},
  year={2025}
}