Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding

Referring expression comprehension (REC) aims to localize objects described by natural language expressions. However, existing REC approaches are constrained to object-category descriptions and single-attribute intention descriptions, which hinders their application in real-world scenarios. In natural human-robot interactions, users often express their desires through individual states and intentions, accompanied by guiding gestures, rather than detailed object descriptions. To address this challenge, we propose Multi-ref EC, a novel task framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects. We introduce the State-Intention-Gesture Attributes Reference (SIGAR) dataset, which combines state and intention expressions with embodied references. Extensive experiments with various baseline models on SIGAR demonstrate that properly ordered multi-attribute references improve localization performance, and that single-attribute references are insufficient for natural human-robot interaction scenarios. Our findings underscore the importance of multi-attribute reference expressions in advancing visual-language understanding.
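To make the task setup concrete, the following is a minimal, purely hypothetical Python sketch of how a multi-attribute reference sample might be represented. The class names, fields, and the compose_query helper are illustrative assumptions, not the actual SIGAR schema or annotation format.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MultiAttributeReference:
    # Hypothetical fields mirroring the three attribute types in the abstract.
    state: str                                            # e.g. "I am thirsty"
    intention: str                                        # e.g. "I want something to drink"
    gesture_point: Optional[Tuple[float, float]] = None   # normalized (x, y) pointing location, if present

@dataclass
class GroundingSample:
    image_path: str
    reference: MultiAttributeReference
    target_bbox: Tuple[float, float, float, float]        # (x1, y1, x2, y2) ground-truth box

def compose_query(ref: MultiAttributeReference) -> str:
    # Compose attributes in a fixed state -> intention -> gesture order,
    # since the paper reports that attribute ordering affects localization.
    parts = [ref.state, ref.intention]
    if ref.gesture_point is not None:
        parts.append(f"<gesture at {ref.gesture_point}>")
    return " ".join(parts)

A sample built this way would pair the composed query with the image and ground-truth box for training or evaluating a grounding model; the fixed attribute ordering in compose_query is one possible design choice, not the paper's prescribed one.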
@article{guo2025_2503.19240,
  title={Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding},
  author={Hao Guo and Jianfei Zhu and Wei Fan and Chunzhi Yi and Feng Jiang},
  journal={arXiv preprint arXiv:2503.19240},
  year={2025}
}