An Efficient and Effective Transformer Decoder-Based Framework for
Multi-Task Visual Grounding. European Conference on Computer Vision (ECCV), 2024 |
Look Hear: Gaze Prediction for Speech-directed Human Attention. European Conference on Computer Vision (ECCV), 2024 |
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model. Yuxuan Zhang, Tianheng Cheng, Lianghui Zhu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang |
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances,
and Future Directions. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2024 |