Generation and Comprehension of Unambiguous Object Descriptions

7 November 2015

Papers citing "Generation and Comprehension of Unambiguous Object Descriptions"

50 / 919 papers shown

DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation

149

03 Dec 2025

Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models

267

02 Dec 2025

Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction

Jiazhen Liu

Mingkuan Feng

Long Chen

29 Nov 2025

Qwen3-VL Technical Report

...

1.6K

26 Nov 2025

LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

121

25 Nov 2025

Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving

...

157

24 Nov 2025

Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning

132

24 Nov 2025

Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

Mark Endo

Serena Yeung-Levy

LRM

238

21 Nov 2025

VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

238

20 Nov 2025

LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

152

15 Nov 2025

Fast Reasoning Segmentation for Images and Videos

Yiqing Shen

Mathias Unberath

VLM LRM

156

15 Nov 2025

NOVO: Bridging LLaVA and SAM with Visual-only Prompts for Reasoning Segmentation

Kyung-Yoon Yoon

Yeong-Jun Cho

MLLM VLM

421

10 Nov 2025

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

...

256

04 Nov 2025

UniSOT: A Unified Framework for Multi-Modality Single Object Tracking

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025

227

03 Nov 2025

LongCat-Flash-Omni Technical Report

...

587

31 Oct 2025

Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

...

349

28 Oct 2025

PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

362

27 Oct 2025

FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

144

24 Oct 2025

ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

23 Oct 2025

Vision-Centric Activation and Coordination for Multimodal Large Language Models

359

16 Oct 2025

Spatial Preference Rewarding for MLLMs Spatial Understanding

134

16 Oct 2025

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

156

16 Oct 2025

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

...

177

15 Oct 2025

Detect Anything via Next Point Prediction

211

14 Oct 2025

CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation

117

13 Oct 2025

Unified Open-World Segmentation with Multi-Modal Prompts

106

12 Oct 2025

Vision Language Models: A Survey of 26K Papers

Fengming Lin

3DV VLM

134

10 Oct 2025

Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding

226

10 Oct 2025

LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation

210

09 Oct 2025

Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

199

08 Oct 2025

Referring Expression Comprehension for Small Objects

146

04 Oct 2025

UGround: Towards Unified Visual Grounding with Unrolled Transformers

161

04 Oct 2025

CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning

196

03 Oct 2025

Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

...

178

02 Oct 2025

PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset

108

01 Oct 2025

VIRTUE: Visual-Interactive Text-Image Universal Embedder

146

01 Oct 2025

VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

214

30 Sep 2025

Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking

152

30 Sep 2025

Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding

136

30 Sep 2025

GroundSight: Augmenting Vision-Language Models with Grounding Information and De-hallucination

30 Sep 2025

ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation

112

28 Sep 2025

CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP

160

27 Sep 2025

MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning

137

26 Sep 2025

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

213

25 Sep 2025

GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions

175

25 Sep 2025

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

375

22 Sep 2025

The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA

284

21 Sep 2025

Robust Object Detection for Autonomous Driving via Curriculum-Guided Group Relative Policy Optimization

Xu Jia

133

19 Sep 2025

Re-purposing SAM into Efficient Visual Projectors for MLLM-Based Referring Image Segmentation

Xiaobo Yang

Xiaojin Gong

VLM

119

17 Sep 2025

Improving Generalized Visual Grounding with Instance-aware Joint Learning

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025

255

17 Sep 2025