v1v2v3 (latest)

Generation and Comprehension of Unambiguous Object Descriptions

7 November 2015

ArXiv (abs)PDF HTML Github (164★)

Papers citing "Generation and Comprehension of Unambiguous Object Descriptions"

50 / 919 papers shown

MEEL: Multi-Modal Event Evolution Learning

174

16 Apr 2024

TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding

244

15 Apr 2024

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Chen Chen

...

285

11 Apr 2024

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

Graham Neubig

289

09 Apr 2024

CoReS: Orchestrating the Dance of Reasoning and Segmentation

354

08 Apr 2024

Hyperbolic Learning with Synthetic Captions for Open-World Detection

220

07 Apr 2024

Decoupling Static and Hierarchical Motion Perception for Referring Video SegmentationComputer Vision and Pattern Recognition (CVPR), 2024

Shuting He

Henghui Ding

VOS

288

04 Apr 2024

Cross-Modal Conditioned Reconstruction for Language-guided Medical Image SegmentationIEEE Transactions on Medical Imaging (IEEE TMI), 2024

266

03 Apr 2024

Text-driven Affordance Learning from Egocentric Vision

286

03 Apr 2024

mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

Jingxuan Wei

Nan Xu

Guiyong Chang

Yin Luo

Bihui Yu

Ruifeng Guo

212

02 Apr 2024

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

Dahua Lin

339

01 Apr 2024

Deep Instruction Tuning for Segment Anything Model

Chaoyang Zhu

332

31 Mar 2024

M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

Tiejun Huang

190

100

31 Mar 2024

TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

245

30 Mar 2024

Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models

Hidetaka Kamigaito

Taro Watanabe

244

29 Mar 2024

J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution

Sadao Kurohashi

185

28 Mar 2024

Toward Interactive Regional Understanding in Vision-Large Language Models

305

27 Mar 2024

ReMamber: Referring Image Segmentation with Mamba Twister

Jiangchao Yao

267

26 Mar 2024

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Han Xiao

346

216

25 Mar 2024

Elysium: Exploring Object-level Perception in Videos via MLLM

321

25 Mar 2024

Empowering Segmentation Ability to Multi-modal Large Language Models

215

21 Mar 2024

DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Wanli Ouyang

333

19 Mar 2024

GiT: Towards Generalist Vision Transformer through Universal Language InterfaceEuropean Conference on Computer Vision (ECCV), 2024

Muhammad Ferjad Naeem

Jiaming Song

Bernt Schiele

Liwei Wang

VLM

279

14 Mar 2024

Rethinking Referring Object Removal

203

14 Mar 2024

CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language ModelNeural Information Processing Systems (NeurIPS), 2024

Lianli Gao

Jingkuan Song

CLL

197

13 Mar 2024

DeepSeek-VL: Towards Real-World Vision-Language Understanding

...

Chengqi Deng

460

647

08 Mar 2024

Multimodal Infusion Tuning for Large Models

Yen-Wei Chen

342

08 Mar 2024

$$\text{R}^2$-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations$

\text{R}^2

-Bench: Benchmarking the Robustness of Referring Perception Models under PerturbationsEuropean Conference on Computer Vision (ECCV), 2024

Xiang Li

Hao Chen

Bhiksha Raj

189

07 Mar 2024

Adaptive Task Balancing for Visual Instruction Tuning via Inter-Task Contribution and Intra-Task Difficulty

Xiangxiang Chu

Zhiwu Lu

341

07 Mar 2024

VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing

Cheng Li

218

05 Mar 2024

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

David Wan

Jaemin Cho

Elias Stengel-Eskin

Mohit Bansal

VLM ObjD

316

04 Mar 2024

Zero-shot Generalizable Incremental Learning for Vision-Language Object Detection

553

04 Mar 2024

Non-autoregressive Sequence-to-Sequence Vision-Language Models

335

04 Mar 2024

Adversarial Testing for Visual Grounding via Image-Aware Property Reduction

Cheng Li

240

02 Mar 2024

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

...

Yu Qiao

318

29 Feb 2024

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

376

26 Feb 2024

LLMBind: A Unified Modality-Task Integration Framework

Bin Lin

...

Xing Zhou

271

22 Feb 2024

WinoViz: Probing Visual Properties of Objects Under Different States

Woojeong Jin

Tejas Srinivasan

Jesse Thomason

Xiang Ren

244

21 Feb 2024

The Revolution of Multimodal Large Language Models: A Survey

Lorenzo Baraldi

359

122

19 Feb 2024

Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions

Yisi Zhang

Jing Liu

269

17 Feb 2024

LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

399

15 Feb 2024

DoRA: Weight-Decomposed Low-Rank Adaptation

769

676

14 Feb 2024

Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks

156

13 Feb 2024

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

...

Yu Qiao

520

139

08 Feb 2024

CogCoM: A Visual Language Model with Chain-of-Manipulations ReasoningInternational Conference on Learning Representations (ICLR), 2024

Ji Qi

...

Bin Xu

Lei Hou

Juanzi Li

Yuxiao Dong

Jie Tang

VLM LRM

243

06 Feb 2024

Generalizable Entity Grounding via Assistance of Large Language Model

Xiangtai Li

Ming-Hsuan Yang

252

04 Feb 2024

Can MLLMs Perform Text-to-Image In-Context Learning?

263

02 Feb 2024

ChatterBox: Multi-round Multimodal Referring and Grounding

213

24 Jan 2024

Collaborative Position Reasoning Network for Referring Image Segmentation

Jingdong Wang

299

22 Jan 2024

Unifying Visual and Vision-Language Tracking via Contrastive LearningAAAI Conference on Artificial Intelligence (AAAI), 2024

221

20 Jan 2024