ResearchTrend.AI
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
arXiv:2312.02949 · 5 December 2023
Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chun-yue Li, Jianwei Yang
ArXiv (abs) · PDF · HTML · HuggingFace (15 upvotes) · GitHub (400★)

Papers citing "LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models"

49 papers shown
Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark
Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, XueQing Deng, Henghui Ding, Lu Qi, Anran Wang, X. Li, Ming-Hsuan Yang
ReLM, LRM · 04 Dec 2025
Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Xin Gu, H. Zhang, Qihang Fan, Jingxuan Niu, Zhipeng Zhang, Libo Zhang, G. Chen, Fan Chen, Longyin Wen, Sijie Zhu
AI4TS, LRM · 26 Nov 2025
Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting
Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu, Xingze Zou, Jing Wang, Haoji Hu
3DGS · 17 Nov 2025
ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
30 Oct 2025
MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Gabriel Fiastre, Antoine Yang, Cordelia Schmid
VOS · 16 Oct 2025
Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Ernesto Gabriel Hernández Montoya, ..., Bin Hu, Yunzhong He, Bing Liu, Rakshith S Srinivasa
VLM, LRM · 14 Oct 2025
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
Peng Liu, H. Shen, Chunxin Fang, Zhicheng Sun, Jiajia Liao, T. Zhao
MLLM, ObjD, VLM, LRM · 30 Sep 2025
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, ..., Guoli Jia, Lingling Li, Z. Lu, Y. Lu, Wenhan Luo
LRM · 29 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos, Eda B. Özyiğit
ObjD · 12 Sep 2025
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
Honglu Zhou, Xiangyu Peng, Shrikant B. Kendre, Michael S Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
03 Sep 2025
PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
Mennatullah Siam
VGen · 02 Sep 2025
ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following
Seungmin Han, Haeun Kwon, Ji-jun Park, Taeyang Yoon
LRM · 21 Aug 2025
MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding
Weifan Zhang, Tingguang Li, Yuzhen Liu
LM&Ro · 07 Aug 2025
Fine-grained Spatiotemporal Grounding on Egocentric Videos
Shuo Liang, Yiwu Zhong, Zi-Yuan Hu, Yeyao Tao, Liwei Wang
EgoV · 01 Aug 2025
LMM-Det: Make Large Multimodal Models Excel in Object Detection
Jincheng Li, Chunyu Xie, Ji Ao, Dawei Leng, Yuhui Yin
MLLM, ObjD, VLM · 24 Jul 2025
InsightX Agent: An LMM-based Agentic Framework with Integrated Tools for Reliable X-ray NDT Analysis
Jiale Liu, Huan Wang, Yue Zhang, Xiaoyu Luo, Jiaxiang Hu, Zhiliang Liu, Min Xie
LLMAG, AI4CE · 20 Jul 2025
Constrained Diffusion Models for Synthesizing Representative Power Flow Datasets
Milad Hoseinpour, Vladimir Dvorkin
DiffM, MedIm · 12 Jun 2025
Synthetic Visual Genome
Computer Vision and Pattern Recognition (CVPR), 2025
J. S. Park, Zixian Ma, Linjie Li, Chenhao Zheng, Cheng-Yu Hsieh, ..., Quan Kong, Norimasa Kobori, Ali Farhadi, Yejin Choi, Ranjay Krishna
09 Jun 2025
Perceiving Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models
Aarti Ghatkesar, Uddeshya Upadhyay
VLM · 08 May 2025
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng
VLM · 24 Apr 2025
Domain-Conditioned Scene Graphs for State-Grounded Task Planning
Jonas Herzog, Jiangpin Liu, Yue Wang
LM&Ro · 09 Apr 2025
Multimodal Reference Visual Grounding
Yangxiao Lu, Ruosen Li, Liqiang Jing, Jikai Wang, Xinya Du, Yunhui Guo, Nicholas Ruozzi, Yu Xiang
ObjD · 02 Apr 2025
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2025
Alexander Vogel, Omar Moured, Yufan Chen, Kailai Li, Rainer Stiefelhagen
29 Mar 2025
Large-scale Pre-training for Grounded Video Caption Generation
Evangelos Kazakos, Cordelia Schmid, Josef Sivic
13 Mar 2025
ProAPO: Progressively Automatic Prompt Optimization for Visual Classification
Computer Vision and Pattern Recognition (CVPR), 2025
Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, Gang Xiong
VLM · 13 Mar 2025
REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
Yan Tai, Luhao Zhu, Zhiqiang Chen, Ynan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo
MLLM, ObjD · 10 Mar 2025
New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
X. J. Yang, Jing Liu, Peng Wang, Guoqing Wang, Yue Yang, Mengqi Li
ObjD · 27 Feb 2025
CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications
Anton Alyakin, Jaden Stryker, Daniel Alber, Karl L. Sangwon, Brandon Duderstadt, ..., Laura Snyder, Eric Leuthardt, Douglas Kondziolka, E. Oermann
26 Feb 2025
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, ..., Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Jiaming Song
MLLM, LRM · 13 Feb 2025
PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
Mennatullah Siam
VLM · 06 Feb 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan, Xianrui Li, Tao Zhang, Zilong Huang, Shilin Xu, ..., Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
VLM · 07 Jan 2025
Visual Large Language Models for Generalized and Specialized Applications
Jiayi Zhang, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, Yu Kong
VLM · 06 Jan 2025
Towards Visual Grounding: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Linhui Xiao, Xiaoshan Yang, X. Lan, Yaowei Wang, Changsheng Xu
ObjD · 28 Dec 2024
Aria-UI: Visual Grounding for GUI Instructions
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chenyu Huang, Junnan Li
LM&Ro, LLMAG · 20 Dec 2024
TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
Jinyuan Qu, Hongyang Li, Shilong Liu, Tianhe Ren, Zhaoyang Zeng, Lei Zhang
3DPC · 27 Nov 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang
VLM, LRM · 27 Nov 2024
DOGR: Towards Versatile Visual Document Grounding and Referring
Yinan Zhou, Yuxin Chen, Haokun Lin, Shuyu Yang, Li Zhu, Chen Ma, Mingyu Ding, Ying Shan
ObjD · 26 Nov 2024
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang
LLMAG · 25 Oct 2024
Zero-shot Action Localization via the Confidence of Large Vision-Language Models
Josiah Aklilu, Xiaohan Wang, Serena Yeung-Levy
18 Oct 2024
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu, Linchao Zhu, Yi Yang
16 Oct 2024
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Junzhuo Liu, Xiaohu Yang, Weiwei Li, Peng Wang
ObjD · 23 Sep 2024
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
Ming-Kuan Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Xiaoshuai Sun, Rongrong Ji
MLLM · 31 Jul 2024
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, J. Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
01 Jul 2024
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang
VLM, MLLM · 25 Jun 2024
Grounding Multimodal Large Language Models in Actions
Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Z. Kira, Alexander Toshev
LM&Ro · 12 Jun 2024
F-LMM: Grounding Frozen Large Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2024
Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy
MLLM · 09 Jun 2024
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You, Haotian Zhang, E. Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
MLLM · 08 Apr 2024
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guan-Feng Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, Hongliang Ren
22 Mar 2024
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang, Pei Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai-xiang Chen, Ping Luo
MLLM, VLM · 07 Jul 2023