ResearchTrend.AI
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
arXiv:2312.02949 · 5 December 2023
Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chun-yue Li, Jianwei Yang
ArXiv (abs) · PDF · HTML · HuggingFace (15 upvotes) · GitHub (400★)

Papers citing "LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models"

49 papers shown
Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark
Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, XueQing Deng, Henghui Ding, Lu Qi, Anran Wang, X. Li, Ming-Hsuan Yang
ReLM, LRM · 04 Dec 2025
Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Xin Gu, H. Zhang, Qihang Fan, Jingxuan Niu, Zhipeng Zhang, Libo Zhang, G. Chen, Fan Chen, Longyin Wen, Sijie Zhu
AI4TS, LRM · 26 Nov 2025
Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting
Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu, Xingze Zou, Jing Wang, Haoji Hu
3DGS · 17 Nov 2025
ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
30 Oct 2025
MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Gabriel Fiastre, Antoine Yang, Cordelia Schmid
VOS · 16 Oct 2025
Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Ernesto Gabriel Hernández Montoya, ..., Bin Hu, Yunzhong He, Bing Liu, Rakshith S Srinivasa
VLM, LRM · 14 Oct 2025
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
Peng Liu, H. Shen, Chunxin Fang, Zhicheng Sun, Jiajia Liao, T. Zhao
MLLM, ObjD, VLM, LRM · 30 Sep 2025
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, ..., Guoli Jia, Lingling Li, Z. Lu, Y. Lu, Wenhan Luo
LRM · 29 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos, Eda B. Özyiğit
ObjD · 12 Sep 2025
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
Honglu Zhou, Xiangyu Peng, Shrikant B. Kendre, Michael S Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
03 Sep 2025
PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
Mennatullah Siam
VGen · 02 Sep 2025
ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following
Seungmin Han, Haeun Kwon, Ji-jun Park, Taeyang Yoon
LRM · 21 Aug 2025
MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding
Weifan Zhang, Tingguang Li, Yuzhen Liu
LM&Ro · 07 Aug 2025
Fine-grained Spatiotemporal Grounding on Egocentric Videos
Shuo Liang, Yiwu Zhong, Zi-Yuan Hu, Yeyao Tao, Liwei Wang
EgoV · 01 Aug 2025
LMM-Det: Make Large Multimodal Models Excel in Object Detection
Jincheng Li, Chunyu Xie, Ji Ao, Dawei Leng, Yuhui Yin
MLLM, ObjD, VLM · 24 Jul 2025
InsightX Agent: An LMM-based Agentic Framework with Integrated Tools for Reliable X-ray NDT Analysis
Jiale Liu, Huan Wang, Yue Zhang, Xiaoyu Luo, Jiaxiang Hu, Zhiliang Liu, Min Xie
LLMAG, AI4CE · 20 Jul 2025
Constrained Diffusion Models for Synthesizing Representative Power Flow Datasets
Milad Hoseinpour, Vladimir Dvorkin
DiffM, MedIm · 12 Jun 2025
Synthetic Visual Genome
Computer Vision and Pattern Recognition (CVPR), 2025
J. S. Park, Zixian Ma, Linjie Li, Chenhao Zheng, Cheng-Yu Hsieh, ..., Quan Kong, Norimasa Kobori, Ali Farhadi, Yejin Choi, Ranjay Krishna
09 Jun 2025
Perceiving Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models
Aarti Ghatkesar, Uddeshya Upadhyay
VLM · 08 May 2025
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng
VLM · 24 Apr 2025
Domain-Conditioned Scene Graphs for State-Grounded Task Planning
Jonas Herzog, Jiangpin Liu, Yue Wang
LM&Ro · 09 Apr 2025
Multimodal Reference Visual Grounding
Yangxiao Lu, Ruosen Li, Liqiang Jing, Jikai Wang, Xinya Du, Yunhui Guo, Nicholas Ruozzi, Yu Xiang
ObjD · 02 Apr 2025
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2025
Alexander Vogel, Omar Moured, Yufan Chen, Kailai Li, Rainer Stiefelhagen
29 Mar 2025
Large-scale Pre-training for Grounded Video Caption Generation
Evangelos Kazakos, Cordelia Schmid, Josef Sivic
13 Mar 2025
ProAPO: Progressively Automatic Prompt Optimization for Visual Classification
Computer Vision and Pattern Recognition (CVPR), 2025
Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, Gang Xiong
VLM · 13 Mar 2025
REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
Yan Tai, Luhao Zhu, Zhiqiang Chen, Ynan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo
MLLM, ObjD · 10 Mar 2025
New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
X. J. Yang, Jing Liu, Peng Wang, Guoqing Wang, Yue Yang, Mengqi Li
ObjD · 27 Feb 2025
CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications
Anton Alyakin, Jaden Stryker, Daniel Alber, Karl L. Sangwon, Brandon Duderstadt, ..., Laura Snyder, Eric Leuthardt, Douglas Kondziolka, E. Oermann
26 Feb 2025
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, ..., Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Jiaming Song
MLLM, LRM · 13 Feb 2025
PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
Mennatullah Siam
VLM · 06 Feb 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan, Xianrui Li, Tao Zhang, Zilong Huang, Shilin Xu, ..., Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
VLM · 07 Jan 2025
Visual Large Language Models for Generalized and Specialized Applications
Jiayi Zhang, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, Yu Kong
VLM · 06 Jan 2025
Towards Visual Grounding: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Linhui Xiao, Xiaoshan Yang, X. Lan, Yaowei Wang, Changsheng Xu
ObjD · 28 Dec 2024
Aria-UI: Visual Grounding for GUI Instructions
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chenyu Huang, Junnan Li
LM&Ro, LLMAG · 20 Dec 2024
TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
Jinyuan Qu, Hongyang Li, Shilong Liu, Tianhe Ren, Zhaoyang Zeng, Lei Zhang
3DPC · 27 Nov 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang
VLM, LRM · 27 Nov 2024
DOGR: Towards Versatile Visual Document Grounding and Referring
Yinan Zhou, Yuxin Chen, Haokun Lin, Shuyu Yang, Li Zhu, Chen Ma, Mingyu Ding, Ying Shan
ObjD · 26 Nov 2024
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang
LLMAG · 25 Oct 2024
Zero-shot Action Localization via the Confidence of Large Vision-Language Models
Josiah Aklilu, Xiaohan Wang, Serena Yeung-Levy
18 Oct 2024
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu, Linchao Zhu, Yi Yang
16 Oct 2024
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Junzhuo Liu, Xiaohu Yang, Weiwei Li, Peng Wang
ObjD · 23 Sep 2024
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
Ming-Kuan Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Xiaoshuai Sun, Rongrong Ji
MLLM · 31 Jul 2024
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, J. Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
01 Jul 2024
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang
VLM, MLLM · 25 Jun 2024
Grounding Multimodal Large Language Models in Actions
Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Z. Kira, Alexander Toshev
LM&Ro · 12 Jun 2024
F-LMM: Grounding Frozen Large Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2024
Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy
MLLM · 09 Jun 2024
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You, Haotian Zhang, E. Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
MLLM · 08 Apr 2024
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guan-Feng Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, Hongliang Ren
22 Mar 2024
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang, Pei Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai-xiang Chen, Ping Luo
MLLM, VLM · 07 Jul 2023