Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2307.03601
Cited By
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
7 July 2023
Shilong Zhang
Pei Sun
Shoufa Chen
Min Xiao
Wenqi Shao
Wenwei Zhang
Yu Liu
Kai-xiang Chen
Ping Luo
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest"
50 / 199 papers shown
Title
RESAnything: Attribute Prompting for Arbitrary Referring Segmentation
Ruiqi Wang
Hao Zhang
VLM
44
0
0
03 May 2025
SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding
Qianqian Sun
Jixiang Luo
Dell Zhang
Xuelong Li
DiffM
44
0
0
17 Apr 2025
EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery
Wei Zhang
Miaoxin Cai
Yaqian Ning
T. Zhang
Yin Zhuang
He Chen
Jun Li
Xuerui Mao
36
0
0
17 Apr 2025
Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization
Darryl Hannan
John Cooper
Dylan White
Timothy Doster
Henry Kvinge
Y. Watkins
24
0
0
14 Apr 2025
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Tommaso Galliena
Tommaso Apicella
Stefano Rosa
Pietro Morerio
Alessio Del Bue
Lorenzo Natale
22
0
0
11 Apr 2025
URECA: Unique Region Caption Anything
Sangbeom Lim
J. Kim
Heeji Yoon
Jaewoo Jung
Seungryong Kim
24
0
0
07 Apr 2025
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Sai Kumar Dwivedi
Dimitrije Antić
Shashank Tripathi
Omid Taheri
Cordelia Schmid
M. Black
Dimitrios Tzionas
18
1
0
07 Apr 2025
Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li
Ruiyi Zhang
Jian Chen
Jiuxiang Gu
Yufan Zhou
Franck Dernoncourt
Wanrong Zhu
Tianyi Zhou
Tong Sun
18
2
0
07 Apr 2025
On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
Bosung Kim
Kyuhwan Lee
Isu Jeong
Jungmin Cheon
Yeojin Lee
Seulki Lee
VGen
40
1
0
31 Mar 2025
Towards Understanding How Knowledge Evolves in Large Vision-Language Models
Sudong Wang
Y. Zhang
Yao Zhu
Jianing Li
Zizhe Wang
Y. Liu
Xiangyang Ji
29
0
0
31 Mar 2025
MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
Donggon Jang
Yucheol Cho
Suin Lee
Taehyeon Kim
Dae-Shik Kim
VLM
59
1
0
18 Mar 2025
ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models
Hao Yin
Guangzong Si
Zilei Wang
39
0
0
17 Mar 2025
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Erik Daxberger
Nina Wenzel
David Griffiths
Haiming Gang
Justin Lazarow
...
Kai Kang
Marcin Eichner
Y. Yang
Afshin Dehghan
Peter Grasch
72
2
0
17 Mar 2025
Long-Video Audio Synthesis with Multi-Agent Collaboration
Yehang Zhang
Xinli Xu
Xiaojie Xu
L. Liu
Y. Chen
DiffM
VGen
42
0
0
13 Mar 2025
Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model
Yufan Chen
Ching Ting Leung
Jianwei Sun
Yong Huang
Linyan Li
Hao Chen
Hanyu Gao
52
0
0
11 Mar 2025
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang
Jihao Qiu
Lingxi Xie
Yunjie Tian
Jianbin Jiao
Qixiang Ye
72
0
0
28 Feb 2025
Magma: A Foundation Model for Multimodal AI Agents
Jianwei Yang
Reuben Tan
Qianhui Wu
Ruijie Zheng
Baolin Peng
...
Seonghyeon Ye
Joel Jang
Yuquan Deng
Lars Liden
Jianfeng Gao
VLM
AI4TS
89
8
0
18 Feb 2025
Pixel-Level Reasoning Segmentation via Multi-turn Conversations
Dexian Cai
Xiaocui Yang
Yongkang Liu
Daling Wang
Shi Feng
Yifei Zhang
Soujanya Poria
LRM
72
0
0
13 Feb 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD
VLM
100
2
0
14 Jan 2025
Visual Large Language Models for Generalized and Specialized Applications
Yifan Li
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
83
10
0
06 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
86
45
0
03 Jan 2025
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
41
5
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
H. Zhang
Tat-Seng Chua
Shuicheng Yan
51
35
0
31 Dec 2024
Personalized Large Vision-Language Models
Chau Pham
Hoang Phan
David Doermann
Yunjie Tian
VLM
39
3
0
23 Dec 2024
Aria-UI: Visual Grounding for GUI Instructions
Yuhao Yang
Yue Wang
Dongxu Li
Ziyang Luo
Bei Chen
C. Huang
Junnan Li
LM&Ro
LLMAG
92
14
0
20 Dec 2024
Next Patch Prediction for Autoregressive Visual Generation
Yatian Pang
Peng Jin
Shuo Yang
Bin Lin
Bin Zhu
...
Liuhan Chen
Francis E. H. Tay
Ser-Nam Lim
Harry Yang
Li Yuan
109
8
0
19 Dec 2024
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
H. Wang
Yuxiang Nie
Yongjie Ye
Deng GuanYu
Yanjie Wang
Shuai Li
Haiyang Yu
Jinghui Lu
Can Huang
VLM
MLLM
72
1
0
12 Dec 2024
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Ao Wang
Hui Chen
Jianchao Tan
K. Zhang
Xunliang Cai
Zijia Lin
J. Han
Guiguang Ding
VLM
72
3
0
04 Dec 2024
SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection
Joongwon Chae
Zhenyu Wang
Peiwu Qin
VLM
77
0
0
03 Dec 2024
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
Chunlin Yu
Hanqing Wang
Ye Shi
Haoyang Luo
Sibei Yang
Jingyi Yu
Jingya Wang
LRM
LM&Ro
74
1
0
02 Dec 2024
LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Huadong Tang
Youpeng Zhao
Y. Huang
Min Xu
Jun Wang
Qiang Wu
MLLM
VLM
73
0
0
30 Nov 2024
ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model
Kunyang Han
Yibo Hu
Mengxue Qu
Hailin Shi
Yao Zhao
Y. X. Wei
MLLM
VLM
3DV
72
1
0
29 Nov 2024
Detailed Object Description with Controllable Dimensions
Xinran Wang
H. Zhang
Baoteng Li
Kongming Liang
Hao Sun
Zhongjiang He
Z. Ma
Jun Guo
76
0
0
28 Nov 2024
Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models
J. Liu
Yumeng Li
Boyuan Xiao
Yichang Jian
Ziang Qin
Tianjia Shao
Yao-Xiang Ding
Kun Zhou
MLLM
LRM
86
2
0
27 Nov 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang
Gen Luo
Yuqin Yang
Yuda Xiong
Yihao Chen
Zhaoyang Zeng
Tianhe Ren
Lei Zhang
VLM
LRM
84
6
0
27 Nov 2024
freePruner: A Training-free Approach for Large Multimodal Model Acceleration
Bingxin Xu
Yuzhang Shang
Yunhao Ge
Qian Lou
Yan Yan
81
2
0
23 Nov 2024
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Hang Hua
Qing Liu
Lingzhi Zhang
Jing Shi
Zhifei Zhang
Yilin Wang
Jianming Zhang
Jiebo Luo
CoGe
VLM
79
6
0
23 Nov 2024
Large Language Model with Region-guided Referring and Grounding for CT Report Generation
Z. Chen
Yequan Bie
Haibo Jin
Hao Chen
81
0
0
23 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Z. Li
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
24
1
0
14 Nov 2024
KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension
Jie-jin Yang
Wang Zeng
Sheng Jin
Lumin Xu
Wentao Liu
Chen Qian
Ruimao Zhang
MLLM
49
2
0
04 Nov 2024
FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis
N. V. R. Chappa
P. Dobbs
Bhiksha Raj
Khoa Luu
21
3
0
25 Oct 2024
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
Xuetian Chen
Hangcheng Li
Jiaqing Liang
Sihang Jiang
Deqing Yang
LLMAG
43
2
0
25 Oct 2024
SegLLM: Multi-round Reasoning Segmentation
XuDong Wang
Shaolun Zhang
Shufan Li
Konstantinos Kallidromitis
Kehan Li
Yusuke Kato
Kazuki Kozuka
Trevor Darrell
VLM
LRM
42
1
0
24 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
43
1
0
21 Oct 2024
Mitigating Object Hallucination via Concentric Causal Attention
Yun Xing
Yiheng Li
Ivan Laptev
Shijian Lu
36
17
0
21 Oct 2024
LocateBench: Evaluating the Locating Ability of Vision Language Models
Ting-Rui Chiang
Joshua Robinson
Xinyan Velocity Yu
Dani Yogatama
VLM
ELM
32
0
0
17 Oct 2024
RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models
Haoran Hao
Jiaming Han
Changsheng Li
Yu-Feng Li
Xiangyu Yue
RALM
37
1
0
17 Oct 2024
When Does Perceptual Alignment Benefit Vision Representations?
Shobhita Sundaram
Stephanie Fu
Lukas Muttenthaler
Netanel Y. Tamir
Lucy Chai
Simon Kornblith
Trevor Darrell
Phillip Isola
36
8
1
14 Oct 2024
HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding
Keliang Li
Zaifei Yang
Jiahe Zhao
Hongze Shen
Ruibing Hou
Hong Chang
Shiguang Shan
Xilin Chen
VLM
23
0
0
09 Oct 2024
EMMA: Efficient Visual Alignment in Multi-Modal LLMs
Sara Ghazanfari
Alexandre Araujo
P. Krishnamurthy
Siddharth Garg
Farshad Khorrami
VLM
47
1
0
02 Oct 2024
1
2
3
4
Next