Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2304.10592
Cited By
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
20 April 2023
Deyao Zhu
Jun Chen
Xiaoqian Shen
Xiang Li
Mohamed Elhoseiny
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models"
50 / 295 papers shown
Title
Treble Counterfactual VLMs: A Causal Approach to Hallucination
Li Li
Jiashu Qu
Yuxiao Zhou
Yuehan Qin
Tiankai Yang
Yue Zhao
81
1
0
08 Mar 2025
Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations
Meng Wang
Fan Wu
Yunchuan Qin
Ruihui Li
Zhuo Tang
KenLi Li
3DPC
91
0
0
08 Mar 2025
Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
Liming Lu
Shuchao Pang
Siyuan Liang
Haotian Zhu
Xiyu Zeng
Aishan Liu
Yunhuai Liu
Yongbin Zhou
AAML
49
1
0
05 Mar 2025
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
Wei-Yao Wang
Zhao Wang
Helen Suzuki
Yoshiyuki Kobayashi
LRM
50
1
0
04 Mar 2025
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Haoxin Li
Boyang Li
CoGe
69
0
0
03 Mar 2025
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
Zhipeng Huang
Shaobin Zhuang
Canmiao Fu
Binxin Yang
Ying Zhang
Chong Sun
Zhizheng Zhang
Yali Wang
Chen Li
Zheng-Jun Zha
DiffM
69
1
0
03 Mar 2025
Re-Imagining Multimodal Instruction Tuning: A Representation View
Yiyang Liu
James Liang
Ruixiang Tang
Yugyung Lee
Majid Rabbani
...
Raghuveer M. Rao
Lifu Huang
Dongfang Liu
Qifan Wang
Cheng Han
78
0
0
02 Mar 2025
MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge
Yuntao Du
Kailin Jiang
Zhi Gao
Chenrui Shi
Zilong Zheng
Siyuan Qi
Qing Li
KELM
65
2
0
27 Feb 2025
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
Zhongyang Li
Ziyue Li
Tianyi Zhou
MoE
44
0
0
27 Feb 2025
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
Zhaoyi Liu
Huan Zhang
AAML
72
0
0
25 Feb 2025
Can Large Language Models Extract Customer Needs as well as Professional Analysts?
Artem Timoshenko
Chengfeng Mao
J. Hauser
ELM
50
0
0
25 Feb 2025
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin
Zehao Xiao
Pan Zhou
Shujian Yu
Jiayi Shen
J. Sonke
E. Gavves
34
0
0
24 Feb 2025
Tracking the Copyright of Large Vision-Language Models through Parameter Learning Adversarial Images
Yubo Wang
Jianting Tang
Chaohu Liu
Linli Xu
AAML
51
1
0
23 Feb 2025
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Henry Hengyuan Zhao
Pan Zhou
Difei Gao
Zechen Bai
Mike Zheng Shou
77
8
0
21 Feb 2025
Understanding and Rectifying Safety Perception Distortion in VLMs
Xiaohan Zou
Jian Kang
George Kesidis
Lu Lin
126
1
0
18 Feb 2025
Magma: A Foundation Model for Multimodal AI Agents
Jianwei Yang
Reuben Tan
Qianhui Wu
Ruijie Zheng
Baolin Peng
...
Seonghyeon Ye
Joel Jang
Yuquan Deng
Lars Liden
Jianfeng Gao
VLM
AI4TS
107
9
0
18 Feb 2025
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis
Angelos Zavras
Dimitrios Michail
Xiao Xiang Zhu
Begum Demir
Ioannis Papoutsis
VLM
81
0
0
13 Feb 2025
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Zhenxing Mi
Kuan-Chieh Jackson Wang
Guocheng Qian
Hanrong Ye
Runtao Liu
Sergey Tulyakov
Kfir Aberman
Dan Xu
LRM
42
0
0
12 Feb 2025
Survey on AI-Generated Media Detection: From Non-MLLM to MLLM
Yueying Zou
Peipei Li
Zekun Li
Huaibo Huang
Xing Cui
Xuannan Liu
Chenghanyu Zhang
Ran He
DeLMO
118
1
0
07 Feb 2025
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models
Tzu-Tao Chang
Shivaram Venkataraman
VLM
111
0
0
04 Feb 2025
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
Jinyang Wu
Mingkuan Feng
Shuai Zhang
Ruihan Jin
Feihu Che
Zengqi Wen
J. Tao
LRM
68
8
0
04 Feb 2025
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Xuan Chen
Yuzhou Nie
Wenbo Guo
Xiangyu Zhang
110
9
0
28 Jan 2025
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
MLLM
VLM
LRM
109
4
0
21 Jan 2025
Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance
Jin Zhu
Huimin Ma
Jiansheng Chen
Jian Yuan
71
4
0
20 Jan 2025
Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection
Yuanze Li
Haolin Wang
Shihao Yuan
Ming-Yu Liu
Debin Zhao
Yiwen Guo
Chen Xu
Guangming Shi
Wangmeng Zuo
79
28
0
20 Jan 2025
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Junyi Ao
Yuancheng Wang
Xiaohai Tian
Dekun Chen
J. Zhang
Lu Lu
Y. Wang
Haizhou Li
Z. Wu
AuLLM
80
16
0
17 Jan 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD
VLM
127
2
0
14 Jan 2025
GestLLM: Advanced Hand Gesture Interpretation via Large Language Models for Human-Robot Interaction
Oleg Kobzarev
Artem Lykov
Dzmitry Tsetserukou
VLM
45
1
0
13 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
D. Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
104
107
0
10 Jan 2025
AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning
Muhammad Awais
Ali Husain Salem Abdulla Alharthi
Amandeep Kumar
Hisham Cholakkal
Rao Muhammad Anwer
VLM
65
3
0
10 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
79
2
0
10 Jan 2025
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Ruilin Luo
Zhuofan Zheng
Yifan Wang
Yiyao Yu
Xinzhe Ni
Zicheng Lin
Jin Zeng
Yujiu Yang
LRM
68
12
0
08 Jan 2025
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang
Qingkai Fang
Zhe Yang
Yang Feng
MLLM
VLM
59
25
0
07 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
91
46
0
03 Jan 2025
Instruction-Guided Scene Text Recognition
Yongkun Du
Z. Chen
Yuchen Su
Caiyan Jia
Yu-Gang Jiang
71
3
0
03 Jan 2025
Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs
Linhao Huang
Xue Jiang
Zhiqiang Wang
Wentao Mo
Xi Xiao
Bo Han
Yongjie Yin
Feng Zheng
AAML
42
2
0
02 Jan 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi
Zhixiong Zhang
Ye Fang
Jiaqi Wang
Hengshuang Zhao
83
6
0
02 Jan 2025
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Wenqi Zhang
Hang Zhang
Xin Li
Jiashuo Sun
Yongliang Shen
Weiming Lu
Deli Zhao
Yueting Zhuang
Lidong Bing
VLM
37
2
0
01 Jan 2025
In-Context Learning with Iterative Demonstration Selection
Chengwei Qin
Aston Zhang
C. L. P. Chen
Anirudh Dagar
Wenming Ye
LRM
68
38
0
31 Dec 2024
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzdeh
VLM
58
24
0
31 Dec 2024
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
Le Yang
Ziwei Zheng
Boxu Chen
Zhengyu Zhao
Chenhao Lin
Chao Shen
VLM
138
3
0
18 Dec 2024
Empowering LLMs to Understand and Generate Complex Vector Graphics
Ximing Xing
Juncheng Hu
Guotao Liang
Jing Zhang
Dong Xu
Qian Yu
92
7
0
15 Dec 2024
Olympus: A Universal Task Router for Computer Vision Tasks
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Philip H. S. Torr
VLM
ObjD
144
0
0
12 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Ying Shan
Xihui Liu
LLMAG
LRM
94
5
0
05 Dec 2024
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation
Q. He
Jinlong Peng
P. Xu
Boyuan Jiang
Xiaobin Hu
...
Y. Liu
Y. Wang
Chengjie Wang
X. Li
J. Zhang
DiffM
120
1
0
04 Dec 2024
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
Chunlin Yu
Hanqing Wang
Ye Shi
Haoyang Luo
Sibei Yang
Jingyi Yu
Jingya Wang
LRM
LM&Ro
79
1
0
02 Dec 2024
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Sanghwan Kim
Rui Xiao
Mariana-Iuliana Georgescu
Stephan Alaniz
Zeynep Akata
VLM
70
0
0
02 Dec 2024
ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model
Kunyang Han
Yibo Hu
Mengxue Qu
Hailin Shi
Yao Zhao
Y. X. Wei
MLLM
VLM
3DV
83
1
0
29 Nov 2024
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
Yawen Shao
Wei-dong Zhai
Yuhang Yang
Hongchen Luo
Yang Cao
Zheng-jun Zha
86
1
0
29 Nov 2024
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
Siqi Kou
Jiachun Jin
Chang Liu
Ye Ma
Jian Jia
Quan Chen
Peng Jiang
Zhijie Deng
Zhijie Deng
DiffM
VGen
VLM
116
5
0
28 Nov 2024
Previous
1
2
3
4
5
6
Next