2304.10592
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
20 April 2023
Deyao Zhu
Jun Chen
Xiaoqian Shen
Xiang Li
Mohamed Elhoseiny
VLM
MLLM
Papers citing
"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models"
50 / 292 papers shown
Title
Visually Interpretable Subtask Reasoning for Visual Question Answering
Yu Cheng
A. Goel
Hakan Bilen
LRM
24
0
0
12 May 2025
Task-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks
Baoxia Du
H. Du
Dusit Niyato
Ruidong Li
51
0
0
05 May 2025
MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation
Mingcheng Li
Xiaolu Hou
Ziyang Liu
Dingkang Yang
Ziyun Qian
Jiawei Chen
Jinjie Wei
Y. Jiang
Qingyao Xu
L. Zhang
DiffM
85
0
0
05 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
X. Zhang
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
60
0
0
05 May 2025
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
Meng-Hao Guo
Jiajun Xu
Yi Zhang
Jiaxi Song
Haoyang Peng
...
Yongming Rao
Houwen Peng
Han Hu
Gordon Wetzstein
Shi-Min Hu
ELM
LRM
55
0
0
04 May 2025
RESAnything: Attribute Prompting for Arbitrary Referring Segmentation
Ruiqi Wang
Hao Zhang
VLM
52
0
0
03 May 2025
Transferable Adversarial Attacks on Black-Box Vision-Language Models
Kai Hu
Weichen Yu
L. Zhang
Alexander Robey
Andy Zou
Chengming Xu
Haoqi Hu
Matt Fredrikson
AAML
VLM
52
0
0
02 May 2025
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Vaidehi Patil
Yi-Lin Sung
Peter Hase
Jie Peng
Tianlong Chen
Mohit Bansal
AAML
MU
79
3
0
01 May 2025
AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care
Md Asaduzzaman Jabin
Hanqi Jiang
Y. Li
Patrick Kaggwa
Eugene Douglass
Juliet N. Sekandi
Tianming Liu
LM&MA
69
0
0
01 May 2025
Robotic Visual Instruction
Y. Li
Ziyang Gong
H. Li
Xiaoqi Huang
Haolan Kang
Guangping Bai
Xianzheng Ma
LM&Ro
69
0
0
01 May 2025
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
Haifeng Huang
Xinyi Chen
Y. Chen
H. Li
Xiaoshen Han
Z. Wang
Tai Wang
Jiangmiao Pang
Zhou Zhao
LM&Ro
75
0
0
30 Apr 2025
X-Fusion: Introducing New Modality to Frozen Large Language Models
Sicheng Mo
Thao Nguyen
Xun Huang
Siddharth Srinivasan Iyer
Yijun Li
...
Eli Shechtman
Krishna Kumar Singh
Yong Jae Lee
Bolei Zhou
Yuheng Li
71
0
0
29 Apr 2025
CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
Jianyu Wu
Yizhou Wang
Xiangyu Yue
Xinzhu Ma
J. Guo
Dongzhan Zhou
Wanli Ouyang
Shixiang Tang
66
0
0
29 Apr 2025
VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning
Run Luo
Renke Shan
Longze Chen
Z. Liu
Lu Wang
Min Yang
Xiaobo Xia
MLLM
VLM
92
0
0
28 Apr 2025
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?
Mohamed Gado
Towhid Taliee
Muhammad Memon
D. Ignatov
Radu Timofte
63
0
0
27 Apr 2025
AI Awareness
X. Li
Haoyuan Shi
Rongwu Xu
Wei Xu
54
0
0
25 Apr 2025
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Hanlei Zhang
Zhuohang Li
Yeshuang Zhu
Hua Xu
Peiwu Wang
Haige Zhu
Jie Zhou
Jinchao Zhang
25
0
0
23 Apr 2025
FaceInsight: A Multimodal Large Language Model for Face Perception
Jingzhi Li
Changjiang Luo
Ruoyu Chen
Hua Zhang
Wenqi Ren
Jianhou Gan
Xiaochun Cao
CVBM
LRM
57
0
0
22 Apr 2025
LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation
Jiachen Li
Qing Xie
Xiaohan Yu
Hongyun Wang
Jinyu Xu
Yongjian Liu
ObjD
76
0
0
20 Apr 2025
Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis
Zichuan Liu
Liming Jiang
Qing Yan
Yumin Jia
Hao Kang
Xin Lu
DiffM
29
0
0
19 Apr 2025
How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?
Rahul Thapa
Andrew Li
Qingyang Wu
B. He
Yuki Sahashi
...
Angela Zhang
Ben Athiwaratkun
S. Song
David Ouyang
James Y. Zou
LM&MA
45
0
0
19 Apr 2025
Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models
Zhanglin Wu
Tengfei Song
Ning Xie
Weidong Zhang
Mengli Zhu
...
Pengfei Li
C. Li
Junhao Zhu
Hao-Yu Yang
Shiliang Sun
26
1
0
16 Apr 2025
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination
Hao Yin
Guangzong Si
Zilei Wang
83
0
0
14 Apr 2025
Evolved Hierarchical Masking for Self-Supervised Learning
Zhanzhou Feng
Shiliang Zhang
37
0
0
12 Apr 2025
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang
C. Qu
Zuming Huang
Wei Chu
Fangzhen Lin
Wenhu Chen
OffRL
ReLM
SyDa
LRM
VLM
72
1
0
10 Apr 2025
Are We Done with Object-Centric Learning?
Alexander Rubinstein
Ameya Prabhu
Matthias Bethge
Seong Joon Oh
OCL
568
0
0
09 Apr 2025
PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
Xinpeng Ding
K. Zhang
Jianhua Han
Lanqing Hong
Hang Xu
X. Li
MLLM
VLM
112
0
0
08 Apr 2025
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang
Yushen Zuo
Yuanjun Chai
Z. Liu
Yichen Fu
Yichun Feng
Kin-Man Lam
AAML
VLM
40
0
0
02 Apr 2025
Towards Understanding How Knowledge Evolves in Large Vision-Language Models
Sudong Wang
Y. Zhang
Yao Zhu
Jianing Li
Zizhe Wang
Y. Liu
Xiangyang Ji
69
0
0
31 Mar 2025
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu
W. Zhang
Lumin Xu
Sheng Jin
Zhonghua Wu
Qingyi Tao
Wentao Liu
Wei Li
Chen Change Loy
VGen
91
2
0
27 Mar 2025
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
W. Zhang
Mengna Wang
Gangao Liu
Xu Huixin
Yiwei Jiang
...
Hang Zhang
Xin Li
Weiming Lu
Peng Li
Y. Zhuang
LM&Ro
LRM
65
2
0
27 Mar 2025
MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Jiayi Ji
Jie Lou
Debing Zhang
Rongrong Ji
90
0
0
26 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zheng Liu
Ao Li
Yang Tian
Bo Zhao
VGen
VLM
86
0
0
24 Mar 2025
Lie Detector: Unified Backdoor Detection via Cross-Examination Framework
X. U. Wang
Siyuan Liang
Dongping Liao
Han Fang
Aishan Liu
Xiaochun Cao
Yu-liang Lu
E. Chang
X. Gao
AAML
50
1
0
21 Mar 2025
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
Mingyang Song
Xiaoye Qu
Jiawei Zhou
Yu-Xi Cheng
VLM
50
1
0
17 Mar 2025
TLAC: Two-stage LMM Augmented CLIP for Zero-Shot Classification
Ans Munir
Faisal Z. Qureshi
M. H. Khan
Mohsen Ali
VLM
70
0
0
15 Mar 2025
Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection
Chuhan Zhang
Chaoyang Zhu
Pingcheng Dong
Long Chen
Dong Zhang
ObjD
VLM
105
0
0
14 Mar 2025
Direction-Aware Diagonal Autoregressive Image Generation
Yijia Xu
Jianzhong Ju
Jian Luan
J. Cui
47
0
0
14 Mar 2025
Learning to Inference Adaptively for Multimodal Large Language Models
Zhuoyan Xu
Khoi Duc Nguyen
Preeti Mukherjee
Saurabh Bagchi
Somali Chaterji
Yingyu Liang
Yin Li
LRM
42
1
0
13 Mar 2025
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Yiming Jia
J. Li
Xiang Yue
Bo Li
Ping Nie
Kai Zou
Wenhu Chen
LRM
74
2
0
13 Mar 2025
TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention
Jinhao Duan
Fei Kong
Hao-Ran Cheng
James Diffenderfer
B. Kailkhura
Lichao Sun
Xiaofeng Zhu
Xiaoshuang Shi
Kaidi Xu
89
0
0
13 Mar 2025
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction
Xuying Zhang
Yutong Liu
Yangguang Li
Renrui Zhang
Y. Liu
...
Wanli Ouyang
Zhiwei Xiong
Peng Gao
Qibin Hou
Ming-Ming Cheng
118
3
0
13 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
96
0
0
12 Mar 2025
Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation
Beitao Chen
Xinyu Lyu
Lianli Gao
Jingkuan Song
H. Shen
63
1
0
11 Mar 2025
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
Letian Zhang
Quan Cui
Bingchen Zhao
Cheng Yang
MLLM
SyDa
49
0
0
11 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad
Vibhav Vineet
Y. S. Rawat
VLM
84
1
0
11 Mar 2025
Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
Bardia Safaei
Faizan Siddiqui
Jiacong Xu
Vishal M. Patel
Shao-Yuan Lo
VLM
104
0
0
10 Mar 2025
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Z. Chen
Chunwei Wang
Xiuwei Chen
Hang Xu
J. Han
Xiaodan Liang
VLM
69
1
0
09 Mar 2025
Less is More: Adaptive Program Repair with Bug Localization and Preference Learning
Zhenlong Dai
Bingrui Chen
Zhuoluo Zhao
Xiu Tang
Sai Wu
Chang Yao
Zhipeng Gao
Jingyuan Chen
KELM
39
2
0
09 Mar 2025
Treble Counterfactual VLMs: A Causal Approach to Hallucination
Li Li
Jiashu Qu
Yuxiao Zhou
Yuehan Qin
Tiankai Yang
Yue Zhao
81
1
0
08 Mar 2025