arXiv:2302.11713 (v5, latest)
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
23 February 2023
Yang Chen
Hexiang Hu
Yi Luan
Haitian Sun
Soravit Changpinyo
Alan Ritter
Ming-Wei Chang
Papers citing "Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?"
50 / 58 papers shown
Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering
Dosung Lee
Sangwon Jung
Boyoung Kim
Minyoung Kim
Sungyeon Kim
Junyoung Sung
Paul Hongsuck Seo
170
0
0
28 Nov 2025
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Alberto Compagnoni
Marco Morini
Sara Sarto
Federico Cocchi
Davide Caffagni
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
RALM
LRM
261
4
0
27 Nov 2025
SciEGQA: A Dataset for Scientific Evidence-Grounded Question Answering and Reasoning
Wenhan Yu
Wang Chen
Guanqiang Qi
Weikang Li
Yang Li
Lei Sha
Deguo Xia
Jizhou Huang
222
4
0
19 Nov 2025
HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation
Linyin Luo
Yujuan Ding
Yunshan Ma
Wenqi Fan
Hanjiang Lai
AAML
276
1
0
19 Nov 2025
DeepEyesV2: Toward Agentic Multimodal Model
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2025
Jack Hong
Chenxiao Zhao
ChengLin Zhu
Weiheng Lu
Guohai Xu
Xing Yu
195
35
0
07 Nov 2025
Unified Reinforcement and Imitation Learning for Vision-Language Models
Byung-Kwan Lee
Ryo Hachiuma
Yong Man Ro
Yu-Chun Wang
Yueh-Hua Wu
VLM
225
4
0
22 Oct 2025
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Yiqi Lin
Alex Jinpeng Wang
Linjie Li
Zhengyuan Yang
Mike Zheng Shou
177
1
0
21 Oct 2025
A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications
Minhua Lin
Zongyu Wu
Zhichao Xu
Hui Liu
Xianfeng Tang
Qi He
Charu C. Aggarwal
Hui Liu
Xiang Zhang
Suhang Wang
AI4TS
LRM
634
9
0
19 Oct 2025
Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering
Yuyang Hong
Jiaqi Gu
Qi Yang
Lubin Fan
Yue-bo Wu
Ying Wang
Kun Ding
Shiming Xiang
Jieping Ye
263
10
0
16 Oct 2025
NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
Run Luo
Xiaobo Xia
Lu Wang
Longze Chen
Renke Shan
Jing Luo
Min Yang
Tat-Seng Chua
VGen
314
13
0
15 Oct 2025
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
Kartik Narayan
Yang Xu
Tian Cao
Kavya Nerella
Vishal M. Patel
Navid Shiee
Peter Grasch
Chao Jia
Yinfei Yang
Zhe Gan
ObjD
KELM
VLM
306
18
0
14 Oct 2025
CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation
Kaiwen Wei
Xiao-Yang Liu
Jie Zhang
Zijian Wang
Ruida Liu
...
C. Pan
Y. Zhang
Jiang Zhong
Peijin Wang
Yingchao Feng
VGen
VLM
166
0
0
10 Oct 2025
MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval
Siyue Zhang
Yuan Gao
Xiao Zhou
Yilun Zhao
Tingyu Song
Arman Cohan
Anh Tuan Luu
Chen Zhao
VLM
LRM
184
1
0
10 Oct 2025
Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval
Lanyun Zhu
Deyi Ji
Tianrun Chen
Haiyang Wu
Shiqi Wang
LRM
232
5
0
03 Oct 2025
Generalized Contrastive Learning for Universal Multimodal Retrieval
Jungsoo Lee
Janghoon Cho
Hyojin Park
Munawar Hayat
Kyuwoong Hwang
Fatih Porikli
Sungha Choi
VLM
234
4
0
30 Sep 2025
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Chenyue Zhou
Mingxuan Wang
Yanbiao Ma
Chenxu Wu
Wanyi Chen
...
Guoli Jia
Lingling Li
Z. Lu
Y. Lu
Wenhan Luo
LRM
634
14
0
29 Sep 2025
Recurrence Meets Transformers for Universal Multimodal Retrieval
Davide Caffagni
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
280
2
0
10 Sep 2025
Global-to-Local or Local-to-Global? Enhancing Image Retrieval with Efficient Local Search and Effective Global Re-ranking
Dror Aiger
Bingyi Cao
Kaifeng Chen
A. Araújo
248
1
0
04 Sep 2025
CMRAG: Co-modality-based visual document retrieval and question answering
Wang Chen
Guanqiang Qi
Yang Li
Lei Sha
Deguo Xia
Jizhou Huang
315
0
0
02 Sep 2025
Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering
Changin Choi
Wonseok Lee
Jungmin Ko
Wonjong Rhee
VLM
LRM
359
0
0
31 Aug 2025
mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering
Xu Yuan
Liangbo Ning
Wenqi Fan
Qing Li
244
11
0
07 Aug 2025
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Xin Guan
Peng Xia
Zhen Zhang
Xinyu Wang
Qiuchen Wang
...
Kuan Li
Yong Jiang
Pengjun Xie
Fei Huang
Jingren Zhou
399
59
0
07 Aug 2025
On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey
Meishan Zhang
Xin Zhang
X. Zhao
Shouzheng Huang
Baotian Hu
Min Zhang
370
4
0
28 Jul 2025
Augmented Vision-Language Models: A Systematic Review
Anthony C Davis
Burhan Sadiq
Tianmin Shu
Chien-Ming Huang
VLM
LRM
222
0
0
24 Jul 2025
Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown
Bowen Wang
Zhouqiang Jiang
Yasuaki Susumu
Shotaro Miwa
Tianwei Chen
Yuta Nakashima
412
2
0
21 Jun 2025
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee
Ryo Hachiuma
Yong Man Ro
Yu-Chun Wang
Yueh-Hua Wu
VLM
382
4
0
18 Jun 2025
CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yang Tian
Fan Liu
Jingyuan Zhang
Victoria A. Webster-Wood
Yupeng Hu
Liqiang Nie
VLM
303
13
0
03 Jun 2025
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Chan-wei Hu
Yueqi Wang
Shuo Xing
Chia-Ju Chen
Zhengzhong Tu
Ryan Rossi
Zhengzhong Tu
3DV
450
2
0
29 May 2025
Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
Chunyi Peng
Zhipeng Xu
Zhenghao Liu
Yishan Li
Shi Yu
Shuo Wang
Zhiyuan Liu
Yu Gu
Minghe Yu
Ge Yu
MoE
KELM
LRM
367
3
0
28 May 2025
Spa-VLM: Stealthy Poisoning Attacks on RAG-based VLM
Lei Yu
Yechao Zhang
Ziqi Zhou
Yang Wu
Wei Wan
Minghui Li
Shengshan Hu
Pei Xiaobing
Jing Wang
AAML
298
3
0
28 May 2025
MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning
Prasham Yatinkumar Titiya
Jainil Trivedi
Chitta Baral
Vivek Gupta
LMTD
282
8
0
27 May 2025
OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Wei Yang
Jingjing Fu
Rongpin Wang
Jinyu Wang
Lei Song
Jiang Bian
445
8
0
10 May 2025
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao
Isaac Chung
Imene Kerboua
Jamie Stirling
Xin Zhang
Márton Kardos
Roman Solomatin
Noura Al Moubayed
Kenneth Enevoldsen
Niklas Muennighoff
VLM
583
7
0
14 Apr 2025
Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook
Xu Zheng
Ziqiao Weng
Yuanhuiyi Lyu
Lutao Jiang
Haiwei Xue
Bin Ren
Danda Pani Paudel
Andrii Zadaianchuk
Luc Van Gool
Xuming Hu
3DV
418
35
0
23 Mar 2025
Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation
Yinuo Liu
Zenghui Yuan
Guiyao Tie
Jiawen Shi
Lichao Sun
Neil Zhenqiang Gong
641
7
0
08 Mar 2025
Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering
Zhengxuan Zhang
Yin Wu
Yuyu Luo
Nan Tang
440
0
0
28 Feb 2025
Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up
Lang Huang
Qiyu Wu
Zhongtao Miao
T. Yamasaki
1.0K
6
0
27 Feb 2025
Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference
Zhuo Chen
Xinyu Wang
Yong Jiang
Zhen Zhang
Xin Guan
Pengjun Xie
Fei Huang
Kewei Tu
517
8
0
25 Feb 2025
Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries
Yin Wu
Quanyu Long
Jing Li
Jianfei Yu
Wenya Wang
VLM
335
14
0
23 Feb 2025
Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines
AAAI Conference on Artificial Intelligence (AAAI), 2025
Xinwei Long
Zhiyuan Ma
Ermo Hua
Kaiyan Zhang
Biqing Qi
Bowen Zhou
RALM
418
16
0
23 Feb 2025
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Neural Information Processing Systems (NeurIPS), 2024
Henry Hengyuan Zhao
Pan Zhou
Difei Gao
Zechen Bai
Mike Zheng Shou
458
18
0
21 Feb 2025
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Mohammad Mahdi Abootorabi
Amirhosein Zobeiri
Mahdi Dehghani
Mohammadali Mohammadkhani
Bardia Mohammadi
Omid Ghahroodi
M. Baghshah
Ehsaneddin Asgari
RALM
841
43
0
12 Feb 2025
Generative Landmarks Guided Eyeglasses Removal 3D Face Reconstruction
Dapeng Zhao
Yue Qi
3DH
CVBM
3DV
373
1
0
31 Dec 2024
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
Xin Zhang
Yanzhao Zhang
Wen Xie
Mingxin Li
Ziqi Dai
Dingkun Long
Pengjun Xie
Meishan Zhang
Wenjie Li
Hao Fei
623
112
0
22 Dec 2024
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Ido Cohen
Daniela Gottesman
Mor Geva
Raja Giryes
VLM
549
7
1
18 Dec 2024
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
International Conference on Learning Representations (ICLR), 2024
Yangning Li
Hai-Tao Zheng
Xinyu Wang
Yong Jiang
Zhen Zhang
...
Hui Wang
Hai-Tao Zheng
Pengjun Xie
Philip S. Yu
Fei Huang
778
67
0
05 Nov 2024
MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
International Conference on Learning Representations (ICLR), 2024
Sheng-Chieh Lin
Chankyu Lee
Mohammad Shoeybi
Jimmy J. Lin
Bryan Catanzaro
Ming-Yu Liu
1.0K
106
0
04 Nov 2024
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
International Conference on Learning Representations (ICLR), 2024
Wenbo Hu
Jia-Chen Gu
Zi-Yi Dou
Mohsen Fayyaz
Pan Lu
Kai-Wei Chang
Nanyun Peng
VLM
399
36
0
10 Oct 2024
EchoSight: Advancing Visual-Language Models with Wiki Knowledge
Yibin Yan
Weidi Xie
RALM
380
52
0
17 Jul 2024
SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Xin Su
Man Luo
Kris W Pan
Tien Pei Chou
Vasudev Lal
Phillip Howard
394
9
0
28 Jun 2024