Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2206.01718
Cited By
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
3 June 2022
Dustin Schwenk
Apoorv Khandelwal
Christopher Clark
Kenneth Marino
Roozbeh Mottaghi
Re-assign community
ArXiv
PDF
HTML
Papers citing
"A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge"
50 / 395 papers shown
Title
Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen
Shuai Zhang
Boran Han
Bernie Wang
26
0
0
11 May 2025
OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval
Wei Yang
Jingjing Fu
R. Wang
Jinyu Wang
Lei Song
Jiang Bian
24
0
0
10 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
X. Zhang
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
74
0
0
05 May 2025
SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning
Jinpeng Chen
Runmin Cong
Yuzhi Zhao
Hongzheng Yang
Guangneng Hu
H. Ip
Sam Kwong
CLL
KELM
83
0
0
05 May 2025
Rethinking Visual Layer Selection in Multimodal LLMs
H. Chen
Junyan Lin
Xinhao Chen
Yue Fan
Xin Jin
Hui Su
Jianfeng Dong
Jinlan Fu
Xiaoyu Shen
VLM
95
0
0
30 Apr 2025
DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
Geng Li
Jinglin Xu
Yunzhen Zhao
Yuxin Peng
ObjD
32
0
0
21 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
103
0
0
17 Apr 2025
Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models -
Laura Fieback
Nishilkumar Balar
Jakob Spiegelberg
Hanno Gottschalk
MLLM
VLM
78
0
0
16 Apr 2025
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination
Hao Yin
Gunagzong Si
Zilei Wang
142
0
0
14 Apr 2025
Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models
Teppei Suzuki
Keisuke Ozawa
VLM
46
0
0
14 Apr 2025
MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework
Zihan Ling
Zhiyao Guo
Yixuan Huang
Yi An
Shuai Xiao
Jinsong Lan
Xiaoyong Zhu
Bo Zheng
RALM
VLM
55
0
0
14 Apr 2025
Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models
Xingguang Ji
Jiakang Wang
Hongzhi Zhang
Jingyuan Zhang
Haonan Zhou
Chenxi Sun
Y. Liu
Qi Wang
Fuzheng Zhang
MLLM
VLM
58
0
0
10 Apr 2025
Data Metabolism: An Efficient Data Design Schema For Vision Language Model
Jingyuan Zhang
Hongzhi Zhang
Zhou Haonan
Chenxi Sun
Xingguang Ji
Jiakang Wang
Fanheng Kong
Y. Liu
Qi Wang
Fuzheng Zhang
VLM
60
1
0
10 Apr 2025
Resource-efficient Inference with Foundation Model Programs
Lunyiu Nie
Zhimin Ding
Kevin Yu
Marco Cheung
C. Jermaine
S. Chaudhuri
30
0
0
09 Apr 2025
Classifying the Unknown: In-Context Learning for Open-Vocabulary Text and Symbol Recognition
Tom Simon
William Mocaer
Pierrick Tranouez
Clément Chatelain
Thierry Paquet
MLLM
VLM
51
0
0
09 Apr 2025
LiveVQA: Live Visual Knowledge Seeking
Mingyang Fu
Yuyang Peng
Benlin Liu
Yao Wan
D. Z. Chen
28
0
0
07 Apr 2025
M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models
Yanshu Li
Hongyang He
Yi Cao
Qisen Cheng
Xiang Fu
Ruixiang Tang
VLM
40
0
0
06 Apr 2025
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang
Yushen Zuo
Yuanjun Chai
Z. Liu
Yichen Fu
Yichun Feng
Kin-Man Lam
AAML
VLM
42
0
0
02 Apr 2025
FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics
Yixuan Li
Yu Tian
Yipo Huang
Wei Lu
Shiqi Wang
Weisi Lin
Anderson de Rezende Rocha
59
0
0
31 Mar 2025
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
J. Huang
Baoxiong Jia
Y. Wang
Ziyu Zhu
Xiongkun Linghu
Qing Li
Song-Chun Zhu
Siyuan Huang
84
3
0
28 Mar 2025
Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
Antonia Karamolegkou
Malvina Nikandrou
Georgios Pantazopoulos
Danae Sanchez Villegas
Phillip Rust
Ruchira Dhar
Daniel Hershcovich
Anders Søgaard
39
0
0
28 Mar 2025
Learning to Instruct for Visual Instruction Tuning
Zhihan Zhou
Feng Hong
Jiaan Luo
Jiangchao Yao
Dongsheng Li
Bo Han
Y. Zhang
Yanfeng Wang
VLM
66
0
0
28 Mar 2025
Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization
Iñigo Pikabea
Iñaki Lacunza
Oriol Pareras
Carlos Escolano
Aitor Gonzalez-Agirre
Javier Hernando
Marta Villegas
VLM
52
0
0
28 Mar 2025
MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX
Liuyue Xie
George Z. Wei
Avik Kuthiala
Ce Zheng
Ananya Bal
...
Rohan Choudhury
Morteza Ziyadi
Xu Zhang
Hao Yang
László A. Jeni
64
0
0
27 Mar 2025
MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Jiayi Ji
Jie Lou
Debing Zhang
Rongrong Ji
95
0
0
26 Mar 2025
Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
Zitian Wang
Yue Liao
Kang Rong
Fengyun Rao
Yibo Yang
Si Liu
72
0
0
26 Mar 2025
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
Fucai Ke
Vijay Kumar B G
Xingjian Leng
Zhixi Cai
Zaid Khan
Weiqing Wang
P. D. Haghighi
H. Rezatofighi
Manmohan Chandraker
44
0
0
25 Mar 2025
Bridging Writing Manner Gap in Visual Instruction Tuning by Creating LLM-aligned Instructions
Dong Jing
Nanyi Fei
Zhiwu Lu
42
0
0
24 Mar 2025
Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook
Xu Zheng
Ziqiao Weng
Yuanhuiyi Lyu
Lutao Jiang
Haiwei Xue
Bin Ren
Danda Pani Paudel
N. Sebe
Luc Van Gool
Xuming Hu
3DV
39
1
0
23 Mar 2025
Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models
Jin Wang
Chenghui Lv
Xian Li
Shichao Dong
Huadong Li
Kelu Yao
Chao Li
Wenqi Shao
Ping Luo
62
1
0
19 Mar 2025
Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering
Thanh-Son Nguyen
Hong Yang
Tzeh Yuan Neoh
Hao Zhang
Ee Yeo Keat
Basura Fernando
NAI
56
0
0
19 Mar 2025
Squeeze Out Tokens from Sample for Finer-Grained Data Governance
Weixiong Lin
Chen Ju
Haicheng Wang
Shengchao Hu
Shuai Xiao
...
Yuheng Jiao
Mingshuai Yao
Jinsong Lan
Qingwen Liu
Ying Chen
50
0
0
18 Mar 2025
ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models
Hao Yin
Guangzong Si
Zilei Wang
137
1
0
17 Mar 2025
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
Lijie Fan
Luming Tang
Siyang Qin
Tianhong Li
Xuan S. Yang
...
Tao Zhu
Michael Rubinstein
Michalis Raptis
Deqing Sun
Radu Soricut
54
4
0
17 Mar 2025
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
Hao Yin
Guangzong Si
Zilei Wang
46
0
0
17 Mar 2025
Federated Continual Instruction Tuning
Haiyang Guo
Fanhu Zeng
Fei Zhu
Wenzhuo Liu
Da-Han Wang
Jian Xu
Xu-Yao Zhang
Cheng-Lin Liu
CLL
FedML
65
1
0
17 Mar 2025
A Survey on the Optimization of Large Language Model-based Agents
Shangheng Du
Jiabao Zhao
Jinxin Shi
Zhentao Xie
Xin Jiang
Yanhong Bai
Liang He
LLMAG
LM&Ro
LM&MA
209
1
0
16 Mar 2025
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Y. Wang
Shengqiong Wu
Y. Zhang
William Yang Wang
Ziwei Liu
Jiebo Luo
Hao Fei
LRM
92
8
0
16 Mar 2025
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
Xudong Tan
Peng Ye
Chongjun Tu
Jianjian Cao
Yaoxin Yang
Lin Zhang
Dongzhan Zhou
Tao Chen
VLM
56
0
0
13 Mar 2025
Teaching LMMs for Image Quality Scoring and Interpreting
Zicheng Zhang
H. Wu
Ziheng Jia
Weisi Lin
Guangtao Zhai
60
1
0
12 Mar 2025
Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
Bardia Safaei
Faizan Siddiqui
Jiacong Xu
Vishal M. Patel
Shao-Yuan Lo
VLM
166
0
0
10 Mar 2025
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang
Bohan Jia
Zijie Zhai
Shaosheng Cao
Zheyu Ye
Fei Zhao
Zhe Xu
Yao Hu
Shaohui Lin
MU
OffRL
LRM
MLLM
ReLM
VLM
57
41
0
09 Mar 2025
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
Yanling Wang
Yihan Zhao
Xiaodong Chen
Shasha Guo
Lixin Liu
Haoyang Li
Yong Xiao
J. Zhang
Qi Li
Ke Xu
44
1
0
09 Mar 2025
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
Junyan Lin
Haoran Chen
Yue Fan
Yingqi Fan
Xin Jin
Hui Su
Jinlan Fu
Xiaoyu Shen
63
0
0
08 Mar 2025
Treble Counterfactual VLMs: A Causal Approach to Hallucination
Li Li
Jiashu Qu
Yuxiao Zhou
Yuehan Qin
Tiankai Yang
Yue Zhao
92
2
0
08 Mar 2025
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
Feng Ni
Kui Huang
Yao Lu
Wenyu Lv
Guanzhong Wang
Zeyu Chen
Y. Liu
VLM
48
0
0
06 Mar 2025
Multi-modal Summarization in Model-Based Engineering: Automotive Software Development Case Study
Nenad Petrovic
Yurui Zhang
Moaad Maaroufi
Kuo-Yi Chao
Lukasz Mazur
F. Pan
Vahid Zolfaghari
Alois C. Knoll
62
0
0
06 Mar 2025
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Seil Kang
Jinyeong Kim
Junhyeok Kim
Seong Jae Hwang
VLM
107
6
0
05 Mar 2025
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
Ailin Deng
Tri Cao
Zhirui Chen
Bryan Hooi
VLM
96
2
0
04 Mar 2025
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
Wei-Yao Wang
Zhao Wang
Helen Suzuki
Yoshiyuki Kobayashi
LRM
55
1
0
04 Mar 2025
1
2
3
4
5
6
7
8
Next