2405.11985
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
20 May 2024
Jingqun Tang
Qi-dong Liu
Yongjie Ye
Jinghui Lu
Shubo Wei
Chunhui Lin
Wanqing Li
Mohamad Fitri Faiz Bin Mahmood
Hao Feng
Zhen Zhao
Yanjie Wang
Yuliang Liu
Hao Liu
Xiang Bai
Can Huang
Papers citing
"MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering"
50 / 82 papers shown
Title
Jina-VLM: Small Multilingual Vision Language Model
Andreas Koukounas
Georgios Mastrapas
Florian Hönicke
Sedigheh Eslami
Guillaume Roncari
Scott Martens
Han Xiao
MLLM
311
0
0
03 Dec 2025
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Yunze Man
S. S. Wang
Guowen Zhang
Johan Bjorck
Zhiqi Li
Liang-Yan Gui
Jim Fan
Jan Kautz
Yu Wang
Zhiding Yu
121
0
0
25 Nov 2025
DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation
Yongkun Du
Pinxuan Chen
Xuye Ying
Z. Chen
124
0
0
23 Nov 2025
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
Ming Zhong
Y. Wang
Liuzhou Zhang
Arctanx An
Renrui Zhang
Hao Liang
Ming Lu
Ying Shen
Wentao Zhang
216
0
0
22 Nov 2025
Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025
David Acuna
Chao-Han Huck Yang
Yuntian Deng
Jaehun Jung
Ximing Lu
Prithviraj Ammanabrolu
Hyunwoo J. Kim
Yuan-Hong Liao
Yejin Choi
ReLM
OffRL
LRM
331
1
0
07 Nov 2025
Qianfan-VL: Domain-Enhanced Universal Vision-Language Models
Daxiang Dong
Mingming Zheng
Dong Xu
Bairong Zhuang
W. Zhang
...
Ruchang Yao
Ziye Yuan
J. Wu
Guangjun Xie
Dou Shen
VLM
83
1
0
19 Sep 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang
Zhangwei Gao
Lixin Gu
Hengjun Pu
Long Cui
...
Bowen Zhou
Kai Chen
Yu Qiao
Wenhai Wang
Gen Luo
MLLM
LRM
290
246
0
25 Aug 2025
A Metric for MLLM Alignment in Large-scale Recommendation
Yubin Zhang
Yanhua Huang
Haiming Xu
Mingliang Qi
Chang Wang
Jiarui Jin
Xiangyuan Ren
Xiaodan Wang
Ruiwen Xu
OffRL
93
0
0
07 Aug 2025
HW-MLVQA: Elucidating Multilingual Handwritten Document Understanding with a Comprehensive VQA Benchmark
Aniket Pal
Ajoy Mondal
Minesh Mathew
C. V. Jawahar
VLM
88
0
0
21 Jul 2025
The Multilingual Divide and Its Impact on Global AI Safety
Aidan Peppin
Julia Kreutzer
Alice Schoenauer Sebag
Kelly Marchisio
Beyza Ermis
...
Wei-Yin Ko
Ahmet Üstün
Matthias Gallé
Marzieh Fadaee
Sara Hooker
ELM
300
2
0
27 May 2025
Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts
IEEE Pacific Visualization Symposium (PacificVis), 2025
Seon Gyeom Kim
Jae Young Choi
Ryan Rossi
Eunyee Koh
Tak Yeon Lee
259
2
0
23 May 2025
Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning
Jinghui Lu
Haiyang Yu
Siliang Xu
Shiwei Ran
Guozhi Tang
...
Teng Fu
Hao Feng
Jingqun Tang
Hongru Wang
Can Huang
LRM
377
13
0
21 May 2025
VoQA: Visual-only Question Answering
Jianing An
Luyang Jiang
Jie Luo
Wenjun Wu
Lei Huang
LRM
311
0
0
20 May 2025
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hao Feng
Shu Wei
Xiang Fei
Wei Shi
Yingdong Han
...
Qi Liu
Chunhui Lin
Jingqun Tang
Hao Liu
Can Huang
312
18
0
20 May 2025
Advancing Sequential Numerical Prediction in Autoregressive Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xiang Fei
Jinghui Lu
Qi Sun
Hao Feng
Yanjie Wang
Wei Shi
An-Lan Wang
Jingqun Tang
Can Huang
AI4TS
547
5
0
19 May 2025
Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?
Haibin He
Maoyuan Ye
Jing Zhang
Xiantao Cai
Juhua Liu
Bo Du
Dacheng Tao
LRM
357
3
0
19 May 2025
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
Maoyuan Ye
Jing Zhang
Juhua Liu
Bo Du
Dacheng Tao
LRM
542
1
0
18 May 2025
WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
An-Lan Wang
Jingqun Tang
Liao Lei
Hao Feng
Qi Liu
...
Wen Liu
Hao Liu
Wenshu Fan
Xiang Bai
Can Huang
379
3
0
16 May 2025
PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language
Ijazul Haq
Yingjie Zhang
Irfan Ali Khan
308
0
0
15 May 2025
Seed1.5-VL Technical Report
D. Guo
Faming Wu
Feida Zhu
Fuxing Leng
Guang Shi
...
Kai Hua
Kai Liu
Kai Shen
Jianchao Tan
Ke Shen
MLLM
VLM
LRM
191
158
0
11 May 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Qi Zhang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
528
22
0
26 Apr 2025
Benchmarking Vision Language Models on German Factual Data
Artificial Intelligence Applications and Innovations (AIAI), 2025
René Peinl
Vincent Tischler
CoGe
331
1
0
15 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Ziwei Liu
Shenglong Ye
...
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
Wei Wang
MLLM
VLM
537
770
1
14 Apr 2025
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Computer Vision and Pattern Recognition (CVPR), 2025
Fengxiang Wang
Hongru Wang
Mingshuo Chen
Haiyan Zhao
Yulin Wang
...
L. Lan
Wenjing Yang
Jing Zhang
Zhiyuan Liu
Maosong Sun
316
24
0
31 Mar 2025
TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation
Yuheng Feng
Jianhui Wang
Kun Li
Sida Li
Tianyu Shi
Haoyue Han
Miao Zhang
Xueqian Wang
DiffM
1.1K
0
0
22 Mar 2025
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
Feng Ni
Kui Huang
Yao Lu
Wenyu Lv
Guanzhong Wang
Zeyu Chen
Wenshu Fan
VLM
440
2
0
06 Mar 2025
Task-Oriented 6-DoF Grasp Pose Detection in Clutters
IEEE International Conference on Robotics and Automation (ICRA), 2025
An-Lan Wang
Nuo Chen
Kun-Yu Lin
Li Yuan-Ming
Wei-Shi Zheng
305
5
0
24 Feb 2025
Cross-Modal Synergies: Unveiling the Potential of Motion-Aware Fusion Networks in Handling Dynamic and Static ReID Scenarios
Fuxi Ling
Hongye Liu
Guoqiang Huang
Jing Li
Hong Wu
Zhihao Tang
419
0
0
02 Feb 2025
Detection of AI Deepfake and Fraud in Online Payments Using GAN-Based Models
Zong Ke
Shicheng Zhou
Yining Zhou
Chia Hong Chang
Rong Zhang
297
30
0
13 Jan 2025
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
Ling Fu
Biao Yang
Zhebin Kuang
Jiajun Song
Yuzhe Li
...
Jingqun Tang
Wei Chen
Lianwen Jin
Yunxing Liu
Xiang Bai
339
22
0
31 Dec 2024
Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance
AAAI Conference on Artificial Intelligence (AAAI), 2024
Wenhao Sun
Benlei Cui
Xue-Mei Dong
Jingqun Tang
DiffM
755
29
0
17 Dec 2024
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
Bin Shan
Xiang Fei
Wei Shi
An-Lan Wang
Guozhi Tang
Lei Liao
Jingqun Tang
Xiang Bai
Can Huang
VLM
221
7
0
15 Oct 2024
SELU: Self-Learning Embodied MLLMs in Unknown Environments
Boyu Li
Haobin Jiang
Weishuai Zeng
Haoran Li
Dongbin Zhao
Zongqing Lu
LRM
184
6
0
04 Oct 2024
A Survey on Multimodal Benchmarks: In the Era of Large AI Models
Lin Li
Guikun Chen
Hanrong Shi
Jun Xiao
Long Chen
335
23
0
21 Sep 2024
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang
Jingyi Zhang
LM&MA
ELM
LRM
298
42
0
28 Aug 2024
ParGo: Bridging Vision-Language with Partial and Global Views
AAAI Conference on Artificial Intelligence (AAAI), 2024
An-Lan Wang
Bin Shan
Wei Shi
Kun-Yu Lin
Xiang Fei
Guozhi Tang
Lei Liao
Jingqun Tang
Can Huang
Wei-Shi Zheng
MLLM
VLM
502
21
0
23 Aug 2024
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
International Conference on Learning Representations (ICLR), 2024
Yi-Fan Zhang
Huanyu Zhang
Haochen Tian
Chaoyou Fu
Shuangqing Zhang
...
Qingsong Wen
Zhang Zhang
Liwen Wang
Rong Jin
Tieniu Tan
OffRL
346
131
0
23 Aug 2024
Contextual Bandits for Unbounded Context Distributions
Puning Zhao
Yan Han
Zhe Liu
Huiwen Wu
Qin Zhang
Zong Ke
Tianhang Zheng
517
12
0
19 Aug 2024
Harmonizing Visual Text Comprehension and Generation
Zhen Zhao
Jingqun Tang
Binghong Wu
Chunhui Lin
Shubo Wei
Hao Liu
Xin Tan
Zhizhong Zhang
Can Huang
Yuan Xie
VLM
312
37
0
23 Jul 2024
IMAGDressing-v1: Customizable Virtual Dressing
Fei Shen
Xin Jiang
Xin He
Hu Ye
Cong Wang
Yutong Gao
Zechao Li
Jinghui Tang
DiffM
261
99
0
17 Jul 2024
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Haodong Duan
Xinyu Fang
Junming Yang
Xiangyu Zhao
Lin Chen
...
Yuhang Zang
Pan Zhang
Jiaqi Wang
Dahua Lin
Kai Chen
LM&MA
VLM
708
354
0
16 Jul 2024
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
Jinghui Lu
Haiyang Yu
Yanjie Wang
Yongjie Ye
Jingqun Tang
...
Qi Liu
Hao Feng
Han Wang
Hao Liu
Can Huang
608
34
0
02 Jul 2024
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Jiaxin Zhang
Wentao Yang
Songxuan Lai
Zecheng Xie
Lianwen Jin
359
28
0
27 Jun 2024
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
Neural Information Processing Systems (NeurIPS), 2024
David Romero
Chenyang Lyu
Haryo Akbarianto Wibowo
Teresa Lynn
Injy Hamed
...
Oana Ignat
Joan Nwatu
Amélie Reymond
Thamar Solorio
Alham Fikri Aji
301
84
0
10 Jun 2024
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
Weichao Zhao
Hao Feng
Qi Liu
Jingqun Tang
Shubo Wei
...
Lei Liao
Yongjie Ye
Hao Liu
Houqiang Li
Can Huang
LMTD
265
46
0
03 Jun 2024
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images
Huy Quang Pham
Thang Kien-Bao Nguyen
Quan Van Nguyen
Dan Quang Tran
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
209
10
0
29 Apr 2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
...
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM
VLM
514
975
0
25 Apr 2024
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Jingqun Tang
Chunhui Lin
Zhen Zhao
Shubo Wei
Binghong Wu
...
Yuliang Liu
Xiang Bai
Can Huang
LRM
VLM
MLLM
444
42
0
19 Apr 2024
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Neural Information Processing Systems (NeurIPS), 2024
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Sijin Yu
...
Xingcheng Zhang
Jifeng Dai
Yuxin Qiao
Dahua Lin
Yuan Liu
VLM
MLLM
260
159
0
09 Apr 2024
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Yanwei Li
Yuechen Zhang
Chengyao Wang
Zhisheng Zhong
Yixin Chen
Ruihang Chu
Shaoteng Liu
Jiaya Jia
VLM
MLLM
MoE
382
323
0
27 Mar 2024