Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning

24 May 2025
Ye Mo
Zirui Shao
Kai Ye
Xianwei Mao
Bo Zhang
Hangdi Xing
Peng Ye
Gang Huang
Kehan Chen
Zhou Huan
Zixu Yan
Sheng Zhou
    LRM

Papers citing "Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning"

42 / 42 papers shown
How Real Is AI Tutoring? Comparing Simulated and Human Dialogues in One-on-One Instruction
Ruijia Li
Yuan-Hao Jiang
Jiatong Wang
Bo Jiang
132
0
0
02 Sep 2025
DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding
Junyu Xiong
Yonghui Wang
Weichao Zhao
Chenyu Liu
Bing Yin
Wengang Zhou
Houqiang Li
LRM
201
4
0
10 Aug 2025
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Longji Xu
Shengqiong Wu
Yujiao Shi
William Yang Wang
Ziwei Liu
Jiebo Luo
Hao Fei
LRM
602
112
0
16 Mar 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
...
Shiyu Wang
S. Yu
Shunfeng Zhou
Shuting Pan
S.S. Li
OffRL AI4TS LRM ReLM VLM
1.2K
5,342
0
22 Jan 2025
Object-level Visual Prompts for Compositional Image Generation
Gaurav Parmar
Or Patashnik
Kuan-Chieh Wang
Daniil Ostashev
Srinivasa Narasimhan
Jun-Yan Zhu
Daniel Cohen-Or
Kfir Aberman
DiffM
197
14
0
03 Jan 2025
C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness
AAAI Conference on Artificial Intelligence (AAAI), 2024
Yu Kang
Xianghui Sun
Liangyu Chen
Wei Zou
LRM
472
114
0
16 Dec 2024
Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
AAAI Conference on Artificial Intelligence (AAAI), 2024
Hai-Ming Xu
Qi Chen
Lei Wang
Lingqiao Liu
297
9
0
14 Dec 2024
DOGR: Towards Versatile Visual Document Grounding and Referring
Yinan Zhou
Yuxin Chen
Haokun Lin
Shuyu Yang
Li Zhu
Chen Ma
Mingyu Ding
Ying Shan
ObjD
557
4
0
26 Nov 2024
MinerU: An Open-Source Solution for Precise Document Content Extraction
Bin Wang
Chao Xu
Xiaomeng Zhao
Linke Ouyang
Fan Wu
...
Wei Li
Botian Shi
Yu Qiao
Dahua Lin
Conghui He
192
133
0
27 Sep 2024
Attention Prompting on Image for Large Vision-Language Models
European Conference on Computer Vision (ECCV), 2024
Runpeng Yu
Weihao Yu
Xinchao Wang
VLM
393
28
0
25 Sep 2024
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLM SyDa VLM
573
1,767
0
06 Aug 2024
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
Renshan Zhang
Yibo Lyu
Rui Shao
Gongwei Chen
Weili Guan
Liqiang Nie
222
19
0
19 Jul 2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
...
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM VLM
530
994
0
25 Apr 2024
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding
Bozhi Luan
Hao Feng
Hong Chen
Yonghui Wang
Wen-gang Zhou
Houqiang Li
MLLM
244
26
0
15 Apr 2024
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Chuwei Luo
Yufan Shen
Zhaoqing Zhu
Qi Zheng
Zhi Yu
Cong Yao
379
98
0
08 Apr 2024
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
Haiyang Xu
Jiabo Ye
Mingshi Yan
Liang Zhang
...
Chen Li
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
309
199
0
19 Mar 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM MLLM
641
2,210
0
21 Dec 2023
Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models
AAAI Conference on Artificial Intelligence (AAAI), 2023
Liqi He
Zuchao Li
Xiantao Cai
Ping Wang
LRM
194
34
0
14 Dec 2023
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
IEEE International Conference on Computer Vision (ICCV), 2023
H. Cao
Changcun Bao
Chaohu Liu
Huang-wei Chen
Kun Yin
Hao Liu
Yinsong Liu
Deqiang Jiang
Xing Sun
202
17
0
03 Sep 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM VLM ObjD
535
1,598
0
24 Aug 2023
MMBench: Is Your Multi-modal Model an All-around Player?
European Conference on Computer Vision (ECCV), 2023
Yuanzhan Liu
Haodong Duan
Yuanhan Zhang
Yue Liu
Songyang Zhang
...
Yuan Liu
Conghui He
Ziwei Liu
Kai-xiang Chen
Dahua Lin
713
1,664
0
12 Jul 2023
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Jiabo Ye
Anwen Hu
Haiyang Xu
Qinghao Ye
Mingshi Yan
...
Chenliang Li
Junfeng Tian
Qiang Qi
Ji Zhang
Feiyan Huang
VLM MLLM
236
156
0
04 Jul 2023
Fine-Grained Visual Prompting
Neural Information Processing Systems (NeurIPS), 2023
Lingfeng Yang
Yueze Wang
Xiang Li
Xinlong Wang
Jian Yang
ObjD VLM
245
98
0
07 Jun 2023
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering
Wenjin Wang
Yunhao Li
Yixin Ou
Yin Zhang
VLM
420
35
0
01 Jun 2023
Document Understanding Dataset and Evaluation (DUDE)
IEEE International Conference on Computer Vision (ICCV), 2023
Jordy Van Landeghem
Rubèn Pérez Tito
Łukasz Borchmann
Michal Pietruszka
Pawel Józiak
...
Bertrand Ackaert
Ernest Valveny
Matthew Blaschko
Sien Moens
Tomasz Stanislawek
VGen
302
111
0
15 May 2023
Structured Chain-of-Thought Prompting for Code Generation
ACM Transactions on Software Engineering and Methodology (TOSEM), 2023
Jia Li
Ge Li
Yongming Li
Zhi Jin
LRM
450
254
0
11 May 2023
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering
AAAI Conference on Artificial Intelligence (AAAI), 2023
Lei Wang
Yilang Hu
Jiabang He
Xingdong Xu
Ning Liu
Hui-juan Liu
Hengtao Shen
LRM MLLM
363
82
0
05 May 2023
GeoLayoutLM: Geometric Pre-training for Visual Information Extraction
Computer Vision and Pattern Recognition (CVPR), 2023
Chuwei Luo
Changxu Cheng
Qi Zheng
Cong Yao
267
62
0
21 Apr 2023
Progressive Visual Prompt Learning with Contrastive Feature Re-formation
International Journal of Computer Vision (IJCV), 2023
C. Xu
Yuhan Zhu
Haocheng Shen
Fengyuan Shi
Boheng Chen
Yixuan Liao
Xiaoxin Chen
Limin Wang
VLM
297
47
0
17 Apr 2023
What does CLIP know about a red circle? Visual prompt engineering for VLMs
IEEE International Conference on Computer Vision (ICCV), 2023
Aleksandar Shtedritski
Christian Rupprecht
Andrea Vedaldi
VLM MLLM
383
231
0
13 Apr 2023
Multimodal Chain-of-Thought Reasoning in Language Models
Zhuosheng Zhang
Aston Zhang
Mu Li
Hai Zhao
George Karypis
Alexander J. Smola
LRM
489
712
0
02 Feb 2023
VRDU: A Benchmark for Visually-rich Document Understanding
Knowledge Discovery and Data Mining (KDD), 2022
Zilong Wang
Yichao Zhou
Wei Wei
Chen-Yu Lee
Sandeep Tata
156
26
0
15 Nov 2022
Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding
International Journal on Document Analysis and Recognition (IJDAR), 2022
Chuwei Luo
Guozhi Tang
Qi Zheng
Cong Yao
Lianwen Jin
Chenliang Li
Yang Xue
Luo Si
266
22
0
27 Jun 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Neural Information Processing Systems (NeurIPS), 2022
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro LRM AI4CE ReLM
2.3K
14,735
0
28 Jan 2022
Document AI: Benchmarks, Models and Applications
Lei Cui
Yiheng Xu
Tengchao Lv
Furu Wei
VLM
245
93
0
16 Nov 2021
FeTaQA: Free-form Table Question Answering
Transactions of the Association for Computational Linguistics (TACL), 2021
Linyong Nan
Chia-Hsuan Hsieh
Ziming Mao
Xi Lin
Neha Verma
...
Isabel Trindade
Renusree Bandaru
Jacob Cunningham
Caiming Xiong
Dragomir R. Radev
LMTD
339
216
0
01 Apr 2021
ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2019
Zheng Huang
Kai Chen
Jianhua He
X. Bai
Dimosthenis Karatzas
Shijian Lu
C. V. Jawahar
208
381
0
18 Mar 2021
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
703
1,117
0
01 Jul 2020
LayoutLM: Pre-training of Text and Layout for Document Image Understanding
Knowledge Discovery and Data Mining (KDD), 2019
Yiheng Xu
Minghao Li
Lei Cui
Shaohan Huang
Furu Wei
Ming Zhou
445
886
0
31 Dec 2019
PubLayNet: largest dataset ever for document layout analysis
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2019
Xu Zhong
Jianbin Tang
Antonio Jimeno Yepes
209
552
0
16 Aug 2019
ICDAR 2019 Competition on Scene Text Visual Question Answering
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2019
Ali Furkan Biten
Rubèn Pérez Tito
Andrés Mafla
Lluís Gómez
Marçal Rusiñol
Minesh Mathew
C. V. Jawahar
Ernest Valveny
Dimosthenis Karatzas
239
83
0
30 Jun 2019
FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents
Guillaume Jaume
H. K. Ekenel
Jean-Philippe Thiran
498
453
0
27 May 2019