Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning

24 May 2025

Papers citing "Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning"

How Real Is AI Tutoring? Comparing Simulated and Human Dialogues in One-on-One Instruction

132

02 Sep 2025

DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding

201

10 Aug 2025

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

602

112

16 Mar 2025

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

...

OffRL AI4TS LRM ReLM VLM

1.2K

5,342

22 Jan 2025

Object-level Visual Prompts for Compositional Image Generation

197

03 Jan 2025

C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness

AAAI Conference on Artificial Intelligence (AAAI), 2024

472

114

16 Dec 2024

Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning

AAAI Conference on Artificial Intelligence (AAAI), 2024

297

14 Dec 2024

DOGR: Towards Versatile Visual Document Grounding and Referring

557

26 Nov 2024

MinerU: An Open-Source Solution for Precise Document Content Extraction

Bin Wang

Chao Xu

Xiaomeng Zhao

Linke Ouyang

Fan Wu

...

Wei Li

Botian Shi

Yu Qiao

Dahua Lin

Conghui He

192

133

27 Sep 2024

Attention Prompting on Image for Large Vision-Language Models

European Conference on Computer Vision (ECCV), 2024

Runpeng Yu

Weihao Yu

Xinchao Wang

VLM

393

25 Sep 2024

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li

Yuanhan Zhang

Dong Guo

Renrui Zhang

Feng Li

Hao Zhang

Kaichen Zhang

Yanwei Li

Ziwei Liu

Chunyuan Li

MLLM SyDa VLM

573

1,767

06 Aug 2024

Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

222

19 Jul 2024

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

...

Dahua Lin

Yu Qiao

Jifeng Dai

Wenhai Wang

MLLM VLM

530

994

25 Apr 2024

TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding

244

15 Apr 2024

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

379

08 Apr 2024

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Jiabo Ye

...

Ji Zhang

Qin Jin

Fei Huang

Jingren Zhou

VLM

309

199

19 Mar 2024

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Weijie Su

...

Ping Luo

Yu Qiao

641

2,210

21 Dec 2023

Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models

AAAI Conference on Artificial Intelligence (AAAI), 2023

194

14 Dec 2023

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

IEEE International Conference on Computer Vision (ICCV), 2023

202

03 Sep 2023

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai

Shuai Bai

Shusheng Yang

Shijie Wang

Sinan Tan

Peng Wang

Junyang Lin

Chang Zhou

Jingren Zhou

MLLM VLM ObjD

535

1,598

24 Aug 2023

MMBench: Is Your Multi-modal Model an All-around Player?

European Conference on Computer Vision (ECCV), 2023

...

Conghui He

Ziwei Liu

Kai-xiang Chen

Dahua Lin

713

1,664

12 Jul 2023

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Jiabo Ye

...

Ji Zhang

236

156

04 Jul 2023

Fine-Grained Visual Prompting

Neural Information Processing Systems (NeurIPS), 2023

Lingfeng Yang

Yueze Wang

Xiang Li

Xinlong Wang

Jian Yang

ObjD VLM

245

07 Jun 2023

Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering

420

01 Jun 2023

Document Understanding Dataset and Evaluation (DUDE)

IEEE International Conference on Computer Vision (ICCV), 2023

...

Matthew Blaschko

302

111

15 May 2023

Structured Chain-of-Thought Prompting for Code Generation

ACM Transactions on Software Engineering and Methodology (TOSEM), 2023

450

254

11 May 2023

T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering

AAAI Conference on Artificial Intelligence (AAAI), 2023

Lei Wang

363

05 May 2023

GeoLayoutLM: Geometric Pre-training for Visual Information Extraction

Computer Vision and Pattern Recognition (CVPR), 2023

267

21 Apr 2023

Progressive Visual Prompt Learning with Contrastive Feature Re-formation

International Journal of Computer Vision (IJCV), 2023

Yuhan Zhu

297

17 Apr 2023

What does CLIP know about a red circle? Visual prompt engineering for VLMs

IEEE International Conference on Computer Vision (ICCV), 2023

Aleksandar Shtedritski

Christian Rupprecht

Andrea Vedaldi

VLM MLLM

383

231

13 Apr 2023

Multimodal Chain-of-Thought Reasoning in Language Models

George Karypis

489

712

02 Feb 2023

VRDU: A Benchmark for Visually-rich Document Understanding

Knowledge Discovery and Data Mining (KDD), 2022

Chen-Yu Lee

156

15 Nov 2022

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

International Journal on Document Analysis and Recognition (IJDAR), 2022

266

27 Jun 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Neural Information Processing Systems (NeurIPS), 2022

2.3K

14,735

28 Jan 2022

Document AI: Benchmarks, Models and Applications

245

16 Nov 2021

FeTaQA: Free-form Table Question Answering

Transactions of the Association for Computational Linguistics (TACL), 2021

...

339

216

01 Apr 2021

ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2019

208

381

18 Mar 2021

DocVQA: A Dataset for VQA on Document Images

Minesh Mathew

Dimosthenis Karatzas

C. V. Jawahar

703

1,117

01 Jul 2020

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Knowledge Discovery and Data Mining (KDD), 2019

445

886

31 Dec 2019

PubLayNet: largest dataset ever for document layout analysis

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2019

Xu Zhong

Jianbin Tang

Antonio Jimeno Yepes

209

552

16 Aug 2019

ICDAR 2019 Competition on Scene Text Visual Question Answering

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2019

239

30 Jun 2019

FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents

Guillaume Jaume

H. K. Ekenel

Jean-Philippe Thiran

498

453

27 May 2019