OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

10 December 2024
Linke Ouyang
Yuan Qu
Hongbin Zhou
Jiawei Zhu
Rui Zhang
Qunshu Lin
Bin Wang
Zhiyuan Zhao
Man Jiang
Xiaomeng Zhao
Jin Shi
Fan Wu
Pei Chu
Minghao Liu
Zhenxiang Li
Chao Xu
Bo Zhang
Botian Shi
Zhongying Tu
Conghui He
Abstract

Document content extraction is a critical task in computer vision, underpinning the data needs of large language models (LLMs) and retrieval-augmented generation (RAG) systems. Despite recent progress, current document parsing methods have not been evaluated fairly and comprehensively, owing to the narrow coverage of document types and the simplified, unrealistic evaluation procedures in existing benchmarks. To address these gaps, we introduce OmniDocBench, a novel benchmark featuring high-quality annotations across nine document sources, including academic papers, textbooks, and more challenging cases such as handwritten notes and densely typeset newspapers. OmniDocBench supports flexible, multi-level evaluations, ranging from end-to-end assessment to task-specific and attribute-based analysis using 19 layout categories and 15 attribute labels. We conduct a thorough evaluation of both pipeline-based methods and end-to-end vision-language models, revealing their strengths and weaknesses across different document types. OmniDocBench sets a new standard for fair, diverse, and fine-grained evaluation in document parsing. Dataset and code are available at this https URL.
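
As an illustration of what the attribute-based analysis described above could look like in practice, here is a minimal Python sketch. It is not the OmniDocBench evaluation code: the annotation schema (fields such as attributes, gt_text, pred_text), the input file name, and the character-level similarity metric are assumptions made purely for this example.

# Hypothetical sketch of attribute-level evaluation; the annotation schema
# (fields "attributes", "gt_text", "pred_text") is assumed for illustration
# and is NOT the actual OmniDocBench format or API.
import json
from collections import defaultdict
from difflib import SequenceMatcher


def text_similarity(gt: str, pred: str) -> float:
    # Simple character-level similarity in [0, 1]; a stand-in for a real
    # text-match metric such as normalized edit distance.
    return SequenceMatcher(None, gt, pred).ratio()


def evaluate_by_attribute(samples: list[dict]) -> dict[str, float]:
    # Average text similarity grouped by each block's attribute labels
    # (e.g. "handwritten", "dense_layout"), so weaknesses on specific
    # document types become visible.
    scores = defaultdict(list)
    for block in samples:
        for attr in block.get("attributes", []):
            scores[attr].append(text_similarity(block["gt_text"], block["pred_text"]))
    return {attr: sum(vals) / len(vals) for attr, vals in scores.items()}


if __name__ == "__main__":
    # "annotations_with_predictions.json" is a hypothetical file pairing
    # ground-truth blocks with a parser's predictions.
    with open("annotations_with_predictions.json") as f:
        samples = json.load(f)
    for attr, score in sorted(evaluate_by_attribute(samples).items()):
        print(f"{attr:<20} {score:.3f}")

The point of grouping scores by attribute rather than averaging over a whole page set is that it separates, for example, a model's performance on handwritten notes from its performance on cleanly typeset academic papers.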

@article{ouyang2025_2412.07626,
  title={OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations},
  author={Linke Ouyang and Yuan Qu and Hongbin Zhou and Jiawei Zhu and Rui Zhang and Qunshu Lin and Bin Wang and Zhiyuan Zhao and Man Jiang and Xiaomeng Zhao and Jin Shi and Fan Wu and Pei Chu and Minghao Liu and Zhenxiang Li and Chao Xu and Bo Zhang and Botian Shi and Zhongying Tu and Conghui He},
  journal={arXiv preprint arXiv:2412.07626},
  year={2025}
}