Vision Grid Transformer for Document Layout Analysis

29 August 2023
Cheng Da, Chuwei Luo, Qi Zheng, Cong Yao
Abstract

Document pre-trained models and grid-based models have proven to be very effective on various tasks in Document AI. However, for the document layout analysis (DLA) task, existing document pre-trained models, even those pre-trained in a multi-modal fashion, usually rely on either textual features or visual features. Grid-based models for DLA are multi-modal but largely neglect the effect of pre-training. To fully leverage multi-modal information and exploit pre-training techniques to learn better representations for DLA, in this paper we present VGT, a two-stream Vision Grid Transformer, in which a Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding. Furthermore, a new dataset named D⁴LA, which is so far the most diverse and detailed manually-annotated benchmark for document layout analysis, is curated and released. Experimental results show that the proposed VGT model achieves new state-of-the-art results on DLA tasks, e.g. PubLayNet (95.7% → 96.2%), DocBank (79.6% → 84.1%), and D⁴LA (67.7% → 68.8%). The code and models, as well as the D⁴LA dataset, will be made publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.
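To make the two-stream idea described in the abstract concrete, the sketch below shows one way a vision branch over image patches and a grid branch (in the spirit of GiT) over a 2D text-token grid could be encoded and fused into multi-modal features for layout detection. All module names, dimensions, and the concatenation-based fusion are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TwoStreamVGTSketch(nn.Module):
    """Hypothetical minimal sketch of a two-stream vision + grid transformer."""

    def __init__(self, embed_dim=256, num_heads=8, num_layers=4, vocab_size=30522):
        super().__init__()
        # Vision stream: embed non-overlapping 16x16 image patches, then encode.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True), num_layers
        )
        # Grid stream: embed text-token ids laid out on a 2D grid aligned to the patches.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.grid_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True), num_layers
        )
        # Simple fusion (assumption): concatenate both streams and project back.
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, image, token_grid):
        # image: (B, 3, H, W); token_grid: (B, H/16, W/16) token ids per patch cell.
        vis = self.patch_embed(image)                      # (B, D, H/16, W/16)
        vis = vis.flatten(2).transpose(1, 2)               # (B, N, D), N = (H/16)*(W/16)
        vis = self.vision_encoder(vis)
        grid = self.token_embed(token_grid.flatten(1))     # (B, N, D)
        grid = self.grid_encoder(grid)
        fused = self.fuse(torch.cat([vis, grid], dim=-1))  # (B, N, D) multi-modal features
        return fused                                       # would feed a detection head for DLA
```

In this sketch the fusion is a plain concatenation followed by a linear projection; the paper's actual fusion and the GiT pre-training objectives (2D token-level and segment-level understanding) are not reproduced here.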
