LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

29 December 2020
Yang Xu
Yiheng Xu
Tengchao Lv
Lei Cui
Furu Wei
Guoxin Wang
Yijuan Lu
D. Florêncio
Cha Zhang
Wanxiang Che
Min Zhang
Lidong Zhou
Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks, owing to its effective model architecture and the availability of large-scale unlabeled scanned and digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture cross-modality interaction in the pre-training stage. It also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). We made our model and code publicly available at https://aka.ms/layoutlmv2.
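The abstract does not spell out how the spatial-aware self-attention is computed. The sketch below shows one common way to make attention aware of relative position, under stated assumptions: a bias that depends on the relative 1D token offset and on the relative 2D (x, y) offset between text blocks is added to the scaled dot-product score before the softmax. The function name `spatial_aware_attention`, the clipping of offsets to `max_dist`, the dictionary-based bias tables, and the integer grid coordinates are all hypothetical choices made for illustration; they are not taken from the released LayoutLMv2 code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def clip(offset, max_dist):
    """Clamp a relative offset to [-max_dist, max_dist] (an assumption)."""
    return max(-max_dist, min(max_dist, offset))


def spatial_aware_attention(queries, keys, values, boxes,
                            bias_1d, bias_x, bias_y, max_dist):
    """Single-head attention with additive relative-position biases.

    queries, keys, values: lists of equal-length float vectors, one per token.
    boxes[i]: hypothetical (x, y) grid coordinate of token i's text block.
    bias_1d, bias_x, bias_y: dicts mapping a clipped offset to a float;
        these stand in for parameters that a real model would learn.
    """
    n = len(queries)
    d = len(queries[0])
    outputs = []
    for i in range(n):
        scores = []
        for j in range(n):
            # Standard scaled dot-product score.
            score = sum(q * k for q, k in zip(queries[i], keys[j]))
            score /= math.sqrt(d)
            # Bias from the relative position in the token sequence.
            score += bias_1d[clip(j - i, max_dist)]
            # Biases from the relative horizontal and vertical layout offset.
            score += bias_x[clip(boxes[j][0] - boxes[i][0], max_dist)]
            score += bias_y[clip(boxes[j][1] - boxes[i][1], max_dist)]
            scores.append(score)
        weights = softmax(scores)
        dim_v = len(values[0])
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(dim_v)
        ])
    return outputs


# Toy usage: two tokens side by side on the same line.
zero_bias = {k: 0.0 for k in range(-2, 3)}
# Penalise attending to blocks that lie to the right of the query block.
right_penalty = {-2: 0.0, -1: 0.0, 0: 0.0, 1: -100.0, 2: -100.0}
toy_q = [[1.0, 0.0], [0.0, 1.0]]
toy_k = [[0.0, 0.0], [0.0, 0.0]]
toy_v = [[1.0, 0.0], [0.0, 1.0]]
toy_boxes = [(0, 0), (1, 0)]
plain = spatial_aware_attention(toy_q, toy_k, toy_v, toy_boxes,
                                zero_bias, zero_bias, zero_bias, 2)
biased = spatial_aware_attention(toy_q, toy_k, toy_v, toy_boxes,
                                 zero_bias, right_penalty, zero_bias, 2)
```

With all biases set to zero the function reduces to ordinary scaled dot-product attention. In the toy setup, `plain` mixes the two value vectors evenly, while `biased` makes the first token attend almost entirely to itself, because the only other block lies to its right.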
