AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing

16 September 2024
Huawei Ji
Cheng Deng
Bo Xue
Zhouyang Jin
Jiaxin Ding
Xiaoying Gan
Luoyi Fu
Xinbing Wang
Chenghu Zhou
Abstract

With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, one of the crucial data types, is predominantly stored in PDF format and must be parsed into text before further processing. However, parsing the diverse structured texts in academic literature remains challenging due to the lack of datasets that cover various text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity, demonstrating the potential of multimodal models in academic literature parsing. Our dataset is available at this https URL.
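
For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how an F1 score and a Jaccard similarity can be computed between a parsed output and its reference text. It assumes whitespace tokenization, a multiset overlap for F1, and a set overlap for Jaccard; the abstract does not specify the tokenization or exact definitions used in the paper's evaluation, so those details may differ. The function names `tokenize`, `token_f1`, and `jaccard` are illustrative, not taken from the paper.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: plain whitespace tokenization; the paper may tokenize differently.
    return text.split()


def token_f1(pred: str, ref: str) -> float:
    """Token-level F1 using multiset (bag-of-tokens) overlap."""
    pred_counts = Counter(tokenize(pred))
    ref_counts = Counter(tokenize(ref))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def jaccard(pred: str, ref: str) -> float:
    """Jaccard similarity over the sets of distinct tokens."""
    pred_set = set(tokenize(pred))
    ref_set = set(tokenize(ref))
    if not pred_set and not ref_set:
        return 1.0  # Convention chosen here: two empty texts count as identical.
    return len(pred_set & ref_set) / len(pred_set | ref_set)


if __name__ == "__main__":
    predicted = "x = a + b"
    reference = "x = a + c"
    print(f"F1: {token_f1(predicted, reference):.3f}")
    print(f"Jaccard: {jaccard(predicted, reference):.3f}")
```

In this toy pair, four of the five tokens agree, so precision and recall are both 4/5, while the Jaccard similarity is lower because the mismatched tokens from both sides enlarge the union.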

@article{ji2025_2409.10016,
  title={AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing},
  author={Huawei Ji and Cheng Deng and Bo Xue and Zhouyang Jin and Jiaxin Ding and Xiaoying Gan and Luoyi Fu and Xinbing Wang and Chenghu Zhou},
  journal={arXiv preprint arXiv:2409.10016},
  year={2025}
}