Table Region Detection on Large-scale PDF Files without Labeled Data

29 June 2015

Abstract

State-of-the-art approaches to table recognition compete on the 67 annotated government documents released by the ICDAR 2013 Table Competition. This paper instead contributes a novel paradigm for table region detection on large-scale unlabeled PDF files. We integrate the paradigm into our recently developed system, PdfExtra, and detect table regions using 9,466 academic articles drawn from the entire ACL Anthology repository, where almost all papers are archived as PDF files without annotation. The paradigm first adopts heuristics to automatically construct weakly labeled data. It then feeds diverse evidence, such as font styles, layouts, and even linguistic features extracted by Apache PDFBox and processed by the Stanford NLP toolkit, into different canonical classifiers. Finally, it uses these models to collaboratively vote on the boundaries of tables. Experimental results show that PdfExtra substantially outperforms state-of-the-art approaches. We also discuss how different features, learning models, and even domains may affect performance. Extensive evaluations demonstrate that our paradigm is flexible enough to leverage various features and learning models for cross-domain table region detection.
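
The final voting step described in the abstract can be illustrated with a minimal sketch. It assumes each classifier emits one 0/1 label per text line (1 meaning "inside a table"), that votes are combined by strict majority, and that a table region is a maximal run of consecutive lines voted 1. The abstract does not specify the voting rule or the granularity of predictions, so these choices, along with the hypothetical function names `majority_vote` and `table_regions`, are assumptions for illustration rather than the PdfExtra implementation.

```python
from typing import List, Sequence, Tuple


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-line 0/1 labels from several classifiers by strict majority.

    predictions[m][i] is the label that model m assigns to line i.
    Strict majority is an assumption; the source only says the models vote.
    """
    n_models = len(predictions)
    # zip(*predictions) groups the votes of all models for each line.
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


def table_regions(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) line-index ranges of consecutive 1 labels."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a new candidate table region opens here
        elif label != 1 and start is not None:
            regions.append((start, i - 1))  # the region closed on the previous line
            start = None
    if start is not None:
        regions.append((start, len(labels) - 1))  # region runs to the last line
    return regions


# Toy per-line predictions from three hypothetical classifiers over six lines.
preds = [
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0, 0],
]
voted = majority_vote(preds)
regions = table_regions(voted)
```

In this toy input, a single dissenting model on a line is outvoted, so an isolated false positive from one classifier does not open a spurious region.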
