Table Region Detection on Large-scale PDF Files without Labeled Data

29 June 2015

Abstract

State-of-the-art approaches to table recognition compete on the 67 annotated government documents released by the ICDAR 2013 Table Competition. In contrast, this paper contributes a novel paradigm for table region detection on large-scale unlabeled PDF files. We integrate the paradigm into our recently developed system, PdfExtra, and use it to detect table regions in 9,466 academic articles from the entire ACL Anthology repository, where all papers are archived in PDF format without annotation. The paradigm first adopts heuristics to automatically construct weakly labeled data. It then feeds diverse evidence, such as font styles, layouts, and even linguistic features extracted by Apache PDFBox and processed by the OpenNLP Toolkit, into several canonical classifiers (Logistic Regression, Support Vector Machine, and Naive Bayes). Finally, it uses these models to vote on the boundaries of tables. Experimental results show that PdfExtra substantially outperforms state-of-the-art approaches. Moreover, we discuss how features, learning models, and even domains may affect performance. Extensive evaluations demonstrate that our paradigm is flexible enough to leverage various features and learning models for cross-domain table region detection.
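The final voting step can be illustrated with a minimal sketch. The abstract does not specify how votes are combined or how a boundary is derived, so this sketch assumes that each classifier emits one 0/1 label per text line (1 meaning "inside a table"), that votes are combined by simple majority, and that the table region is the longest contiguous run of positive lines. The function names and the sample predictions are hypothetical and are not taken from PdfExtra.

```python
from typing import List, Optional, Tuple


def majority_vote(predictions: List[List[int]]) -> List[int]:
    """Combine per-line 0/1 predictions from several classifiers by majority."""
    n_models = len(predictions)
    return [1 if sum(votes) * 2 > n_models else 0 for votes in zip(*predictions)]


def longest_table_span(labels: List[int]) -> Optional[Tuple[int, int]]:
    """Return (first_line, last_line) of the longest run of 1s, or None."""
    best = None
    start = None
    # A trailing 0 acts as a sentinel so a run ending at the last line is closed.
    for i, label in enumerate(labels + [0]):
        if label == 1 and start is None:
            start = i
        elif label != 1 and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


# Hypothetical per-line outputs of three classifiers on an 8-line page
# (assumed stand-ins for Logistic Regression, SVM, and Naive Bayes models).
lr_pred = [0, 0, 1, 1, 1, 0, 0, 0]
svm_pred = [0, 1, 1, 1, 1, 1, 0, 0]
nb_pred = [0, 0, 0, 1, 1, 1, 0, 1]

voted = majority_vote([lr_pred, svm_pred, nb_pred])
region = longest_table_span(voted)
print(voted, region)
```

With an odd number of classifiers a simple majority never ties. Taking the longest run is only one possible way to turn line labels into a boundary, and a page with several tables would need every run rather than just the longest.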
