
Page Stream Segmentation with Convolutional Neural Nets Combining Textual and Visual Features

Abstract

When paper files are digitized via OCR, preserving the document context of each scanned image is a major requirement. Page stream segmentation (PSS) is the task of automatically separating a stream of scanned images into multi-page documents. This can be immensely helpful in the context of "digital mailrooms" or the retro-digitization of large paper archives. In a digitization project with a German federal archive, we developed a novel PSS approach based on convolutional neural networks (CNNs). Our approach combines image and text features to achieve optimal document separation results. Evaluation shows that our approach achieves accuracies of up to 93%, which can be regarded as a new state of the art for this task.
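To make the task concrete, the following is a minimal sketch of the final grouping step only. It assumes that some classifier has already produced one binary decision per page ("this page starts a new document" or not). That output format and the name `segment_page_stream` are assumptions made for illustration; the abstract does not specify them, and the sketch does not reproduce the CNN itself.

```python
from typing import List, Sequence


def segment_page_stream(starts_new_document: Sequence[bool]) -> List[List[int]]:
    """Group page indices into documents from per-page boundary decisions.

    starts_new_document[i] is True when page i is predicted to open a new
    document (hypothetical classifier output, assumed for this sketch).
    """
    documents: List[List[int]] = []
    for page_index, is_start in enumerate(starts_new_document):
        # The first page always opens a document, whatever the prediction says.
        if is_start or not documents:
            documents.append([page_index])
        else:
            documents[-1].append(page_index)
    return documents


# Six scanned pages, with predicted document starts at pages 0, 3 and 4.
print(segment_page_stream([True, False, False, True, True, False]))
```

With per-page decisions in this form, a single misclassified page either splits one document in two or merges two neighbouring documents, which is why per-page accuracy is a natural evaluation measure for this task.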
