Supervised mid-level features for word image representation
This paper deals with the problem of learning word image representations: given the image of a word, we are interested in finding a descriptive, robust, fixed-length representation. Machine learning techniques can then be used on top of these representations to produce models useful for word retrieval or recognition tasks. Although many works have focused on the machine learning aspect once a global representation has been produced, little work has been devoted to the construction of those base image representations: most works use standard coding and aggregation techniques directly on top of standard computer vision features such as SIFT or HOG. In this paper we propose to learn local mid-level features suitable for building text image representations. These features are learnt by leveraging character bounding-box annotations on a small subset of training images. However, contrary to other approaches that use character bounding-box information, our approach does not rely on detecting the individual characters explicitly at testing time. These local mid-level features can then be aggregated to produce a global word image signature. When pairing these features with the recent attributes framework of Almazan et al. [4], we obtain results comparable to or better than the state of the art on matching and recognition tasks with global descriptors of only 96 dimensions.
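To illustrate the aggregation step described above (local mid-level features pooled into a fixed-length global signature), here is a minimal sketch in Python with NumPy. The pooling choice (average pooling followed by L2 normalization) and the function name are assumptions for illustration; the paper's actual aggregation scheme may differ.

```python
import numpy as np

def aggregate_local_features(local_feats):
    """Pool a set of local mid-level descriptors into a single
    fixed-length global signature (average pooling, a common
    choice, assumed here) and L2-normalize the result."""
    g = local_feats.mean(axis=0)          # (D,) pooled signature
    norm = np.linalg.norm(g)
    return g / norm if norm > 0 else g

# Toy example: 50 local mid-level descriptors of dimension 96,
# matching the 96-dimensional global descriptors the abstract
# mentions; the features themselves are random placeholders.
rng = np.random.default_rng(0)
local = rng.random((50, 96))
global_desc = aggregate_local_features(local)
```

Whatever the pooling operator, the key property is that the output dimensionality is fixed regardless of how many local features the word image yields, so downstream matching and recognition models see same-size inputs.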