Multimodal Pivots for Image Caption Translation

Abstract

We present an approach to improving statistical machine translation of image descriptions by multimodal pivots defined in visual space. The key idea is to disambiguate and ground the translation of an image description by involving the image itself as a pivot in the translation process. We compute image similarity with a convolutional neural network and use the descriptions of the most similar pivot images for crosslingual reranking of translation outputs. Our approach does not depend on the availability of large amounts of in-domain parallel data and achieves improvements of 1 BLEU point over strong baselines.
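The retrieval-and-rerank idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes precomputed CNN feature vectors for the source image and for a pool of pivot images, and uses a simple word-overlap score between each translation hypothesis and the retrieved pivot captions (the function names `cosine_sim` and `rerank_translations` are hypothetical).

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two CNN feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank_translations(query_feat, pivot_feats, pivot_captions, hypotheses, k=3):
    """Pick the translation hypothesis best supported by visually similar pivots.

    query_feat     -- CNN feature vector of the image being captioned
    pivot_feats    -- feature vectors of the pivot image pool
    pivot_captions -- target-language captions of each pivot image (token lists)
    hypotheses     -- candidate translations to rerank (token lists)
    """
    # Retrieve the k pivot images most similar to the query in visual space.
    sims = [cosine_sim(query_feat, f) for f in pivot_feats]
    top = sorted(range(len(pivot_feats)), key=lambda i: sims[i], reverse=True)[:k]
    # Score each hypothesis by its word overlap with the retrieved captions;
    # the real system uses richer crosslingual reranking features.
    pivot_words = {w for i in top for w in pivot_captions[i]}
    def score(hyp):
        return sum(1 for w in hyp if w in pivot_words) / max(len(hyp), 1)
    return max(hypotheses, key=score)
```

For example, given a query image visually close to a pivot captioned "a dog runs", the hypothesis containing "dog" would be preferred over one containing "cat".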
