BilBOWA: Fast Bilingual Distributed Representations without Word Alignments

International Conference on Machine Learning (ICML), 2015
Abstract

We introduce BilBOWA ("Bilingual Bag-of-Words without Alignments"), a simple and computationally efficient model for learning bilingual distributed representations of words that scales to large datasets and does not require word-aligned training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. We introduce a novel sampled bag-of-words cross-lingual objective and use it to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We evaluate our model on cross-lingual document classification and lexical translation on the WMT11 data. Our code will be made available as part of the open-source word2vec toolkit.
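The core of the cross-lingual objective described above can be sketched as follows: penalize the squared distance between the bag-of-words (mean) embeddings of an aligned sentence pair, so that translated sentences are pulled toward the same point in the shared space. This is a minimal illustrative sketch, not the authors' implementation; the toy vocabularies, embedding tables, and function names are assumptions for the example, and the monolingual noise-contrastive language-model terms that this objective regularizes are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabularies and embedding tables (illustration only).
dim = 8
vocab_en = {"the": 0, "cat": 1, "sat": 2}
vocab_fr = {"le": 0, "chat": 1, "assis": 2}
E_en = rng.normal(scale=0.1, size=(len(vocab_en), dim))
E_fr = rng.normal(scale=0.1, size=(len(vocab_fr), dim))

def bilbowa_loss_and_grads(sent_en, sent_fr):
    """Sampled bag-of-words cross-lingual objective (sketch):
    squared L2 distance between the mean word embeddings of an
    aligned sentence pair."""
    ids_en = [vocab_en[w] for w in sent_en]
    ids_fr = [vocab_fr[w] for w in sent_fr]
    diff = E_en[ids_en].mean(axis=0) - E_fr[ids_fr].mean(axis=0)
    loss = float(diff @ diff)
    # Gradient of the loss w.r.t. each word vector in the pair.
    g_en = 2.0 * diff / len(ids_en)
    g_fr = -2.0 * diff / len(ids_fr)
    return loss, ids_en, g_en, ids_fr, g_fr

def sgd_step(sent_en, sent_fr, lr=0.5):
    # One stochastic gradient step on the cross-lingual term alone.
    loss, ids_en, g_en, ids_fr, g_fr = bilbowa_loss_and_grads(sent_en, sent_fr)
    for i in ids_en:
        E_en[i] -= lr * g_en
    for i in ids_fr:
        E_fr[i] -= lr * g_fr
    return loss

pair = (["the", "cat", "sat"], ["le", "chat", "assis"])
losses = [sgd_step(*pair) for _ in range(50)]
print(losses[0] > losses[-1])  # cross-lingual loss decreases over updates
```

In the full model this term is optimized jointly with one monolingual skip-gram-style objective per language, so the embeddings stay predictive of their own contexts while being aligned across languages.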
