MultiMix: A Robust Data Augmentation Framework for Cross-Lingual NLP

Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Abstract

Transfer learning has yielded state-of-the-art (SoTA) results in many supervised natural language processing tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose MultiMix, a novel data augmentation framework for self-supervised learning in zero-resource transfer learning scenarios. In particular, MultiMix aims to solve cross-lingual adaptation problems from a source language distribution to an unknown target language distribution, assuming no training labels are available for the target language task. At its core, MultiMix performs simultaneous self-training with data augmentation and unsupervised sample selection. To show its effectiveness, we conduct extensive experiments on zero-resource cross-lingual transfer tasks for Named Entity Recognition and Natural Language Inference. MultiMix achieves SoTA results on both tasks, outperforming the baselines by a significant margin. Through an in-depth model dissection, we demonstrate the cumulative contributions of different components to MultiMix's success.
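The abstract describes the core loop only at a high level. A minimal toy sketch of what simultaneous self-training with data augmentation and confidence-based (unsupervised) sample selection might look like is given below. Everything here is an illustrative assumption, not the paper's implementation: the linear softmax classifier stands in for a fine-tuned multilingual encoder, the Gaussian-noise `augment` stands in for token-level augmentation, and the names `augment`, `train`, and `self_train` are hypothetical.

```python
# Hypothetical sketch of self-training + augmentation + confident-sample
# selection on synthetic data. Not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    """Toy augmentation: small feature noise (stand-in for word-level
    augmentation of unlabeled target-language text)."""
    return x + rng.normal(scale=0.05, size=x.shape)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, y, n_classes, epochs=200, lr=0.1):
    """Fit a linear softmax classifier by gradient descent
    (stand-in for fine-tuning the task model)."""
    W = np.zeros((X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        p = softmax(X @ W)
        W -= lr * X.T @ (p - onehot) / len(X)
    return W

def self_train(X_src, y_src, X_tgt, n_classes, rounds=3, threshold=0.9):
    """Repeatedly pseudo-label augmented target data, keep only
    high-confidence samples, and retrain on source + selected target."""
    X_train, y_train = X_src, y_src
    for _ in range(rounds):
        W = train(X_train, y_train, n_classes)
        X_aug = augment(X_tgt)
        probs = softmax(X_aug @ W)
        keep = probs.max(axis=1) >= threshold  # unsupervised selection
        pseudo = probs.argmax(axis=1)
        X_train = np.vstack([X_src, X_aug[keep]])
        y_train = np.concatenate([y_src, pseudo[keep]])
    return train(X_train, y_train, n_classes)

# Demo: labeled "source" data and a distribution-shifted unlabeled "target".
X_src = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y_src = np.array([0] * 100 + [1] * 100)
X_tgt = np.vstack([rng.normal(-0.5, 1, (100, 2)),
                   rng.normal(1.5, 1, (100, 2))])
W = self_train(X_src, y_src, X_tgt, n_classes=2)
```

The confidence threshold is the key knob in this sketch: it trades off how much pseudo-labeled target data enters training against how much label noise is admitted, which is the role unsupervised sample selection plays in the framework described above.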
