Group-Sparse Matrix Factorization for Transfer Learning of Word Embeddings

International Conference on Machine Learning (ICML), 2021
Abstract

Unstructured text provides decision-makers with a rich data source in many domains, ranging from product reviews in retailing to nursing notes in healthcare. To leverage this information, words are typically translated into word embeddings -- vectors that encode the semantic relationships between words -- through unsupervised learning algorithms such as matrix factorization. However, learning word embeddings for a new domain with limited training data can be challenging, because a word's meaning or usage may differ in the new domain; e.g., the word "positive" typically has positive sentiment but often carries negative sentiment in medical notes, since it may indicate that a patient has tested positive for a disease. Intuitively, we expect that only a small number of domain-specific words take on new meanings or usages. We propose a two-stage estimator that exploits this structure via a group-sparse penalty to efficiently transfer-learn domain-specific word embeddings, combining large-scale text corpora (such as Wikipedia) with limited domain-specific text data. We bound the generalization error of our estimator, proving that it can achieve the same accuracy as learning without transfer while requiring substantially less domain-specific data when only a small number of embeddings change between domains. Our results provide the first bounds on group-sparse matrix factorization, which may be of independent interest. We empirically evaluate the effectiveness of our approach against state-of-the-art fine-tuning heuristics from natural language processing.
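The abstract does not include code, but the structure it describes is concrete enough to sketch. The following minimal illustration (our own, not the paper's released implementation) shows what the second stage of such a two-stage estimator could look like: the source embeddings `U_src` come from stage one (e.g., factorizing a large Wikipedia co-occurrence matrix), and a row-group-sparse correction is then fit to a symmetric co-occurrence matrix `M_domain` built from the small domain corpus. All names (`transfer_embeddings`, `group_soft_threshold`, `lam`, `step`) are hypothetical, and a plain proximal-gradient loop stands in for whatever optimizer the paper actually uses.

```python
import numpy as np

def group_soft_threshold(D, tau):
    """Row-wise group soft-thresholding: the prox of tau * sum_i ||D_i||_2.

    Rows whose l2 norm is below tau are zeroed out entirely, which is what
    makes the correction group-sparse at the word (row) level.
    """
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return D * scale

def transfer_embeddings(M_domain, U_src, lam=0.1, step=1e-3, n_iters=500):
    """Stage-two sketch (assumed objective, not the paper's exact estimator):

        minimize_Delta  || M_domain - (U_src + Delta)(U_src + Delta)^T ||_F^2
                        + lam * sum_i || Delta_i ||_2

    Only the few rows (words) whose usage shifts in the new domain should
    receive a nonzero update Delta_i; all other embeddings stay at U_src.
    """
    Delta = np.zeros_like(U_src)
    for _ in range(n_iters):
        U = U_src + Delta
        residual = U @ U.T - M_domain      # symmetric when M_domain is symmetric
        grad = 4.0 * residual @ U          # gradient of the factorization loss in Delta
        Delta = group_soft_threshold(Delta - step * grad, step * lam)
    return U_src + Delta, Delta
```

As a usage note under the same assumptions: `transfer_embeddings(M_domain, U_src)` returns the adapted embeddings together with the correction matrix, and the rows of `Delta` with nonzero norm flag exactly the words whose meaning is estimated to have shifted in the new domain (e.g., "positive" in medical notes).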
