Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question Answering Data

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

1 February 2021

Dian Yu

Kai Sun

Dong Yu

Claire Cardie


Abstract

In spite of much recent research in the area, it is still unclear whether subject-area question-answering data is useful for machine reading comprehension (MRC) tasks. In this paper, we investigate this question. We collect a large-scale multi-subject multiple-choice question-answering dataset, ExamQA, and use incomplete and noisy snippets returned by a web search engine as the relevant context for each question-answering instance to convert it into a weakly-labeled MRC instance. We then propose a self-teaching paradigm to better use the generated weakly-labeled MRC instances to improve a target MRC task. Experimental results show that we can obtain an improvement of 5.1% in accuracy on a multiple-choice MRC dataset, C^3, demonstrating the effectiveness of our framework and the usefulness of large-scale subject-area question-answering data for machine reading comprehension.
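
The conversion step described in the abstract, where a question-answering instance is paired with search snippets to form a weakly-labeled MRC instance, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class names, the `to_weak_mrc` helper, the deduplication step, and the `max_snippets` cutoff are all hypothetical, and the snippets are assumed to have already been retrieved from a search engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAInstance:
    """A multiple-choice question with no accompanying passage (hypothetical schema)."""
    question: str
    options: List[str]
    answer_index: int


@dataclass
class MRCInstance:
    """A reading-comprehension instance: context plus question, options, and answer."""
    context: str
    question: str
    options: List[str]
    answer_index: int
    weakly_labeled: bool = True  # the context is noisy, so the supervision is weak


def to_weak_mrc(qa: QAInstance, snippets: List[str], max_snippets: int = 5) -> MRCInstance:
    """Attach retrieved snippets to a QA instance as its (possibly noisy) context.

    Assumption: snippets arrive in ranked order; empty strings and exact
    duplicates are dropped before the first `max_snippets` are concatenated.
    """
    seen = set()
    kept = []
    for snippet in snippets:
        snippet = snippet.strip()
        if snippet and snippet not in seen:
            seen.add(snippet)
            kept.append(snippet)
    context = " ".join(kept[:max_snippets])
    return MRCInstance(context, qa.question, list(qa.options), qa.answer_index)
```

Because the snippets may be incomplete or irrelevant, the gold answer is not guaranteed to be supported by the context, which is why the resulting instance is flagged as weakly labeled.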
