
MuSiQue: Multi-hop Questions via Single-hop Question Composition

Abstract

Can we create a question answering (QA) dataset that, by construction, requires proper multi-hop reasoning? This goal has been surprisingly elusive. We introduce a bottom-up approach that systematically selects composable pairs of single-hop questions that are connected, i.e., where one reasoning step requires information from the other. This bottom-up approach allows greater control over the properties of the resulting k-hop questions. We add stringent filters and other mechanisms targeting connected reasoning, including minimizing many forms of train-test leakage, improved distractor contexts, and contrasting unanswerable questions at the sub-question level. We use this process to construct MuSiQue-Ans, a new multi-hop QA dataset with 25K 2-4 hop questions, built using seed questions from 5 existing single-hop datasets. Our experiments demonstrate that MuSiQue-Ans is challenging for state-of-the-art QA models, significantly harder than existing datasets (3x human-machine gap in a comparable setting), and substantially less cheatable (e.g., a single-hop model is worse by 30 F1 pts). We also build a more challenging dataset, MuSiQue-Full, consisting of answerable and unanswerable contrast question pairs, where model performance drops further by 14 F1 pts.
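To make the notion of a "connected" pair concrete, the following is a minimal illustrative sketch, not the paper's actual pipeline. It assumes a deliberately simplified composability test: two single-hop questions are connected if the answer to the first is mentioned verbatim in the second. The names `SingleHop` and `compose`, and the example questions, are hypothetical and chosen only for illustration; the filters described above (leakage control, distractors, unanswerable contrasts) are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SingleHop:
    """A single-hop question with its gold answer (hypothetical structure)."""
    question: str
    answer: str


def compose(first: SingleHop, second: SingleHop) -> Optional[SingleHop]:
    """Compose a 2-hop question if the pair is connected.

    Simplifying assumption: the pair is connected when the answer to
    `first` appears verbatim in the text of `second`. Returns None when
    the pair is not composable under this test.
    """
    if first.answer not in second.question:
        return None  # second hop does not depend on the first
    if first.answer == second.answer:
        return None  # degenerate pair: the second hop adds nothing
    # Replace the bridge entity with a reference to the first question,
    # so answering the result requires resolving the first hop.
    bridged = second.question.replace(
        first.answer, f"[answer to: {first.question}]", 1
    )
    return SingleHop(question=bridged, answer=second.answer)


hop1 = SingleHop("Who wrote Hamlet?", "William Shakespeare")
hop2 = SingleHop("Where was William Shakespeare born?", "Stratford-upon-Avon")

two_hop = compose(hop1, hop2)
print(two_hop.question)
print(two_hop.answer)
```

In this sketch, reversing the order of the pair yields no composition, because the second question's answer is not mentioned in the first question; that asymmetry is what makes one step depend on the other.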
