Despite recent success in machine reading comprehension (MRC), learning high-quality MRC models still requires large-scale labeled training data, even with strong pre-trained language models (PLMs). Because the pre-training tasks of PLMs are neither question answering nor MRC, existing PLMs cannot be directly applied to unsupervised MRC: MRC aims to spot an accurate answer span in a given document, whereas PLMs focus on filling in masked tokens within sentences. In this paper, we propose a new framework for unsupervised MRC. Firstly, we learn to spot answer spans in documents via self-supervised learning, by designing a self-supervised pretext task for MRC, Spotting-MLM; solving this task requires capturing deep interactions between the sentences of a document. Secondly, we apply a simple sentence-rewriting strategy at inference time to alleviate the expression mismatch between questions and documents. Experiments show that our method achieves new state-of-the-art performance on unsupervised MRC.
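To make the pretext task concrete, the following is a minimal sketch of one plausible way to construct Spotting-MLM training pairs: a span that recurs in the document is masked in one sentence to form a pseudo-question, and the model is trained to spot another occurrence of that span elsewhere in the document. The function name `build_spotting_example`, the n-gram recurrence heuristic, and the `[MASK]` convention are illustrative assumptions, not the paper's exact procedure.

```python
import random
from typing import Optional, Tuple

MASK = "[MASK]"

def build_spotting_example(
    sentences: list[str], span_len: int = 2, seed: int = 0
) -> Optional[Tuple[str, str, Tuple[int, int]]]:
    """Hypothetical Spotting-MLM example builder (an assumption, not the
    paper's method): find a token n-gram occurring in two different
    sentences, mask it in one sentence (the pseudo-question), and label
    its other occurrence as the answer span to spot."""
    rng = random.Random(seed)
    tokenized = [s.split() for s in sentences]

    # Collect n-gram -> list of (sentence index, token offset) occurrences.
    occurrences: dict = {}
    for i, toks in enumerate(tokenized):
        for j in range(len(toks) - span_len + 1):
            occurrences.setdefault(tuple(toks[j:j + span_len]), []).append((i, j))

    # Keep only spans that recur in at least two *different* sentences.
    candidates = [
        (gram, locs) for gram, locs in occurrences.items()
        if len({i for i, _ in locs}) >= 2
    ]
    if not candidates:
        return None

    gram, locs = rng.choice(candidates)
    qi, qj = rng.choice(locs)
    # Pick an occurrence in a different sentence as the span to spot.
    ai, aj = rng.choice([loc for loc in locs if loc[0] != qi])

    # Pseudo-question: the source sentence with the span replaced by [MASK].
    q_toks = tokenized[qi][:qj] + [MASK] + tokenized[qi][qj + span_len:]
    pseudo_question = " ".join(q_toks)

    # Return the pseudo-question, the sentence containing the answer,
    # and the (sentence index, token offset) of the span to be spotted.
    return pseudo_question, sentences[ai], (ai, aj)

if __name__ == "__main__":
    doc = [
        "The Eiffel Tower was completed in 1889 .",
        "Tourists visit the Eiffel Tower every year .",
        "It remains the tallest structure in Paris .",
    ]
    print(build_spotting_example(doc))
```

Under this reading, the masked sentence plays the role of a question and the rest of the document plays the role of the passage, so the model must rely on cross-sentence interactions rather than local token filling; the inference-time sentence rewriting would then bring real questions closer to this declarative, masked-sentence format.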