Pretrained language models have significantly improved the performance of downstream language understanding tasks, including extractive question answering, by providing high-quality contextualized word embeddings. However, training question answering models still requires large amounts of annotated data for specific domains. In this work, we propose a cooperative self-training framework, RGX, that automatically generates non-trivial question-answer pairs to improve model performance. RGX is built upon a masked answer extraction task with an interactive learning environment consisting of an answer entity Recognizer, a question Generator, and an answer eXtractor. Given a passage with a masked entity, the generator produces a question about the entity, and the extractor is trained to recover the masked entity from the raw text given the generated question. The framework allows question generation and answering models to be trained on any text corpus without annotation. Experimental results show that RGX outperforms state-of-the-art (SOTA) pretrained language models and transfer learning approaches on standard question-answering benchmarks, and achieves new SOTA performance under the given model-size and transfer-learning settings.
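To make the cooperative loop concrete, the following is a minimal sketch of one RGX-style self-training round. The component callables (`recognizer`, `generator`, `extractor`) and the simple exact-match selection filter are assumptions for illustration; the paper's actual models, masking scheme, and pair-selection criterion may differ.

```python
# Hypothetical sketch of one RGX self-training round.
# recognizer, generator, and extractor are placeholder callables, not the
# paper's actual implementations.

from typing import Callable, List, Tuple


def rgx_round(
    passages: List[str],
    recognizer: Callable[[str], List[str]],        # passage -> candidate answer entities
    generator: Callable[[str], str],                # masked passage -> generated question
    extractor: Callable[[str, str], str],           # (question, passage) -> predicted answer span
) -> List[Tuple[str, str, str]]:
    """Generate synthetic QA pairs from raw passages and keep those the
    extractor answers consistently, to be used as extra training data."""
    synthetic_qa: List[Tuple[str, str, str]] = []   # (passage, question, answer)

    for passage in passages:
        for answer in recognizer(passage):
            # Mask the candidate answer entity in the passage.
            masked = passage.replace(answer, "[MASK]", 1)
            # The generator asks a question whose answer is the masked entity.
            question = generator(masked)
            # The extractor tries to recover the entity from the raw passage.
            predicted = extractor(question, passage)
            # Keep pairs where the extractor recovers the original entity
            # (a simple self-consistency filter; an assumption here).
            if predicted == answer:
                synthetic_qa.append((passage, question, answer))

    # The returned pairs would then be used to fine-tune the extractor
    # (and optionally the generator) in the next round.
    return synthetic_qa
```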