This paper proposes a pre-training objective based on question answering (QA) for learning general-purpose contextual representations, motivated by the intuition that the representation of a phrase in a passage should encode all questions that the phrase can answer in context. We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model on 80 million synthesized QA pairs. By encoding QA-relevant information, the bi-encoder's token-level representations are useful for non-QA downstream tasks without extensive (or in some cases, any) fine-tuning. We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection on four datasets, few-shot named entity recognition on two datasets, and zero-shot sentiment analysis on three datasets.
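To make the training setup concrete, below is a minimal sketch of the bi-encoder QA objective described above, assuming PyTorch and Hugging Face Transformers. The model name ("roberta-large"), the start/end projection heads, the first-token question summary, and the KL-based matching loss are illustrative assumptions, not the paper's exact architecture or loss; the key property it demonstrates is that passages and questions are encoded independently while the student is trained to match a cross-encoder teacher's span predictions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "roberta-large"  # assumed backbone; the paper compares against RoBERTa-large
tokenizer = AutoTokenizer.from_pretrained(MODEL)
passage_encoder = AutoModel.from_pretrained(MODEL)   # sees only the passage
question_encoder = AutoModel.from_pretrained(MODEL)  # sees only the question
hidden = passage_encoder.config.hidden_size
start_head = nn.Linear(hidden, hidden)  # hypothetical projections for start/end scoring
end_head = nn.Linear(hidden, hidden)

def biencoder_span_logits(passage: str, question: str):
    """Score each passage token as an answer start/end using representations
    computed independently for the passage and the question (no cross-attention)."""
    p = tokenizer(passage, return_tensors="pt")
    q = tokenizer(question, return_tensors="pt")
    p_hidden = passage_encoder(**p).last_hidden_state        # (1, L, d) token-level reps
    q_vec = question_encoder(**q).last_hidden_state[:, 0]    # (1, d) first-token summary
    start_logits = torch.einsum("bld,bd->bl", p_hidden, start_head(q_vec))
    end_logits = torch.einsum("bld,bd->bl", p_hidden, end_head(q_vec))
    return start_logits, end_logits

def distillation_loss(student_logits, teacher_logits):
    """Match the bi-encoder's answer-span distributions to those of the more
    accurate cross-encoder teacher (teacher logits precomputed offline over
    the synthesized QA pairs)."""
    loss = 0.0
    for s, t in zip(student_logits, teacher_logits):
        loss = loss + F.kl_div(F.log_softmax(s, dim=-1),
                               F.softmax(t, dim=-1), reduction="batchmean")
    return loss
```

After training, the passage encoder's token-level representations can be reused directly (or with light fine-tuning) for the downstream tasks listed in the abstract, since they have been shaped to encode the questions each phrase can answer.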