A major obstacle to the widespread adoption of neural retrieval models is that they require large supervised training sets to surpass traditional term-based techniques, which are built directly from raw corpora. In this paper, we propose an approach to zero-shot learning for passage retrieval that uses synthetic question generation to close this gap. The question generation system is trained on general-domain data but is applied to documents in the target domain. This allows us to create arbitrarily large, albeit noisy, sets of domain-specific question-passage relevance pairs. Coupling this with a simple hybrid term-neural model further improves first-stage retrieval performance. Empirically, we show that this is an effective strategy for building neural passage retrieval models in the absence of large training corpora. Depending on the domain, this technique can even approach the accuracy of supervised models.
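The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names: pairing target-domain passages with generated questions, and combining a term-based score with a neural score. Everything here is an assumption made for illustration: the function names are hypothetical, `generate_question` and `neural_score` stand in for trained models, the token-overlap score is a toy substitute for a real term-based ranker, and the weighted sum is just one simple way to build a hybrid, not necessarily the one used in the paper.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple


def make_synthetic_pairs(
    passages: List[str],
    generate_question: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each target-domain passage with a question generated from it.

    `generate_question` is a placeholder for a question generator trained
    on general-domain data (hypothetical; not specified in the abstract).
    """
    return [(generate_question(p), p) for p in passages]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def term_score(query: str, passage: str) -> float:
    """Toy term-overlap score; a stand-in for a real term-based ranker."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return float(sum(min(count, p[token]) for token, count in q.items()))


def hybrid_score(term: float, neural: float, weight: float = 0.5) -> float:
    """Weighted sum of term and neural scores (assumed combination rule)."""
    return weight * term + neural


def rank(
    query: str,
    passages: List[str],
    neural_score: Callable[[str, str], float],
    weight: float = 0.5,
) -> List[str]:
    """Return passages sorted by descending hybrid score."""
    scored = [
        (hybrid_score(term_score(query, p), neural_score(query, p), weight), p)
        for p in passages
    ]
    return [p for _, p in sorted(scored, key=lambda sp: sp[0], reverse=True)]
```

In this sketch the synthetic pairs would serve as noisy training data for the neural scorer, while `weight` controls how much the term-based signal contributes at retrieval time; in practice it would need to be tuned.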