A key stumbling block for neural cross-language information retrieval (CLIR) systems has been the paucity of training data. The appearance of the MS MARCO monolingual training set led to significant advances in the state of the art in neural monolingual retrieval. By translating the MS MARCO documents into other languages using machine translation, this resource has been made useful to the CLIR community. Yet such translation suffers from a number of problems. While MS MARCO is a large resource, it is of fixed size; its genre and domain of discourse are fixed; and the translated documents are not written as a native speaker of the language would write them, but rather in translationese. To address these problems, we introduce the JH-POLO CLIR training set creation methodology. The approach begins by selecting a pair of non-English passages. A generative large language model is then used to produce an English query for which the first passage is relevant and the second passage is not relevant. By repeating this process, collections of arbitrary size can be created in the style of MS MARCO, but using naturally occurring documents in any desired genre and domain of discourse. This paper describes the methodology in detail, shows its use in creating new CLIR training sets, and describes experiments using the newly created training data.
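To make the passage-pair step concrete, the sketch below outlines one way a single JH-POLO-style training example might be generated. It is a minimal illustration, not the paper's actual implementation: the `llm` callable stands in for whatever generative large language model is used, and the prompt wording and the `TrainingTriple` container are assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TrainingTriple:
    """One MS MARCO-style CLIR training example: an English query paired with
    a relevant and a non-relevant non-English passage."""
    query: str      # English query produced by the generative LLM
    positive: str   # non-English passage the query should retrieve
    negative: str   # non-English passage the query should not retrieve

def build_triple(pos_passage: str, neg_passage: str,
                 llm: Callable[[str], str]) -> TrainingTriple:
    """Given a pair of non-English passages, ask a generative LLM for an
    English query answered by the first passage but not by the second.
    The prompt text here is a hypothetical example."""
    prompt = (
        "Write an English search query that is answered by Passage A "
        "but NOT by Passage B.\n\n"
        f"Passage A:\n{pos_passage}\n\n"
        f"Passage B:\n{neg_passage}\n\n"
        "Query:"
    )
    query = llm(prompt).strip()
    return TrainingTriple(query=query, positive=pos_passage, negative=neg_passage)
```

Repeating this loop over passage pairs drawn from any collection yields a training set of arbitrary size in the chosen genre and domain.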