We present CORA, a Cross-lingual Open-Retrieval Answer Generation model that can answer questions across many languages even when language-specific annotated data or knowledge sources are unavailable. We introduce a new dense passage retrieval algorithm that is trained to retrieve documents across languages for a question. Combined with a multilingual autoregressive generation model, CORA answers directly in the target language without any translation or in-language retrieval modules as used in prior work. We propose an iterative training method that automatically extends annotated data available only in high-resource languages to low-resource ones. Our results show that CORA substantially outperforms the previous state of the art on multilingual open question answering benchmarks across 26 languages, 9 of which are unseen during training. Our analyses show the significance of cross-lingual retrieval and generation in many languages, particularly under low-resource settings.
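To make the retrieve-then-generate pipeline described in the abstract concrete, the following is a minimal sketch and not CORA's actual implementation. It assumes that questions and passages have already been embedded into a shared multilingual vector space, and that relevance is scored by inner product, as is common in dense passage retrieval. The names `Passage`, `retrieve`, `answer`, and `stub_generate` are hypothetical, the vectors are made-up toy values, and the generator is a placeholder standing in for a multilingual autoregressive model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Passage:
    text: str
    lang: str
    vector: Sequence[float]  # precomputed dense embedding (assumed, toy values)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product used as the relevance score."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(question_vec: Sequence[float],
             passages: List[Passage],
             k: int = 2) -> List[Passage]:
    """Rank passages from every language in one pool and keep the top k."""
    ranked = sorted(passages,
                    key=lambda p: dot(question_vec, p.vector),
                    reverse=True)
    return ranked[:k]


def answer(question: str,
           target_lang: str,
           question_vec: Sequence[float],
           passages: List[Passage],
           generate: Callable[[str, str, List[Passage]], str],
           k: int = 2) -> str:
    """Retrieve across languages, then generate directly in the target language."""
    top = retrieve(question_vec, passages, k)
    return generate(question, target_lang, top)


def stub_generate(question: str, target_lang: str, top: List[Passage]) -> str:
    # Placeholder for a multilingual generator; it only echoes the best passage.
    return f"[{target_lang}] based on: {top[0].text}"


# Toy multilingual collection; no passage is filtered out by language.
passages = [
    Passage("Tokyo is the capital of Japan.", "en", [0.9, 0.1, 0.0]),
    Passage("パリはフランスの首都です。", "ja", [0.1, 0.9, 0.0]),
    Passage("El Nilo es un río de África.", "es", [0.0, 0.1, 0.9]),
]

# Hypothetical embedding of a Japanese question about Japan's capital.
question_vec = [0.8, 0.2, 0.0]

top = retrieve(question_vec, passages, k=2)
result = answer("日本の首都はどこですか?", "ja", question_vec, passages, stub_generate)
print([p.lang for p in top])
print(result)
```

In this toy setup the Japanese question retrieves an English passage first, which illustrates why a single cross-lingual retriever removes the need for a translation step or for in-language retrieval. A real system would replace the fixed vectors with a trained multilingual encoder and the stub with a trained generation model.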