We present Cross-lingual Open-Retrieval Answer Generation (CORA), the first unified many-to-many question answering (QA) model that can answer questions across many languages, even for ones without language-specific annotated data or knowledge sources. We introduce a new dense passage retrieval algorithm that is trained to retrieve documents across languages for a question. Combined with a multilingual autoregressive generation model, CORA answers directly in the target language without any translation or in-language retrieval modules as used in prior work. We propose an iterative training method that automatically extends annotated data available only in high-resource languages to low-resource ones. Our results show that CORA substantially outperforms the previous state of the art on multilingual open QA benchmarks across 26 languages, 9 of which are unseen during training. Our analyses show the significance of cross-lingual retrieval and generation in many languages, particularly under low-resource settings.
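The pipeline described above — a single dense retriever that scores passages across all languages, followed by a multilingual generator that answers directly in the question's language — can be sketched as follows. This is an illustrative toy, not the authors' implementation: `embed` is a deterministic stand-in for a trained multilingual encoder, and `answer` stands in for the autoregressive generation model; all function names here are hypothetical.

```python
import hashlib
import math
import random

def embed(text: str, dim: int = 16) -> list[float]:
    # Toy deterministic "embedding" so the sketch runs end to end.
    # In CORA this would be a trained multilingual dense encoder shared
    # across languages, not a hash-seeded random vector.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    # One dense index over passages in ALL languages: no translation step
    # and no per-language retriever, matching the abstract's description.
    q = embed(question)
    scored = sorted(
        passages,
        key=lambda p: -sum(a * b for a, b in zip(q, embed(p))),
    )
    return scored[:k]

def answer(question: str, passages: list[str]) -> str:
    # Stand-in for the multilingual autoregressive generator, which would
    # condition on the question plus retrieved cross-lingual evidence and
    # decode the answer directly in the target language.
    evidence = retrieve(question, passages)
    return f"[generated from {len(evidence)} retrieved passages]"
```

The key design point the sketch mirrors is that retrieval and generation share one multilingual model each, so questions in languages with no annotated data or in-language knowledge source can still be served from passages written in other languages.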