Open-retrieval question answering systems are generally trained and tested on large datasets in well-established domains. Low-resource settings such as new and emerging domains, however, would benefit especially from reliable question answering systems. Moreover, multilingual and cross-lingual resources in emergent domains are scarce, so few or no such systems exist. In this paper, we demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19. Our system uses a corpus of scientific articles to ensure that retrieved documents are reliable. To address the scarcity of cross-lingual training data in emergent domains, we present a method that uses automatic translation, alignment, and filtering to produce English-to-all datasets. We show that a deep semantic retriever benefits greatly from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting. We illustrate the capabilities of our system with examples and release all code necessary to train and deploy such a system.
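For readers unfamiliar with the BM25 baseline named above, the following is a minimal sketch of Okapi BM25 scoring. It is not the implementation used in this work: the function name `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query.

    Assumptions (illustrative, not taken from the paper): lowercase
    whitespace tokenization, k1=1.5, b=0.75.
    """
    if not docs:
        return []

    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    query_terms = query.lower().split()
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            df = doc_freq[term]
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Length normalization: longer documents are penalized via b.
            norm = tf + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        "covid-19 spreads through respiratory droplets",
        "masks reduce transmission of the virus",
        "vaccines train the immune system",
    ]
    print(bm25_scores("how does covid-19 spread", corpus))
    print(bm25_scores("wie wirken impfstoffe", corpus))
```

Because BM25 scores only exact term matches, a query in another language that shares no tokens with the English corpus scores zero on every document, as the second call shows. This is one plausible reason a lexical baseline struggles in the cross-lingual setting.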