This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA). In this challenging scenario, given an input question, the system has to gather evidence documents from a multilingual pool and generate from them an answer in the language of the question. We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation. For passage retrieval, we evaluated the monolingual BM25 ranker against an ensemble of re-rankers based on multilingual pretrained language models (PLMs), as well as variants of the shared task baseline that we re-trained from scratch using a recently introduced contrastive loss, which maintains a strong gradient signal throughout training by means of mixed negative samples. For answer generation, we focused on language and domain specialization by means of continued language model (LM) pretraining of existing multilingual encoders. Additionally, for both passage retrieval and answer generation, we augmented the training data provided by the task organizers with question-answer pairs automatically generated from Wikipedia passages to mitigate data scarcity, particularly for the low-resource languages for which no training data were provided. Our results show that language and domain specialization as well as data augmentation help, especially for low-resource languages.