Coupled with the availability of large-scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performance of state-of-the-art multilingual models is significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for every language one wishes to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms baselines trained on English data only. We report new state-of-the-art results on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).