A Multilingual Modeling Method for Span-Extraction Reading Comprehension

Span-extraction reading comprehension models have advanced rapidly thanks to the availability of large-scale, high-quality training datasets. Despite this progress and their widespread application, extractive reading comprehension datasets in languages other than English remain scarce, and creating a sufficient amount of training data for each language is costly or even impossible. An alternative to building large-scale, high-quality monolingual span-extraction training datasets is to develop multilingual modeling approaches and systems that can transfer to the target language without requiring training data in that language. To address the scarcity of extractive reading comprehension training data in the target language, we propose a multilingual extractive reading comprehension approach called XLRC, which simultaneously models the existing extractive reading comprehension training data in a multilingual environment using self-adaptive attention and multilingual attention. Specifically, we first construct multilingual parallel corpora by translating an existing extractive reading comprehension dataset (CMRC 2018) from the target language (Chinese) into languages of different language families (here, English). Second, to enhance the final target representation, we adopt self-adaptive attention (SAA), which combines self-attention and inter-attention to extract the semantic relations between each pair of target and source languages. Furthermore, we propose multilingual attention (MLA) to learn rich knowledge from the various language families. Experimental results show that our model outperforms the state-of-the-art baseline (RoBERTa_Large) on the CMRC 2018 task, which demonstrates the effectiveness of the proposed multilingual modeling approach and shows its potential for multilingual NLP tasks.
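The abstract names SAA and MLA but gives no formulas for them, so the following is only a minimal sketch of how the two attention steps could fit together, not the authors' implementation. It assumes plain scaled dot-product attention, a fixed scalar `gate` that mixes self-attention with inter-attention, and an MLA step in which each target token attends over its fused vectors from every source language. The function names, the gate, and the toy vectors are all hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention: returns one output vector per query."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def self_adaptive_attention(target, source, gate=0.5):
    """Hypothetical SAA: blend target self-attention with target-to-source
    inter-attention. The scalar gate is an assumption, not from the paper."""
    self_att = attend(target, target, target)
    inter_att = attend(target, source, source)
    return [
        [gate * s + (1.0 - gate) * i for s, i in zip(s_row, i_row)]
        for s_row, i_row in zip(self_att, inter_att)
    ]


def multilingual_attention(target, per_language):
    """Hypothetical MLA: each target token attends over the fused vectors
    produced for that token by every source language."""
    fused = []
    for position, token in enumerate(target):
        candidates = [rep[position] for rep in per_language]
        fused.append(attend([token], candidates, candidates)[0])
    return fused


# Toy data: 2 target-language tokens and two translated passages, dimension 2.
target = [[1.0, 0.0], [0.0, 1.0]]
sources = {
    "en": [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]],
    "fr": [[0.0, 1.0], [1.0, 0.0]],
}

per_language = [self_adaptive_attention(target, src) for src in sources.values()]
final = multilingual_attention(target, per_language)
print(final)
```

In a trained model the token vectors would come from a pretrained encoder and the mixing weights would be learned rather than fixed. The sketch only shows the data flow: one fused representation per target-source pair, then a second attention step across languages.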
