Accurate alignment between languages is fundamental for improving cross-lingual pre-trained language models (XLMs). Code-switching (CS), a natural phenomenon among multilingual speakers, has been used as an effective data augmentation method that offers language alignment at the word or phrase level, in contrast to the sentence-level alignment obtained from parallel instances. Existing approaches generate CS data by randomly switching words in a sentence, using either dictionaries or parallel sentences with word alignments. However, such methods can be suboptimal: dictionaries disregard semantics, and syntax may become invalid after random word switching. In this work, we propose EntityCS, a method that focuses on Entity-level Code-Switching to capture fine-grained cross-lingual semantics without corrupting syntax. We use Wikidata and English Wikipedia to construct an entity-centric CS corpus by switching entities to their counterparts in other languages. We further propose entity-oriented masking strategies during intermediate model training on the EntityCS corpus to improve entity prediction. Evaluation of the trained models on four entity-centric downstream tasks shows consistent improvements over the baseline, with a notable 10% increase in Fact Retrieval. We release the corpus and models to assist research on code-switching and enriching XLMs with external knowledge.
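To make the corpus construction concrete, below is a minimal sketch of entity-level code-switching, assuming entity mentions have already been linked to Wikidata IDs (e.g. via Wikipedia hyperlinks). The `wikidata_labels` lookup, the `code_switch_entities` function, and the example sentence are illustrative assumptions, not the released implementation.

```python
import random

# Hypothetical lookup: Wikidata QID -> {language code: entity label}
wikidata_labels = {
    "Q64": {"de": "Berlin", "fr": "Berlin", "el": "Βερολίνο"},
    "Q183": {"de": "Deutschland", "fr": "Allemagne", "el": "Γερμανία"},
}

def code_switch_entities(tokens, entity_spans, target_langs, switch_prob=0.5):
    """Replace entity mentions with their labels in a randomly chosen
    target language, leaving the rest of the sentence untouched so the
    surrounding (English) syntax is preserved.

    tokens: list of source tokens
    entity_spans: list of (start, end, qid) tuples, end exclusive
    """
    out, cursor = [], 0
    for start, end, qid in sorted(entity_spans):
        out.extend(tokens[cursor:start])          # copy non-entity text
        lang = random.choice(target_langs)
        label = wikidata_labels.get(qid, {}).get(lang)
        if label is not None and random.random() < switch_prob:
            out.extend(label.split())             # switched entity
        else:
            out.extend(tokens[start:end])         # keep original mention
        cursor = end
    out.extend(tokens[cursor:])
    return out

tokens = "The capital of Germany is Berlin .".split()
spans = [(3, 4, "Q183"), (5, 6, "Q64")]
print(" ".join(code_switch_entities(tokens, spans, ["de", "el"])))
# e.g. "The capital of Deutschland is Βερολίνο ."
```

Because only the entity span is replaced, word- and phrase-level alignment is obtained for free from the Wikidata cross-lingual labels while the host sentence remains syntactically valid, which is the contrast with random word switching drawn above.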
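The entity-oriented masking idea can likewise be sketched as masking whole entity spans at a higher rate than ordinary tokens during masked language modelling. The function name, mask token, and rates below are assumptions for illustration, standing in for one of the possible strategies rather than the paper's exact configuration.

```python
import random

MASK = "[MASK]"

def entity_oriented_mask(tokens, entity_spans, p_entity=0.5, p_token=0.15):
    """Mask entire (code-switched) entity spans with probability p_entity,
    and remaining tokens with the usual per-token rate p_token, so the
    model is pushed to predict entities from cross-lingual context."""
    masked = list(tokens)
    inside_entity = set()
    for start, end, _ in entity_spans:
        inside_entity.update(range(start, end))
        if random.random() < p_entity:
            for i in range(start, end):          # mask the whole entity span
                masked[i] = MASK
    for i in range(len(tokens)):
        if i not in inside_entity and random.random() < p_token:
            masked[i] = MASK                     # standard token masking
    return masked
```

Masking an entity as a unit, instead of independent subwords, is what ties the intermediate training objective to entity prediction and hence to the entity-centric downstream tasks evaluated above.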