Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled text yields significant performance improvements on a variety of cross-lingual and low-resource tasks. Trained on one hundred languages and terabytes of text, cross-lingual language models have proven effective at leveraging high-resource languages to enhance low-resource language processing, outperforming monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting, in which a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that automatically extracts domain-specific and domain-invariant features from the entangled pretrained cross-lingual representations, given only unlabeled raw text in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting. The source code of this paper is publicly available at https://github.com/lijuntaopku/UFD.
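To make the decomposition idea concrete, the following is a minimal sketch of how mutual information estimation can be used to split frozen cross-lingual representations into two parts. It assumes a MINE-style Donsker-Varadhan estimator, two small projection heads on pooled features from a frozen encoder such as XLM-R, and a simple min-max training loop; all module names, sizes, and the loss scheme are illustrative assumptions, not the released UFD implementation (see the repository linked above for the actual code).

```python
# Sketch only: decompose frozen cross-lingual features into domain-invariant
# and domain-specific parts using a MINE-style mutual information estimator.
# Names, dimensions, and the training scheme are illustrative assumptions.
import torch
import torch.nn as nn

class Head(nn.Module):
    """Projection head applied to frozen encoder representations."""
    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))
    def forward(self, h):
        return self.net(h)

class MINE(nn.Module):
    """Statistics network T(x, z) for the Donsker-Varadhan lower bound."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_bound(stat_net, x, z):
    """E[T(x, z)] - log E[exp(T(x, z'))], with z' drawn from the product of marginals."""
    joint = stat_net(x, z).mean()
    z_marginal = z[torch.randperm(z.size(0))]
    n = torch.tensor(float(z.size(0)))
    marginal = torch.logsumexp(stat_net(x, z_marginal), dim=0) - torch.log(n)
    return joint - marginal

inv_head, spec_head, mine = Head(), Head(), MINE()
opt_heads = torch.optim.Adam(list(inv_head.parameters()) +
                             list(spec_head.parameters()), lr=1e-4)
opt_mine = torch.optim.Adam(mine.parameters(), lr=1e-4)

# Stand-in for a real loader of pooled, frozen cross-lingual encoder features
# (e.g. XLM-R CLS/mean-pooled vectors over unlabeled source-language text).
loader = [torch.randn(32, 768) for _ in range(10)]

for h in loader:
    f_inv, f_spec = inv_head(h), spec_head(h)

    # 1) Train the estimator to tighten the MI lower bound between the parts.
    mi_est = dv_bound(mine, f_inv.detach(), f_spec.detach())
    opt_mine.zero_grad(); (-mi_est).backward(); opt_mine.step()

    # 2) Train the heads to reduce the estimated MI between the two parts,
    #    encouraging them to encode complementary (domain-invariant vs.
    #    domain-specific) information.
    mi_est = dv_bound(mine, f_inv, f_spec)
    opt_heads.zero_grad(); mi_est.backward(); opt_heads.step()
```

The min-max structure here (estimator maximizes the bound, heads minimize the estimate) is one standard way to use a neural MI estimator for feature decomposition; the actual UFD objectives may combine additional terms.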