This work presents methods for learning cross-lingual sentence representations from paired or unpaired bilingual texts. We hypothesize that the cross-lingual alignment strategy is transferable, so a model trained to align only two languages can produce representations that are more aligned across many languages. Such transfer from bilingual to multilingual alignment is a dual-pivot transfer: from two pivot languages to other language pairs. To test this hypothesis, we train an unsupervised model on unpaired sentences and a single-pair supervised model on bitexts, both initialized from the unsupervised language model XLM-R. We evaluate the models as universal sentence encoders on unsupervised bitext mining over two datasets, where the unsupervised model achieves state-of-the-art unsupervised retrieval and the single-pair supervised model approaches the performance of multilingually supervised models. These results suggest that the proposed bilingual training techniques yield sentence representations with stronger multilingual alignment.