Building machine learning prediction models for a specific NLP task requires sufficient training data, which can be difficult to obtain for less-resourced languages. Cross-lingual embeddings map word embeddings from a less-resourced language to a resource-rich language so that a prediction model trained on data from the resource-rich language can also be used in the less-resourced language. To produce cross-lingual mappings of recent contextual embeddings, anchor points between the embedding spaces have to be words in the same context. We address this issue with a novel method for creating cross-lingual contextual alignment datasets. Building on these datasets, we propose several cross-lingual mapping methods for ELMo embeddings. The proposed linear mapping methods apply the existing Vecmap and MUSE alignment methods to contextual ELMo embeddings. The novel nonlinear ELMoGAN mapping methods are based on GANs and do not assume isomorphic embedding spaces. We evaluate the proposed mapping methods on nine languages, using four downstream tasks: named entity recognition (NER), dependency parsing (DP), terminology alignment, and sentiment analysis. The ELMoGAN methods perform very well on the NER and terminology alignment tasks, with a lower cross-lingual loss for NER than direct training on some languages. In DP and sentiment analysis, the linear contextual alignment variants are more successful.
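To make the idea of a linear mapping fitted on anchor points concrete, the following is a minimal sketch of an orthogonal Procrustes alignment, a standard way to fit a linear map between two embedding spaces from paired vectors. It is not the pipeline proposed here: the function name `fit_orthogonal_map`, the synthetic random vectors standing in for contextual ELMo embeddings of anchor words, and the dimensions are all assumptions made for illustration.

```python
import numpy as np


def fit_orthogonal_map(src, tgt):
    """Return the orthogonal matrix W minimizing ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_anchors, dim); row i of each holds the
    embedding of the same anchor word in the two languages.
    """
    # Closed-form Procrustes solution: SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Synthetic stand-in for anchor embeddings (hypothetical data, not ELMo output).
rng = np.random.default_rng(0)
dim, n_anchors = 8, 200
src_anchors = rng.normal(size=(n_anchors, dim))

# Pretend the target space is a rotated, slightly noisy copy of the source space.
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
noise = 0.01 * rng.normal(size=(n_anchors, dim))
tgt_anchors = src_anchors @ true_rotation + noise

W = fit_orthogonal_map(src_anchors, tgt_anchors)
mapped = src_anchors @ W  # source vectors expressed in the target space
mean_error = np.abs(mapped - tgt_anchors).mean()
```

A map of this kind is a rotation or reflection of the source space, so it can only align the two spaces well if they are approximately isomorphic. That is the assumption the nonlinear, GAN-based variants are described as dropping.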