Massively multilingual sentence representation models, e.g., LASER, SBERT-distill, and LaBSE, significantly improve performance on cross-lingual downstream tasks. However, multiple training procedures, the use of large amounts of data, or inefficient model architectures make it computationally expensive to train a new model tailored to preferred languages and domains. To address this issue, we introduce efficient and effective massively multilingual sentence representation learning (EMS), which uses cross-lingual sentence reconstruction (XTR) and sentence-level contrastive learning as training objectives. Compared with related studies, the proposed model can be trained efficiently with significantly fewer parallel sentences and GPU computation resources, without relying on large-scale pre-trained models. Empirical results show that the proposed model yields significantly better or comparable results on bi-text mining, zero-shot cross-lingual genre classification, and sentiment classification. Ablation analyses demonstrate the effectiveness of each component of the proposed model. We release the code for model training and the EMS pre-trained model, which supports 62 languages (https://github.com/Mao-KU/EMS).
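To make the sentence-level contrastive objective concrete, the following is a minimal, generic InfoNCE-style sketch over a batch of parallel sentence embeddings; the function name, temperature value, and use of in-batch negatives are illustrative assumptions and do not reproduce the released EMS implementation or its XTR objective.

```python
import torch
import torch.nn.functional as F

def sentence_contrastive_loss(src_emb: torch.Tensor,
                              tgt_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss over a batch of parallel sentence embeddings.

    src_emb, tgt_emb: (batch, dim) embeddings of aligned source/target
    sentences; the i-th rows form a positive pair, and all other rows in
    the batch act as in-batch negatives.
    """
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    # (batch, batch) matrix of scaled cosine similarities.
    logits = src @ tgt.t() / temperature
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric cross-entropy: source-to-target and target-to-source retrieval.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```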