In this paper, we propose a weakly supervised multilingual representation learning framework called cross-lingual self-training (XLST). XLST utilizes a small amount of annotated data from a high-resource language to improve representation learning on un-annotated multilingual data. Specifically, XLST uses a model trained with supervision to produce initial representations and another model to learn from them by maximizing the similarity between the output embeddings of the two models. Furthermore, a moving-average mechanism and multi-view data augmentation are employed, both of which are experimentally shown to be crucial to XLST. Comprehensive experiments have been conducted on the CommonVoice corpus to evaluate the effectiveness of XLST. Results on 5 downstream low-resource ASR tasks show that, by leveraging an additional 100 hours of annotated English data, our multilingual pretrained model achieves a relative 18.6% PER reduction over the state-of-the-art self-supervised method.
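To make the training loop described above concrete, the following is a minimal sketch of an XLST-style update step, assuming a teacher encoder initialized from a supervised model and a student trained on un-annotated multilingual audio. The module name (Encoder), the cosine-similarity objective, the noise-based augmentation, and all hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal XLST-style sketch (assumed details, not the paper's exact recipe):
# a frozen-per-step teacher provides target embeddings, the student matches
# them on an augmented view, and the teacher is updated by a moving average.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy frame-level encoder standing in for the acoustic model."""
    def __init__(self, feat_dim=80, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim)
        )

    def forward(self, x):            # x: (batch, frames, feat_dim)
        return self.net(x)           # -> (batch, frames, emb_dim)

teacher = Encoder()                  # assumed to come from supervised pretraining
student = copy.deepcopy(teacher)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
ema_decay = 0.999

def augment(x):
    # Stand-in for multi-view data augmentation (e.g. additive noise).
    return x + 0.01 * torch.randn_like(x)

def xlst_step(features):
    # Teacher produces target embeddings; no gradients flow through it.
    with torch.no_grad():
        targets = teacher(features)
    # Student sees an augmented view of the same utterance.
    preds = student(augment(features))
    # Maximize frame-wise cosine similarity between the two embeddings.
    loss = 1.0 - F.cosine_similarity(preds, targets, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Moving-average update of the teacher from the student.
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(ema_decay).add_(s_p, alpha=1.0 - ema_decay)
    return loss.item()

loss = xlst_step(torch.randn(4, 100, 80))   # dummy batch of filterbank-like features
print(f"training loss: {loss:.4f}")
```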