Sentence representations from vanilla BERT models do not work well on sentence similarity tasks. Sentence-BERT models specifically trained on STS or NLI datasets have been shown to provide state-of-the-art performance. However, building these models for low-resource languages is not straightforward due to the lack of such specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared via machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in building high-performance sentence-similarity models for Hindi and Marathi. Vanilla BERT models trained with this simple strategy outperform the multilingual LaBSE, which relies on a complex training strategy. The models are evaluated on downstream text classification and similarity tasks. Evaluation on real text classification datasets shows that embeddings obtained from synthetic-data training generalize to real datasets as well, making this an effective training strategy for low-resource languages. We also provide a comparative analysis of sentence embeddings from FastText models, multilingual BERT models (mBERT, IndicBERT, XLM-RoBERTa, MuRIL), multilingual sentence embedding models (LASER, LaBSE), and monolingual BERT models based on L3Cube-MahaBERT and HindBERT. We release L3Cube-MahaSBERT and HindSBERT, the state-of-the-art sentence-BERT models for Marathi and Hindi respectively. Our work also serves as a guide to building sentence embedding models for low-resource languages.
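The sentence-similarity evaluation described above scores sentence pairs by cosine similarity between fixed-size sentence embeddings, which sentence-BERT models typically obtain by mean-pooling token embeddings over non-padding positions. A minimal sketch of that pooling and scoring step, using plain NumPy and hypothetical toy vectors (not the paper's actual models or data):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # Average token vectors into one sentence vector,
    # ignoring padded positions (mask == 0).
    mask = attention_mask[:, :, None].astype(float)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = mask.sum(axis=1)
    return summed / counts

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy batch: 2 "sentences", 3 token positions, 4-dim embeddings.
# The second sentence has one padded position whose (garbage)
# embedding must be excluded by the mask.
tokens = np.array([
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]],
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

sentence_vectors = mean_pool(tokens, mask)
sim = cosine_similarity(sentence_vectors[0], sentence_vectors[1])
print(f"similarity: {sim:.4f}")
```

In practice the token embeddings come from the fine-tuned BERT encoder, and STSb fine-tuning regresses these cosine scores toward human similarity annotations; the pooling and scoring logic, however, is exactly this simple.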