Sentence representations from vanilla BERT models do not work well on sentence similarity tasks. Sentence-BERT models specifically trained on STS or NLI datasets have been shown to provide state-of-the-art performance. However, building these models for low-resource languages is not straightforward due to the lack of such specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared via machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in building high-performance sentence-similarity models for Hindi and Marathi. Vanilla BERT models trained with this simple strategy outperform the multilingual LaBSE, which relies on a complex training strategy. The models are evaluated on downstream text classification and similarity tasks. Evaluation on real text classification datasets shows that embeddings obtained from synthetic-data training generalize to real datasets as well, making this an effective training strategy for low-resource languages. We also provide a comparative analysis of sentence embeddings from FastText models, multilingual BERT models (mBERT, IndicBERT, XLM-RoBERTa, MuRIL), multilingual sentence embedding models (LASER, LaBSE), and monolingual BERT models based on L3Cube-MahaBERT and HindBERT. We release L3Cube-MahaSBERT and HindSBERT, the state-of-the-art sentence-BERT models for Marathi and Hindi respectively. Our work also serves as a guide to building sentence embedding models for low-resource languages.
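The sentence-similarity evaluation described above scores sentence pairs by cosine similarity between fixed-size sentence embeddings, which sentence-BERT models typically obtain by mean-pooling token embeddings over non-padding positions. A minimal sketch of that pooling and scoring step, using plain NumPy and hypothetical toy vectors (not the paper's actual models or data):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # Average token vectors into one sentence vector,
    # ignoring padded positions (mask == 0).
    mask = attention_mask[:, :, None].astype(float)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = mask.sum(axis=1)
    return summed / counts

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy batch: 2 "sentences", 3 token positions, 4-dim embeddings.
# The second sentence has one padded position whose (garbage)
# embedding must be excluded by the mask.
tokens = np.array([
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]],
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

sentence_vectors = mean_pool(tokens, mask)
sim = cosine_similarity(sentence_vectors[0], sentence_vectors[1])
print(f"similarity: {sim:.4f}")
```

In practice the token embeddings come from the fine-tuned BERT encoder, and STSb fine-tuning regresses these cosine scores toward human similarity annotations; the pooling and scoring logic, however, is exactly this simple.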